15

How to calculate Levenshtein Distance matrix of strings in Python ?

              str1    str2    str3    str4    ...     strn
      str1    0.8     0.4     0.6     0.1     ...     0.2
      str2    0.4     0.7     0.5     0.1     ...     0.1
      str3    0.6     0.5     0.6     0.1     ...     0.1
      str4    0.1     0.1     0.1     0.5     ...     0.6
      .       .       .       .       .       ...     .
      .       .       .       .       .       ...     .
      .       .       .       .       .       ...     .
      strn    0.2     0.1     0.1     0.6     ...     0.7

Using the distance function, we can calculate distance between 2 words. In my case, I have one list containing N number of strings. The desired result is to calculate the distance matrix and after that,do the clustering of words.

Amira Bedhiafi
  • 8,088
  • 6
  • 24
  • 60
Ajay Jadhav
  • 161
  • 1
  • 1
  • 5
  • Use NLTK `metrics` and [this](http://stackoverflow.com/questions/13166089/string-comparison-in-python-but-not-levenshtein-distance-i-think) post might be helpful to you – Niranj Rajasekaran May 25 '16 at 06:14
  • refer this https://rosettacode.org/wiki/Levenshtein_distance#Python – Tanu May 25 '16 at 06:20
  • @Tanu Its giving distance between 2 words. I want matrices for n number of words – Ajay Jadhav May 25 '16 at 06:30
  • 1
    @AjayJadhav at any point of time you will be calculating distance between two words , so you can iterate over matrix and calculate distance for each set of two words at a time and populate a new matrix – Tanu May 25 '16 at 06:38
  • @Tanu I wrote down Code for that. Thanks @ Tanu & @ Niranj Rajasekaran – Ajay Jadhav May 25 '16 at 08:01
  • Hey Ajay Jadhav and @Tanu could you share you code, how you build up that matrix. I need to build the same – default_settings Apr 29 '22 at 16:22
  • 1
    Does this answer your question? [String Distance Matrix in Python using pdist](https://stackoverflow.com/questions/46452724/string-distance-matrix-in-python-using-pdist) – evces Dec 21 '22 at 02:42

3 Answers3

7

Just use the pdist version that accepts a custom metric.

Y = pdist(X, levensthein)

and for the levensthein then you can use the implementation of rosettacode as suggested by Tanu

If you want a full squared matrix just use squareform on the result:

Y = scipy.spatial.distance.squareform(Y)
rafaelvalle
  • 6,683
  • 3
  • 34
  • 36
elabard
  • 455
  • 1
  • 5
  • 16
  • 1
    No need to write the algorithm, there are several PyPI packages, that implement it, e.g. `editdistance`, `pylev`. – Eli Korvigo May 25 '16 at 07:05
  • @elabard Pylev works for 2 words but my question is how to compue matrix pylev.levenshtein('kitten', 'sitting') 3 – Ajay Jadhav May 25 '16 at 07:59
  • 1
    is it not exactly what I have suggested? `pdist` returns a matrix by applying `levensthein` or whatever metrics you want to each pair of elements... – elabard May 25 '16 at 08:36
  • Upon passing a list of strings, `pdist` is saying "A 2-dimensional array must be passed." –  Oct 05 '16 at 18:40
  • 1
    just reshape your input with `.reshape(-1,1)` – elabard Oct 06 '16 at 14:04
  • what if you have a list of uneven strings? like this ids 0 [462423-43, 277581-25, 58545-19] 1 [0] 2 [437742-46, 228893-32, 463200-04, 227479-78, 222217-39, 462579-94, 458759-98, 438589-72, 438675-76, 265589-83, 178215-13, 433701-46, 431222-77, 433515-16] 3 [431380-63] – nerdlyfe Jul 26 '21 at 06:17
  • Nice answer, but pdist is not efficient enought for large matrix. I am looking for multiprocessing version of pdist, but I can't find it. Something like pairwise_distance, but pairwise_distance is not accepting strings. – Tedo Vrbanec Aug 25 '22 at 08:16
  • Rosetta Code implementation is a terrible choice, and list comprehensions can be ~5x faster than pdist(). See my complete answer here: https://stackoverflow.com/a/74870785/20521021 – evces Dec 21 '22 at 02:40
5

Here is my code

import pandas as pd
from Levenshtein import distance
import numpy as np

Target = ['Tree','Trip','Treasure','Nothingtodo']

List1 = Target
List2 = Target

Matrix = np.zeros((len(List1),len(List2)),dtype=np.int)

for i in range(0,len(List1)):
  for j in range(0,len(List2)):
      Matrix[i,j] = distance(List1[i],List2[j])

print Matrix

[[ 0  2  4 11]
 [ 2  0  6 10]
 [ 4  6  0 11]
 [11 10 11  0]]
pratiksha
  • 51
  • 2
  • as I suggested in my answer you don't have to do the nested for by hand...pdist does it for you and in a more efficient way since it computes only the upper triangular distances... (a distance is always symmetric) – elabard May 27 '16 at 12:12
0

You could do something like this

from Levenshtein import distance
import numpy as np
from time import time

def get_distance_matrix(str_list):
    """ Construct a levenshtein distance matrix for a list of strings"""
    dist_matrix = np.zeros(shape=(len(str_list), len(str_list)))
    t0 = time()
    print "Starting to build distance matrix. This will iterate from 0 till ", len(str_list) 
    for i in range(0, len(str_list)):
        print i
        for j in range(i+1, len(str_list)):
                dist_matrix[i][j] = distance(str_list[i], str_list[j]) 
    for i in range(0, len(str_list)):
        for j in range(0, len(str_list)):
            if i == j:
                dist_matrix[i][j] = 0 
            elif i > j:
                dist_matrix[i][j] = dist_matrix[j][i]
    t1 = time()
    print "took", (t1-t0), "seconds"
    return dist_matrix

str_list = ["analyze", "analyse", "analysis", "analyst"]
get_distance_matrix(str_list)

Starting to build distance matrix. This will iterate from 0 till  4
0
1
2
3
took 0.000197887420654 seconds
>>> array([[ 0.,  1.,  3.,  2.],
   [ 1.,  0.,  2.,  1.],
   [ 3.,  2.,  0.,  2.],
   [ 2.,  1.,  2.,  0.]])