How can I sort a string list based on edit distance efficiently?

Question

I'm trying to sort a list by edit distance using the levenshtein distance.

def suggest(dic, word, distance, maxSugestions=5):
   list = []
   for i in range(1, 200):
       for word1 in sorted(dic):
           if distance(word1, word) == i:
               list.append(word1)
               if len(list) == maxSugestions:
                   return list

This is my current function that receives a list of string (this list has about 43 thousand strings), a word that I want to compare, a function that returns the edit distance between two strings and an integer of maxSugestions that the list should have.

This is the current distance funtion:

def levDistance(str1, str2):
   matrix = [[0 for x in range(len(str2) + 1)] for x in range(len(str1) + 1)]

   for i in range(len(str1) + 1):
       for j in range(len(str2) + 1):
           if i == 0:
               matrix[i][j] = j
           elif j == 0:
               matrix[i][j] = i
           elif str1[i-1] == str2[j-1]:
               matrix[i][j] = matrix[i-1][j-1]
           else:
               matrix[i][j] = 1 + min(matrix[i][j-1], matrix[i-1][j], matrix[i-1][j-1])    

   return matrix[len(str1)][len(str2)]

The current suggest() function works, however I need it to be optimized since it is taking too long and I don't know how I should do that. Any help is grateful. Thank you

if you calculate the same distance many times then better first create dictionary with all distances `{(word1,word2):distance, ... }` — furas, Dec 10 '20 at 11:09
in current code you run `sorted(dic)` inside `for i in range(1, 200)` so you repeate the same sorting 200 times - you could sort it only once before `for`-loop — furas, Dec 10 '20 at 11:12
when you will have dictionary `{(word1,word2):distance, ... }` then you could use function `sorted(..., key=...)` with `key` which uses `distance to sort it — furas, Dec 10 '20 at 11:26

score 1 · Answer 1 · answered Dec 10 '20 at 11:43

I tried this technique I hope this will work for you

def edit_distance(word, string_to_take_distance_with = "someString"):
    '''
    Description:
        give you the edit distance between 2 words
        word                            : String 1 (dynamic) 
        string_to_take_distance_with    : String 2 (static)
        
    '''
    
    
    length_of_string  = len(word)+1
    length_of_string2 = len(string_to_take_distance_with)+1

    tbl = {}
    for i in range(length_of_string): tbl[i,0]=i
    for j in range(length_of_string2): tbl[0,j]=j
    for i in range(1, length_of_string):
        for j in range(1, length_of_string2):
            cost = 0 if word[i-1] == string_to_take_distance_with[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]




sorted(["hello","helo","aen"], key=edit_distance)

Hamza · Accepted Answer · 2020-12-10T12:01:49.123

You are calculating same distances over each iteration which is a big no no. Instead try to calculate only once and then have number of suggestions as defined by maxSuggestion:

def suggest(dic, word, distance, maxSugestions=5):
   return [i[1] for i in sorted([(distance(word1, word), word1) for word1 in dic])[:maxSuggestion]]

Then comes your implementation! If you still want this to be even faster, better would be to use editdistance library. (Or any other C-based implementation if it matters) instead of a python based implementation. For me it went 20x faster than python implementation. From the original answer:

I used a c-based implementation of calculating levenshtein distance making use of : editdistance library. On research I found that many such tasks have C-based implementations like matrix-multiplication and search algorithms etc. are readily available. Besides you can always write a module in C and make use of it in python. editdistance.eval('banana', 'bahama') took only 1.71 µs ± 289 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each) in comparison with my defined function levenshteinDistance('banana', 'bahama') which took 34.4 µs ± 4.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each) That’s a 20x speedup.

P.S. I am not author of the package and found the package using a nominal google search of 'C based implementation of levenshtein distance'

How can I sort a string list based on edit distance efficiently?

2 Answers2