Edit Distance in Python

Question

I'm programming a spellcheck program in Python. I have a list of valid words (the dictionary) and I need to output a list of words from this dictionary that have an edit distance of 2 from a given invalid word.

I know I need to start by generating a list with an edit distance of one from the invalid word(and then run that again on all the generated words). I have three methods, inserts(...), deletions(...) and changes(...) that should output a list of words with an edit distance of 1, where inserts outputs all valid words with one more letter than the given word, deletions outputs all valid words with one less letter, and changes outputs all valid words with one different letter.

I've checked a bunch of places but I can't seem to find an algorithm that describes this process. All the ideas I've come up with involve looping through the dictionary list multiple times, which would be extremely time consuming. If anyone could offer some insight, I'd be extremely grateful.

You might want to look at Peter Norvig's spell checker (http://norvig.com/spell-correct.html) and modify it to suit your needs. — Richard Nienaber, Mar 17 '10 at 06:08
Figure 2.15 in [this](https://web.stanford.edu/~jurafsky/slp3/2.pdf) book shows pseudo-code for Levenshtein distance. The NLTK Python package has a [function](https://www.nltk.org/api/nltk.metrics.html#nltk.metrics.distance.edit_distance) for computing this. — Renel Chesak, Jun 28 '18 at 14:00
Have you seen https://www.geeksforgeeks.org/edit-distance-dp-5/ ? — pedram bashiri, May 21 '19 at 17:02

score 78 · Answer 1 · edited Nov 26 '20 at 00:29

78

The thing you are looking at is called an edit distance and here is a nice explanation on wiki. There are a lot of ways how to define a distance between the two words and the one that you want is called Levenshtein distance and here is a DP (dynamic programming) implementation in python.

def levenshteinDistance(s1, s2):
    if len(s1) > len(s2):
        s1, s2 = s2, s1

    distances = range(len(s1) + 1)
    for i2, c2 in enumerate(s2):
        distances_ = [i2+1]
        for i1, c1 in enumerate(s1):
            if c1 == c2:
                distances_.append(distances[i1])
            else:
                distances_.append(1 + min((distances[i1], distances[i1 + 1], distances_[-1])))
        distances = distances_
    return distances[-1]

And a couple of more implementations are here.

edited Nov 26 '20 at 00:29

0x90

39,472
36
165
245

answered Sep 14 '15 at 06:52

Salvador Dali

214,103
147
703
753

7

DP standing for dynamic programming. – Nikana Reklawyks Oct 19 '17 at 06:20
@Salvador Dali Shouldn't adjacent transposition return a distance of 1? The above function gives `levenshteinDistance("abc","bac")` --> 2 – alancalvitti Jan 22 '19 at 21:53
1

@alancalvitti The only operations are insert, delete and substitute. So in you example you need to delete the `a` and then reinsert it between the `b` and `c`. Two operations. – Simd Apr 17 '19 at 13:06
Hello :), do you have by chance a version of it (or of another algorithm) where you can specify an upper bound at the levenshtein distance? (You may check also my post for more details: https://stackoverflow.com/questions/59686989/levenshtein-distance-with-bound) – Outcast Feb 06 '20 at 16:37

ryanjdillon · Answer 2 · 2021-08-23T09:03:40.897

30

difflib in the standard library has various utilities for sequence matching, including the get_close_matches method that you could use. It uses an algorithm adapted from Ratcliff and Obershelp.

From the docs

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']

edited Aug 23 '21 at 09:03

answered Apr 29 '19 at 15:02

ryanjdillon

17,658
9
85
110

...except that this is not the edit/Levenshtein distance. – MERose Sep 14 '21 at 10:53
6

@MERose My answer is aimed at giving a solution to the problem that the question presents for other looking to solve the same problem-- and not to answer it per what appears to be a homework assignment's specifications. – ryanjdillon Sep 14 '21 at 11:32
2

This is amazing! – dantiston Nov 10 '21 at 21:03

score 8 · Answer 3 · edited Sep 15 '15 at 05:14

#this calculates edit distance not levenstein edit distance
word1="rice"

word2="ice"

len_1=len(word1)

len_2=len(word2)

x =[[0]*(len_2+1) for _ in range(len_1+1)]#the matrix whose last element ->edit distance

for i in range(0,len_1+1): #initialization of base case values

    x[i][0]=i
for j in range(0,len_2+1):

    x[0][j]=j
for i in range (1,len_1+1):

    for j in range(1,len_2+1):

        if word1[i-1]==word2[j-1]:
            x[i][j] = x[i-1][j-1] 

        else :
            x[i][j]= min(x[i][j-1],x[i-1][j],x[i-1][j-1])+1

print x[i][j]

score 8 · Answer 4 · answered Jun 11 '14 at 20:56

8

Here is my version for Levenshtein distance

def edit_distance(s1, s2):
    m=len(s1)+1
    n=len(s2)+1

    tbl = {}
    for i in range(m): tbl[i,0]=i
    for j in range(n): tbl[0,j]=j
    for i in range(1, m):
        for j in range(1, n):
            cost = 0 if s1[i-1] == s2[j-1] else 1
            tbl[i,j] = min(tbl[i, j-1]+1, tbl[i-1, j]+1, tbl[i-1, j-1]+cost)

    return tbl[i,j]

print(edit_distance("Helloworld", "HalloWorld"))

answered Jun 11 '14 at 20:56

Santosh

328
5
9

Your code is incorrect. Logic is unsound and `return tbl[i,j]` should be `return tbl[n,m]`. – Artur Krajewski Dec 14 '18 at 21:19

krassowski · Answer 5 · 2021-04-21T18:51:21.903

Using the SequenceMatcher from Python built-in difflib is another way of doing it, but (as correctly pointed out in the comments), the result does not match the definition of an edit distance exactly. Bonus: it supports ignoring "junk" parts (e.g. spaces or punctuation).

from difflib import SequenceMatcher

a = 'kitten'
b = 'sitting'

required_edits = [
    code
    for code in (
        SequenceMatcher(a=a, b=b, autojunk=False)
        .get_opcodes()
    )
    if code[0] != 'equal'
]
required_edits
# [
#    # (tag, i1, i2, j1, j2)
#    ('replace', 0, 1, 0, 1), # replace a[0:1]="k" with b[0:1]="s"
#    ('replace', 4, 5, 4, 5), # replace a[4:5]="e" with b[4:5]="i"
#    ('insert', 6, 6, 6, 7),  # insert b[6:7]="g" after a[6:6]="n"
# ]


# the edit distance:
len(required_edits)  # == 3

Running this for `'123'` and `'12345'` produces a `required_edits` of length 1, while the edit distance is 2. — Russ, Oct 14 '20 at 21:39

score 6 · Answer 6 · answered Oct 12 '21 at 19:02

I would recommend not creating this kind of code on your own. There are libraries for that.

For instance the Levenshtein library.


In [2]: Levenshtein.distance("foo", "foobar")
Out[2]: 3

In [3]: Levenshtein.distance("barfoo", "foobar")
Out[3]: 6

In [4]: Levenshtein.distance("Buroucrazy", "Bureaucracy")
Out[4]: 3

In [5]: Levenshtein.distance("Misisipi", "Mississippi")
Out[5]: 3

In [6]: Levenshtein.distance("Misisipi", "Misty Mountains")
Out[6]: 11

In [7]: Levenshtein.distance("Buroucrazy", "Born Crazy")
Out[7]: 4

J.Michael · Answer 7 · 2021-08-27T07:51:19.820

1

Similar to Santoshi's solution above but I made three changes:

One line initialization instead of five
No need to define cost alone (just use int(boolean) 0 or 1)
Instead of double for loop use product, (this last one is only cosmetic, double loop seems unavoidable)

from itertools import product

def edit_distance(s1,s2):      
   d={ **{(i,0):i for i in range(len(s1)+1)},**{(0,j):j for j in range(len(s2)+1)}}
   for i, j in product(range(1,len(s1)+1), range(1,len(s2)+1)): 
       d[i,j]=min((s1[i-1]!=s2[j-1]) + d[i-1,j-1], d[i-1,j]+1, d[i,j-1]+1)
   return d[i,j]

edited Aug 27 '21 at 07:51

answered Nov 12 '20 at 14:16

J.Michael

41
7

You are referencing a variable `s` (*i.e.*, `s[i-1]!=s[j-1]`) which is not defined in your code. – Glenn Slayden Aug 26 '21 at 06:50
Thanks made the correction! Also noted that in Python True + n = n+1, and False + n =n so no need to explicitly convert to type 'int' – J.Michael Aug 27 '21 at 07:52

score 0 · Answer 8 · edited Apr 01 '17 at 13:13

0

Instead of going with Levenshtein distance algo use BK tree or TRIE, as these algorithms have less complexity then edit distance. A good browse over these topic will give a detailed description.

This link will help you more about spell checking.

edited Apr 01 '17 at 13:13

pacholik

8,607
9
43
55

answered Apr 01 '17 at 12:54

jt26

21
3

Inaam Ilahi · Answer 9 · 2019-01-15T14:23:56.707

You need Minimum Edit Distance for this task.

Following is my version of MED a.k.a Levenshtein Distance.

def MED_character(str1,str2):
    cost=0
    len1=len(str1)
    len2=len(str2)

    #output the length of other string in case the length of any of the string is zero
    if len1==0:
        return len2
    if len2==0:
        return len1

    accumulator = [[0 for x in range(len2)] for y in range(len1)] #initializing a zero matrix

    # initializing the base cases
    for i in range(0,len1):
        accumulator[i][0] = i;
    for i in range(0,len2):
        accumulator[0][i] = i;

    # we take the accumulator and iterate through it row by row. 
    for i in range(1,len1):
        char1=str1[i]
        for j in range(1,len2):
            char2=str2[j]
            cost1=0
            if char1!=char2:
                cost1=2 #cost for substitution
            accumulator[i][j]=min(accumulator[i-1][j]+1, accumulator[i][j-1]+1, accumulator[i-1][j-1] + cost1 )

    cost=accumulator[len1-1][len2-1]
    return cost

Zac S.Y · Answer 10 · 2022-04-25T05:17:37.790

Fine tuned codes based on the version from @Santosh and should address the issue brought up by @Artur Krajewski; The biggest difference is replacing an effective 2d matrix


def edit_distance(s1, s2):
# add a blank character for both strings
    m=len(s1)+1
    n=len(s2)+1
# launch a matrix
    tbl = [[0] * n for i in range(m)] 
    for i in range(m): tbl[i][0]=i
    for j in range(n): tbl[0][j]=j

    for i in range(1, m):
        for j in range(1, n):
#if strings have same letters, set operation cost as 0 otherwise 1
            cost = 0 if s1[i-1] == s2[j-1] else 1
#find min practice
            tbl[i][j] = min(tbl[i][j-1]+1, tbl[i-1][j]+1, tbl[i-1][j-1]+cost)
    return tbl

edit_distance("birthday", "Birthdayyy")

score 0 · Answer 11 · answered Jun 08 '22 at 16:33

following up on @krassowski's answer

from difflib import SequenceMatcher

def sequence_matcher_edits(word_a, word_b):
  required_edits = [code for code in (
      SequenceMatcher(a=word_a, b=word_b, autojunk=False).get_opcodes()
    )
    if code[0] != 'equal'
  ]
  return len(required_edits)

print(f"sequence_matcher_edits {sequence_matcher_edits('kitten', 'sitting')}")
# -> sequence_matcher_edits 3

Edit Distance in Python

11 Answers11

Linked

Related