comparison of two strings and returning the one most similar

Question

I have to write a function that takes a string as argument and compair this string to two other strings and return the string most similar and the number of differences.

def func("LUMB"):
    lst=["JIBM", "NUNE", "NUMB"]
should return:
("NUMB",1)

I have tried:

def f(word):
    lst=["JIBM", "NUNE", "NUMB"]
    for i in lst:
        d=k(word, lst)
        return differences
        for n in d:
            print min(sum(n))

where:

def k(word1, word2):
    L=[]
    for w in range(len(word1)):
        if word1[w] != word2[w]:
            L.append(1)
        else:
            L.append(0)
    return L

so that i get a list of eg, [1,0,0,0] if word1="NUMB" and word2="LUMB"

Have you seen [Text difference algorithm](http://stackoverflow.com/questions/145607/text-difference-algorithm) and [Good Python modules for fuzzy string comparison](http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison) — Chris, Dec 15 '11 at 11:21
A number of answers would be available on this link too http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison — Zain Khan, Dec 15 '11 at 11:32
On the site there is a similar post. You would get some more valuable answers over here http://stackoverflow.com/questions/682367/good-python-modules-for-fuzzy-string-comparison — Zain Khan, Dec 15 '11 at 11:33

Lauritz V. Thaulow · Answer 1 · 2011-12-15T11:33:25.280

10

Looks like Shawn Chin has provided the best solution, but if you're prevented from using non-builtin modules, it seems like get_close_matches from difflib might help:

import difflib
difflib.get_close_matches("LUMB", ["JIBM", "NUNE", "NUMB"], 1)

The number of differences can be gotten using the get_opcodes method of SequenceMatcher and working with its return value.

edited Dec 15 '11 at 11:33

answered Dec 15 '11 at 11:20

Lauritz V. Thaulow

49,139
12
73
92

While this does not return the edit distance, I like that it uses only the standard library. +1 – Shawn Chin Dec 15 '11 at 11:49

Shawn Chin · Answer 2 · 2011-12-15T11:47:34.897

Using pylevenshtein to calculate Levenshtein distance:

>>> from Levenshtein import distance
>>> from operator import itemgetter
>>> lst = ["JIBM", "NUNE", "NUMB"]
>>> min([(x, distance("LUMB", x)) for x in lst], key=itemgetter(1))
('NUMB', 1)

Or, as a function:

from Levenshtein import distance
from operator import itemgetter
def closest(word, lst):
    return min([(x, distance(word, x)) for x in lst], key=itemgetter(1))

print closest("NUMB", ["JIBM", "NUNE", "NUMB"])

p.s. If you want to avoid additional dependencies, you could always implement your own function for calculating the distance. For example, several version are proposed in wikibooks each with their own pros and cons.

However, if performance is a concern, do consider sticking to the custom built modules. Apart from pylevenshtein, there's also python-levenshtein and nltk.metrics.distance (if you happen to already use NLTK).

comparison of two strings and returning the one most similar

2 Answers2