28

I have a list of words

words = ['car', 'animal', 'house', 'animation']

and I want to compare every item with a string str1 and get the most similar word as output. Example: if str1 is anlmal, then animal is the most similar word. How can I do this in Python? The words in my list are usually easy to distinguish from each other.

JohnB

2 Answers

46

Use difflib:

difflib.get_close_matches(word, ['car', 'animal', 'house', 'animation'])

As you can see from perusing the source, the "close" matches are sorted from best to worst.

>>> import difflib
>>> difflib.get_close_matches('anlmal', ['car', 'animal', 'house', 'animation'])
['animal']
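
If you want a single word rather than a list, or several candidates at once, the n and cutoff parameters of get_close_matches control that. A minimal sketch follows; closest_word is a hypothetical helper name for illustration, not part of difflib:

```python
import difflib

words = ['car', 'animal', 'house', 'animation']

def closest_word(word, candidates, cutoff=0.6):
    """Return the single best match, or None if nothing reaches the cutoff.

    closest_word is a hypothetical helper for illustration; cutoff=0.6
    mirrors the default of difflib.get_close_matches.
    """
    matches = difflib.get_close_matches(word, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

print(closest_word('anlmal', words))

# Lower the cutoff and raise n to get several candidates, best first.
print(difflib.get_close_matches('anlmal', words, n=2, cutoff=0.0))
```

Here n caps how many matches are returned (default 3) and cutoff is the minimum similarity score, between 0 and 1 (default 0.6). Candidates scoring below the cutoff are dropped, so the result can be an empty list.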
mgilson
  • Is this time-consuming if I have a big list? Or does this function have any speed optimization? – Josir Nov 15 '19 at 18:02
3

I tried difflib.get_close_matches(), but it didn't work correctly for me. Here is an alternative that scores each candidate by the number of positions at which its characters match the test string. Use it as:

closest_match, closest_match_idx = find_closest_match(test_str, list2check)

import numpy

def find_closest_match(test_str, list2check):
    scores = {}
    for ii in list2check:
        # Compare position by position over the length of the shorter string
        cnt = 0
        if len(test_str) <= len(ii):
            str1, str2 = test_str, ii
        else:
            str1, str2 = ii, test_str
        for jj in range(len(str1)):
            cnt += 1 if str1[jj] == str2[jj] else 0
        scores[ii] = cnt
    scores_values = numpy.array(list(scores.values()))
    # Index of the highest score (last entry of the ascending argsort)
    closest_match_idx = numpy.argsort(scores_values, axis=0, kind='quicksort')[-1]
    closest_match = numpy.array(list(scores.keys()))[closest_match_idx]
    return closest_match, closest_match_idx
amit
  • Do you know if it is possible to return not only the closest match but the top n, say the top 5? – Jorge A. Salazar Jul 13 '21 at 19:22
  • I know that is possible – Hacky Dec 21 '22 at 21:47
  • Well @JorgeA.Salazar, a simple solution for now would be to run the function n times, removing closest_match from list2check after each iteration. One could also modify the code. – amit Jan 06 '23 at 21:16
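
The top-n idea from the comments can also be done in one pass by ranking the scores instead of calling the function repeatedly. A rough sketch, assuming the same position-by-position score as the answer above; top_n_matches and positional_score are hypothetical names, and heapq from the standard library is swapped in for the numpy sorting:

```python
import heapq

def positional_score(test_str, candidate):
    # Count positions where the characters agree; zip stops at the shorter string.
    return sum(a == b for a, b in zip(test_str, candidate))

def top_n_matches(test_str, list2check, n=5):
    """Return up to n (word, score) pairs, highest score first."""
    scored = [(word, positional_score(test_str, word)) for word in list2check]
    return heapq.nlargest(n, scored, key=lambda pair: pair[1])

print(top_n_matches('anlmal', ['car', 'animal', 'house', 'animation'], n=2))
# [('animal', 5), ('animation', 4)]
```

Because the score only compares characters at the same index, a single inserted or deleted letter near the start shifts everything after it and lowers the score sharply; difflib's matching does not have that limitation.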