How to find similar words in a set?

Question

word = "work"
word_set = {"word","look","wrap","pork"}

How can I find the similar words, given that both "word" and "pork" need only one letter changed to become "work"?

I am wondering whether there is a method to find the difference between a string and each item in a set.

Actually, the correct term to search for is "`Levenshtein distance`". — MattDMo, Mar 22 '16 at 01:15
Levenshtein distance is just a specific metric from a family of distance metrics. — Dmitry B., Mar 22 '16 at 22:26

RootTwo · Accepted Answer · 2016-03-22T14:46:05.540

4

Use difflib.get_close_matches() from the standard library:

import difflib

word = "work"
word_set = {"word","look","wrap","pork"}

difflib.get_close_matches(word, word_set)

returns:

['word', 'pork']

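EDIT: If needed, difflib.SequenceMatcher.get_opcodes() can be used to estimate the edit distance. Note that this counts non-equal opcode blocks, so a replacement that spans several characters counts as a single edit:

# Reuse one matcher; seq2 (b) is the target word and stays fixed.
matcher = difflib.SequenceMatcher(b=word)
for test_word in word_set:
    matcher.set_seq1(test_word)
    # Count the opcode blocks that are not 'equal' (replace, insert, delete).
    distance = sum(1 for op in matcher.get_opcodes() if op[0] != 'equal')
    print(distance, test_word)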
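
The opcode count is only an approximation: for "wrap" it gives 2, while the Levenshtein distance to "work" is 3. If an exact value is needed without a third-party package, the standard dynamic-programming approach can be sketched as below. The function name levenshtein is my own and is not part of difflib.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    # previous[j] = distance between the prefix of a processed so far and b[:j]
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        previous = current
    return previous[-1]


word = "work"
word_set = {"word", "look", "wrap", "pork"}

# Map each candidate to its exact edit distance from the target word.
distances = {w: levenshtein(w, word) for w in word_set}
print(distances)
```

Filtering for distances[w] == 1 then gives exactly the one-edit neighbours.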

edited Mar 22 '16 at 14:46

answered Mar 22 '16 at 04:46


Nice - hadn't heard of difflib. Note that get_close_matches will also return an exact match, so you should check for that and remove it. Also, it is returning words with greater similarity than the threshold (default 0.6) rather than specifically 1 character off - this would become apparent with an example with longer words, where the same code would return words that were off by more characters. Here we get lucky, since a 4 character word with 1 character off has similarity 0.75, vs. 2 characters off which has a similarity of 0.5. – emmagordon Mar 22 '16 at 13:17
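
To make the threshold point concrete, here is a small sketch that prints the similarity ratios, filters out an exact match, and shows a longer word slipping through the default cutoff. The extra "work" entry and the "handbook"/"handball" pair are my own illustrative additions, not from the question.

```python
import difflib

word = "work"
# "work" is added here only to demonstrate the exact-match case.
word_set = {"word", "look", "wrap", "pork", "work"}

# ratio() is the similarity score that get_close_matches compares to cutoff.
ratios = {w: difflib.SequenceMatcher(None, w, word).ratio() for w in word_set}
for w in sorted(ratios):
    print(w, ratios[w])

# Drop the exact match from the results (cutoff defaults to 0.6).
matches = [w for w in difflib.get_close_matches(word, word_set, cutoff=0.6)
           if w != word]
print(sorted(matches))

# With longer words the default cutoff admits words several letters off:
# "handball" differs from "handbook" in three positions.
long_matches = difflib.get_close_matches("handbook", ["handball"])
print(long_matches)
```

Raising cutoff tightens the filter, but a ratio threshold is still relative to word length, so it is not a substitute for an exact "one letter off" test.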

emmagordon · Answer 2 · 2016-03-22T01:31:52.563

0

You could do something like:

word = "work"
word_set = {"word", "look", "wrap", "pork"}

for example in word_set:
    # Only words of the same length can differ by a single substitution.
    if len(example) != len(word):
        continue
    # Count the positions where the two words differ.
    num_chars_out = sum(c1 != c2 for c1, c2 in zip(example, word))
    if num_chars_out == 1:
        print(example)
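
The loop above is effectively a Hamming-distance check: same length, then count the differing positions. Wrapped in functions, as a rough sketch, it can be reused for other words. The names hamming_distance and one_letter_neighbors are illustrative only.

```python
def hamming_distance(a, b):
    """Count the positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(c1 != c2 for c1, c2 in zip(a, b))


def one_letter_neighbors(word, candidates):
    """Return the candidates that differ from word in exactly one position."""
    return {
        c for c in candidates
        # The length check short-circuits, so hamming_distance never raises here.
        if len(c) == len(word) and hamming_distance(c, word) == 1
    }


word_set = {"word", "look", "wrap", "pork"}
neighbors = one_letter_neighbors("work", word_set)
print(sorted(neighbors))
```

Unlike a full edit distance, this only counts substitutions, so a word such as "works" (one insertion away) would not be matched.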

edited Mar 22 '16 at 01:31

answered Mar 22 '16 at 01:25


score 0 · Answer 3 · answered Mar 22 '16 at 01:35

I would recommend the third-party editdistance Python package, which provides an editdistance.eval function that calculates the minimum number of single-character insertions, deletions, and substitutions needed to turn the first word into the second. Edit distance here means Levenshtein distance, which was suggested by MattDMo.

In your case, if you want to identify words within an edit distance of 1 of the target word, you could do:

import editdistance as ed

thresh = 1
w1 = "work"
word_set = set(["word","look","wrap","pork"])
neighboring_words = [w2 for w2 in word_set if ed.eval(w1, w2) <= thresh]

print(neighboring_words)

with neighboring_words containing 'pork' and 'word' (the order may vary, since sets are unordered).
