Fuzzywuzzy scores for sentences w/no overlapping words are higher than those with some overlap?

Question

I am using fuzzywuzzy to calculate the similarity between two sentences. Here are some results that make no sense to me:

from fuzzywuzzy import fuzz

s1 = "moist tender pork loin chop"
s2 = "corn bicolor"
fuzz.token_sort_ratio(s1,s2)

This gives me a score of 41. On the other hand:

s1 = "store cut sweet yellow corn tray"
s2 = "corn bicolor"
fuzz.token_sort_ratio(s1,s2)

gives me a score of 18.

How can a score between two sentences that do actually have an overlapping word ("corn" in this case) be lower than the score for the sentences with no overlapping words?

Thank you!

I don't know why. But you can use `token_set_ratio`, it outputs `50` for the second example. Also see https://stackoverflow.com/q/31806695/304209. — Dennis Golomazov, Jul 05 '18 at 18:43

mkamerath · Answer 1 · 2018-07-05T20:40:57.073

Fuzzywuzzy's scores are based on Levenshtein distance. From Wikipedia:

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
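To make that definition concrete, here is a minimal pure-Python sketch of the distance. The function name `levenshtein` is made up for this illustration; it is not what fuzzywuzzy or python-Levenshtein call internally.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn a into b."""
    # prev[j] is the distance between the previous prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into "" takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

The distance only counts character edits; it has no notion of words, which matters for the rest of this answer.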

EDIT: As @dennis-golomazov pointed out, token_sort_ratio and token_set_ratio differ in important details.

token_sort_ratio has four steps:

1. Split each string into tokens
2. Sort the tokens
3. Call the Levenshtein ratio from https://github.com/ztane/python-Levenshtein on the strings rebuilt from the sorted tokens
4. Return the ratio * 100
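The four steps can be sketched with the standard library alone. This is an illustration, not fuzzywuzzy's code: `token_sort_ratio_sketch` is a hypothetical name, plain lowercasing stands in for the library's fuller preprocessing, and `difflib.SequenceMatcher` (the matcher fuzzywuzzy falls back to when python-Levenshtein is not installed) stands in for the Levenshtein ratio, so a Levenshtein-based backend may return different numbers.

```python
from difflib import SequenceMatcher


def token_sort_ratio_sketch(s1: str, s2: str) -> int:
    """Hypothetical re-implementation of the four steps, for illustration."""

    def sort_tokens(s: str) -> str:
        # Steps 1 and 2: split on whitespace, sort the tokens, rejoin them.
        # Assumption: lowercasing is the only preprocessing done here.
        return " ".join(sorted(s.lower().split()))

    a, b = sort_tokens(s1), sort_tokens(s2)
    # Step 3: similarity ratio of the two rebuilt strings (difflib here).
    ratio = SequenceMatcher(None, a, b).ratio()
    # Step 4: scale to 0-100.
    return int(round(100 * ratio))


print(token_sort_ratio_sketch("moist tender pork loin chop", "corn bicolor"))
print(token_sort_ratio_sketch("store cut sweet yellow corn tray", "corn bicolor"))
```

With this backend the shared "corn" in the second pair contributes only 4 matching characters out of 44 in total, while the first pair picks up 8 scattered character matches out of 39, which is how a pair with no common word can end up with the higher score.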

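Notice that this algorithm gives no special credit to a token that appears in both strings: it compares the rebuilt strings character by character, so a shared word like "corn" counts for no more than its four characters.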

When these steps are applied to your strings, the code essentially becomes:

from Levenshtein import StringMatcher as sm

# First pair, with each string's tokens already sorted alphabetically
s1 = "chop loin moist pork tender"
s2 = "bicolor corn"

m = sm.StringMatcher(None, s1, s2)
print(int(m.ratio() * 100))

# Second pair, tokens sorted the same way
s1 = "corn cut store sweet tray yellow"
s2 = "bicolor corn"

m = sm.StringMatcher(None, s1, s2)
print(int(m.ratio() * 100))

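These ratios should match the ones you saw in your test case, provided fuzzywuzzy is running on the same backend: it uses python-Levenshtein only when that package is installed and otherwise falls back to difflib.SequenceMatcher, and the two can score the same pair differently.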

So you would want to use fuzz.token_set_ratio here: it accounts for the fact that "corn" appears in both strings and scores the pair accordingly.
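For comparison, here is a rough sketch of the set-based approach as I understand it: split each string into a set of tokens, separate the shared tokens from the ones unique to each side, and keep the best of three pairwise ratios. The names `token_set_ratio_sketch` and `_ratio` are hypothetical, and `difflib.SequenceMatcher` again stands in for the library's matcher, so treat this as an approximation and not the library's implementation.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    # Assumption: an empty string scores 0 instead of being compared.
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_ratio_sketch(s1: str, s2: str) -> int:
    """Hypothetical sketch of a set-based token comparison."""
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    shared = " ".join(sorted(t1 & t2))  # tokens present in both strings
    only1 = " ".join(sorted(t1 - t2))   # tokens unique to s1
    only2 = " ".join(sorted(t2 - t1))   # tokens unique to s2
    combined1 = (shared + " " + only1).strip()
    combined2 = (shared + " " + only2).strip()
    # The shared tokens are compared on their own as well, so a common
    # word lifts the score even when the rest of the strings differ.
    return max(
        _ratio(shared, combined1),
        _ratio(shared, combined2),
        _ratio(combined1, combined2),
    )


print(token_set_ratio_sketch("store cut sweet yellow corn tray", "corn bicolor"))
print(token_set_ratio_sketch("moist tender pork loin chop", "corn bicolor"))
```

In the second example from the question, the best of the three comparisons is "corn" against "corn bicolor", which is where the 50 mentioned in the comment comes from.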
