
I am using fuzzywuzzy to calculate the similarity between two sentences. Here are some results that make no sense to me:

from fuzzywuzzy import fuzz

s1 = "moist tender pork loin chop"
s2 = "corn bicolor"
fuzz.token_sort_ratio(s1,s2)

This gives me a score of 41. On the other hand:

s1 = "store cut sweet yellow corn tray"
s2 = "corn bicolor"
fuzz.token_sort_ratio(s1,s2)

gives me a score of 18.

How can two sentences that actually share a word ("corn" in this case) score lower than two sentences with no words in common?

Thank you!

user3490622
  • I don't know why. But you can use `token_set_ratio`, it outputs `50` for the second example. Also see https://stackoverflow.com/q/31806695/304209. – Dennis Golomazov Jul 05 '18 at 18:43

1 Answer


Fuzzywuzzy is built on Levenshtein distance. From Wikipedia:

Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.

EDIT: As @dennis-golomazov pointed out, there are important differences in detail between token_sort_ratio and token_set_ratio.

token_sort_ratio has four steps:

  1. Split string into tokens
  2. Sort tokens
  3. Join the sorted tokens back into a single string and call the Levenshtein ratio from https://github.com/ztane/python-Levenshtein on the two resulting strings
  4. Return the ratio * 100

Notice that this algorithm compares the two sorted strings character by character as wholes. It gives no extra credit for a token that appears in both strings.
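The four steps can be sketched with the standard library alone. This is a rough approximation, not fuzzywuzzy's actual code: it assumes difflib.SequenceMatcher for the ratio (which fuzzywuzzy falls back to when python-Levenshtein is not installed), it skips the lowercasing and punctuation stripping that fuzzywuzzy does first, and token_sort_sketch is a made-up name for illustration.

```python
from difflib import SequenceMatcher


def token_sort_sketch(s1: str, s2: str) -> int:
    """Approximate token_sort_ratio: split, sort, re-join, compare, scale."""
    # Steps 1 and 2: split on whitespace and sort the tokens.
    # (Assumption: input is already lowercase with no punctuation.)
    sorted1 = " ".join(sorted(s1.split()))
    sorted2 = " ".join(sorted(s2.split()))
    # Step 3: similarity ratio of the two re-joined strings.
    # difflib stands in for the Levenshtein ratio here.
    ratio = SequenceMatcher(None, sorted1, sorted2).ratio()
    # Step 4: scale to 0..100 and round to an integer.
    return int(round(ratio * 100))


print(token_sort_sketch("moist tender pork loin chop", "corn bicolor"))
print(token_sort_sketch("store cut sweet yellow corn tray", "corn bicolor"))
```

With difflib this should print 41 and 18, the scores from the question. In the second pair "corn" sorts to the front of one string and to the end of the other, so once those four characters are matched there is nothing left on either side to line up, and the longer string drags the ratio down.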

Applied to your strings, those steps essentially reduce to:

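# Requires the python-Levenshtein package
from Levenshtein import StringMatcher as sm

# First pair, with the tokens of each string already sorted
s1 = "chop loin moist tender pork"
s2 = "bicolor corn"

m = sm.StringMatcher(None, s1, s2)
print(int(round(m.ratio() * 100)))

# Second pair, with the tokens of each string already sorted
s1 = "corn cut store sweet tray yellow"
s2 = "bicolor corn"

m = sm.StringMatcher(None, s1, s2)
print(int(round(m.ratio() * 100)))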

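These ratios should line up with the ones you saw in your test case, although the exact values can differ depending on whether fuzzywuzzy is using python-Levenshtein or its difflib fallback.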

So for this case you would want to use fuzz.token_set_ratio, since it accounts for the fact that "corn" is in both strings and scores the match accordingly.
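To see why the set-based variant behaves differently, here is a simplified sketch of the idea as I understand it: pull out the shared tokens, then compare the shared part against each full string and take the best score. It is not fuzzywuzzy's actual implementation. It again assumes difflib for the ratio and skips preprocessing, and token_set_sketch and _ratio are hypothetical names.

```python
from difflib import SequenceMatcher


def _ratio(a: str, b: str) -> int:
    # Treat an empty string as "no similarity" rather than a perfect match.
    if not a or not b:
        return 0
    return int(round(100 * SequenceMatcher(None, a, b).ratio()))


def token_set_sketch(s1: str, s2: str) -> int:
    """Approximate token_set_ratio: score the shared tokens against each side."""
    tokens1, tokens2 = set(s1.split()), set(s2.split())
    # Tokens present in both strings, and the leftovers unique to each one.
    shared = " ".join(sorted(tokens1 & tokens2))
    only1 = " ".join(sorted(tokens1 - tokens2))
    only2 = " ".join(sorted(tokens2 - tokens1))
    # Rebuild each string as "shared tokens + its own leftovers".
    combined1 = (shared + " " + only1).strip()
    combined2 = (shared + " " + only2).strip()
    # Best of the three pairwise comparisons.
    return max(
        _ratio(shared, combined1),
        _ratio(shared, combined2),
        _ratio(combined1, combined2),
    )


print(token_set_sketch("moist tender pork loin chop", "corn bicolor"))
print(token_set_sketch("store cut sweet yellow corn tray", "corn bicolor"))
```

Under these assumptions the first pair stays at 41 (no shared tokens, so only the last comparison counts), while the second rises to 50 because "corn" is compared directly against "corn bicolor". That agrees with the 50 reported in the comment on the question.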

mkamerath