Fuzzywuzzy is implemented using Levenshtein distance. From Wikipedia:

> Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other.
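To make the definition concrete, here is a minimal dynamic-programming sketch of the distance itself (fuzzywuzzy actually delegates to the python-Levenshtein C extension, so this is just for illustration):

```python
def levenshtein(a: str, b: str) -> int:
    # prev[j] holds the edit distance between the current prefix of a
    # and the first j characters of b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("kitten", "sitting"))  # 3: k->s, e->i, insert g
```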
EDIT
As @dennis-golomazov pointed out, there are important differences in detail between token_sort_ratio and token_set_ratio.
token_sort_ratio has four steps:
- Split string into tokens
- Sort tokens
- Call Levenshtein ratio from https://github.com/ztane/python-Levenshtein on the sorted tokens.
- Return the ratio * 100
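The first two steps amount to splitting on whitespace and re-joining the sorted tokens. A rough sketch with a hypothetical input (fuzzywuzzy's real preprocessing also lowercases and strips non-alphanumeric characters):

```python
def token_sort(s: str) -> str:
    # split into tokens, sort them, and re-join with single spaces;
    # this is only the tokenize + sort part of token_sort_ratio
    return " ".join(sorted(s.split()))

print(token_sort("sweet yellow corn"))  # "corn sweet yellow"
```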
Notice that this algorithm doesn't care about partial matches.

When these steps are applied to your strings, the code essentially becomes:
from Levenshtein import StringMatcher as sm

# token_sort_ratio compares the token-sorted strings as a whole, so the
# shared token "corn" is diluted by everything else in the longer string
s1 = "chop loin moist tender pork"
s2 = "bicolor corn"
m = sm.StringMatcher(None, s1, s2)
print(int(m.ratio() * 100))

s1 = "corn cut store sweet tray yellow"
s2 = "bicolor corn"
m = sm.StringMatcher(None, s1, s2)
print(int(m.ratio() * 100))
You'll notice that these ratios match the ones you saw in your test case.
So you would definitely want to use fuzz.token_set_ratio, since it accounts for the fact that "corn" appears in both strings and scores the match accordingly.
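The token_set_ratio idea can be sketched as follows. This is a simplification using the stdlib's difflib.SequenceMatcher in place of python-Levenshtein's ratio, so the exact scores will differ from fuzzywuzzy's, but the shape of the algorithm is the same: compare the sorted token intersection against each full token set and take the best score.

```python
from difflib import SequenceMatcher

def ratio(a: str, b: str) -> int:
    # stdlib stand-in for python-Levenshtein's ratio, scaled to 0-100
    return int(SequenceMatcher(None, a, b).ratio() * 100)

def token_set_ratio(a: str, b: str) -> int:
    ta, tb = set(a.split()), set(b.split())
    sect = " ".join(sorted(ta & tb))                        # shared tokens
    combined_a = (sect + " " + " ".join(sorted(ta - tb))).strip()
    combined_b = (sect + " " + " ".join(sorted(tb - ta))).strip()
    # comparing the intersection against each combined string means a
    # fully shared token set scores 100 no matter what tokens are left over
    return max(ratio(sect, combined_a),
               ratio(sect, combined_b),
               ratio(combined_a, combined_b))

print(token_set_ratio("corn cut store sweet tray yellow", "bicolor corn"))
print(ratio("corn cut store sweet tray yellow", "bicolor corn"))  # sort-style score
```

On the corn example, the set-based score comes out well above the whole-string score, because "corn" versus "corn bicolor" is a much closer match than the two full strings are.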