I am using fuzzywuzzy and rapidfuzz to find names mentioned in comments. I read through the documentation of the "token_set_ratio" function but I still don't understand the following:

from rapidfuzz import fuzz

# I preprocessed the comments to remove stop words and other commonly mentioned words
fuzz.token_set_ratio("reporting michael anders sven straumann guy called jonatjan smith partners","jonathan smith")
# returns 52.6

"jonathan smith" differs from the text by only one misspelled letter, so why is the ratio so low?

Moreover, would there be an option to overcome the problem so that Jonathan receives a higher score?

thanks for your help, Michael

1 Answer

fuzz.token_set_ratio is not the right scorer for this problem. It splits both strings into words and compares the sorted sets of words, so a misspelled word such as "jonatjan" never counts as shared and the pairing of first and last name is lost. You could use fuzz.partial_ratio instead, which compares the shorter string only against the best-matching substring of the longer string.

from rapidfuzz import fuzz

fuzz.partial_ratio(
  "reporting michael anders sven straumann guy called jonatjan smith partners",
  "jonathan smith")
# returns 92.85714285714286
maxbachmann
  • Thanks for your help. This actually solves the specific problem I mentioned here. However, this function will (for instance) return 100 for each michael mentioned and not only for michael anders. – Michael Altorfer Oct 09 '20 at 09:17
  • I am not entirely sure what you mean. Do you mean that it returns 100 when you search for "michael" instead of "jonathan smith"? – maxbachmann Oct 09 '20 at 09:27
  • My actual problem is that I have a list of names, e.g. [jonathan smith, michael anders, michael stralund, ...], and I am trying to find all the names that are mentioned in each comment. When I apply token_set_ratio, the score for the misspelled jonathan is so low that I would need to exclude it. However, using partial_token_set_ratio leads to 100 for both michaels although only one is mentioned. – Michael Altorfer Oct 10 '20 at 08:18
  • I updated the answer. Using fuzz.partial_ratio should work a lot better for your use case. – maxbachmann Oct 10 '20 at 09:02
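
For the list-of-names case described in the comments, one possible approach is to score every candidate name against the comment and keep those above a cutoff. The sketch below is only an illustration under stated assumptions: window_score is a hypothetical stand-in built on difflib from the standard library (it compares the name against each run of consecutive comment words of the same length), find_mentioned_names is a made-up helper, and the threshold of 85 is arbitrary and would need tuning on real data.

```python
from difflib import SequenceMatcher


def window_score(name, comment):
    """Best 0-100 similarity between `name` and any run of consecutive
    comment words that has as many words as the name."""
    words = comment.split()
    size = len(name.split())
    starts = range(max(len(words) - size + 1, 1))
    windows = [" ".join(words[i:i + size]) for i in starts]
    return max(100 * SequenceMatcher(None, name, w).ratio() for w in windows)


def find_mentioned_names(comment, names, scorer=window_score, threshold=85):
    """Return (name, score) pairs for candidates scoring at or above the threshold."""
    scored = ((name, scorer(name, comment)) for name in names)
    return [(name, score) for name, score in scored if score >= threshold]


comment = "reporting michael anders sven straumann guy called jonatjan smith partners"
names = ["jonathan smith", "michael anders", "michael stralund"]

found = find_mentioned_names(comment, names)
print([name for name, _ in found])  # ['jonathan smith', 'michael anders']
```

Because first and last name are compared together, "michael stralund" stays below the cutoff even though "michael" appears in the comment. The scorer is a parameter so that a library scorer taking two strings and returning a 0-100 score, such as rapidfuzz's fuzz.partial_ratio, could be passed in its place.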