14

I am trying to understand how the process.extract() function in the Python module fuzzywuzzy works.

I mainly read about the fuzzywuzzy package here: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/, which is a great post explaining different scenarios that come up in fuzzy matching. It discusses several scenarios for partial string similarity:

1) Out Of Order
2) Token Sort
3) Token Set

And then, from this post: https://pathindependence.wordpress.com/2015/10/31/tutorial-fuzzywuzzy-string-matching-in-python-improving-merge-accuracy-across-data-products-and-naming-conventions/ I learned how to use fuzzywuzzy's process.extract() function to basically select the top k matches.

I cannot find much information about how the process.extract() function works. Here is the description I found in the source on their GitHub page (https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py), which says that this function will:

Find best matches in a list or dictionary of choices, return a list of tuples containing the match and it's score. If a dictionary is used, also returns the key for each match.

However, it does not explain HOW it finds the best matches. Does it take all three scenarios I mentioned above into account?

The reason I ask is that, when I use this function, sometimes two strings that are very similar are not matched.

For example, in my current sample data set, the to-be-matched string

"Total replenishment lead time (in workdays)"

it is matched to

"PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG"

but not to (the right answer)

"FULL_LEAD_TIME"

Even though the right answer contains "lead time" just like the to-be-matched string does, it is not matched at all. Why? And somehow the other choices, which do not look like the to-be-matched string, do get matched. Why? I am quite clueless now.

alwaysaskingquestions

3 Answers

19

The other answer is wrong in a key respect: it infers that, because the result of process.extract was the same as fuzz.partial_ratio in one case, the two must be doing the same thing by default.

process.extract actually uses WRatio() by default, which is a weighted combination of the four fuzz ratios. This is actually a cool functionality that empirically works pretty well across fuzzy matching scenarios.

Still, you can manually specify the string comparison function via the scorer argument to extract.

Source for process.extract: https://github.com/seatgeek/fuzzywuzzy/blob/master/fuzzywuzzy/process.py
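To make the role of the scorer argument concrete, here is a minimal sketch of what an extract-style function does: score every choice with a pluggable function, sort, and keep the top few. This is not fuzzywuzzy's implementation. The names extract_sketch, raw_ratio and normalized_ratio are hypothetical, and the standard library's difflib stands in for the fuzz scorers.

```python
from difflib import SequenceMatcher


def raw_ratio(a, b):
    """Stand-in scorer: 0-100 similarity of the strings exactly as given."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def normalized_ratio(a, b):
    """Stand-in scorer: lowercase and turn underscores into spaces first."""
    def norm(s):
        return s.lower().replace("_", " ")
    return raw_ratio(norm(a), norm(b))


def extract_sketch(query, choices, scorer=raw_ratio, limit=5):
    """Score every choice against the query and return the top `limit` pairs."""
    scored = [(choice, scorer(query, choice)) for choice in choices]
    # Highest score first; ties keep their original order (sort is stable).
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:limit]


query = "Total replenishment lead time (in workdays)"
choices = ["PLANNING_TIME_FENCE_CODE", "BUILD_IN_WIP_FLAG", "FULL_LEAD_TIME"]

default_matches = extract_sketch(query, choices)
normalized_matches = extract_sketch(query, choices, scorer=normalized_ratio, limit=2)
print(default_matches)
print(normalized_matches)
```

The point of the sketch is only that the ranking depends entirely on the scorer you plug in, so a surprising match usually means the scorer does not measure the kind of similarity you have in mind.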

Jack Rowntree
8

There are four ratios in fuzzywuzzy's comparison functions.

  • ratio: The basic similarity of two strings, based on Levenshtein distance.
  • partial_ratio: The ratio of the most similar substring.
  • token_sort_ratio: A measure of the sequences' similarity, sorting the tokens before comparing.
  • token_set_ratio: Finds all alphanumeric tokens in each string and compares the resulting token sets.

More details on the ratios can be found here: http://chairnerd.seatgeek.com/fuzzywuzzy-fuzzy-string-matching-in-python/
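The ideas behind the ratios listed above can be sketched with the standard library alone. This is a simplified illustration, not fuzzywuzzy's code: the *_sketch names are hypothetical, difflib.SequenceMatcher stands in for the underlying string comparison, tokens are split on whitespace only, and the partial ratio is done with a brute-force sliding window.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Plain similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_ratio_sketch(a, b):
    """Best score of the shorter string against any same-length window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    width = len(shorter)
    return max(ratio(shorter, longer[i:i + width])
               for i in range(len(longer) - width + 1))


def token_sort_ratio_sketch(a, b):
    """Sort the tokens of each string, then compare, so word order is ignored."""
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return ratio(norm(a), norm(b))


def token_set_ratio_sketch(a, b):
    """Compare the shared tokens against each string's shared-plus-remaining tokens."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    common = " ".join(sorted(ta & tb))
    full_a = (common + " " + " ".join(sorted(ta - tb))).strip()
    full_b = (common + " " + " ".join(sorted(tb - ta))).strip()
    return max(ratio(common, full_a), ratio(common, full_b), ratio(full_a, full_b))


print(partial_ratio_sketch("lead time", "total replenishment lead time"))
print(token_sort_ratio_sketch("time lead", "lead time"))
print(token_set_ratio_sketch("lead time", "lead time lead time total"))
```

In each of the three calls the second string contains the first one exactly, reordered, or as a subset of its tokens, which is the case each ratio is designed to reward.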

By default process.extract() uses partial_ratio for comparison, but you can also override it with the scorer parameter of process.extract().

Ex.

from fuzzywuzzy import fuzz, process

query = 'Total replenishment lead time (in workdays)'
choices = ['PLANNING_TIME_FENCE_CODE', 'BUILD_IN_WIP_FLAG', 'Lead_time_planning']
print(fuzz.partial_ratio(query, 'Lead_time_planning'))
print(process.extract(query, choices))

Results will be :

50
[('Lead_time_planning', 50), ('PLANNING_TIME_FENCE_CODE', 38), ('BUILD_IN_WIP_FLAG', 26)]

Which shows it is by default using partial_ratio, which you can override anytime.

marc_s
Ashpak Mulani
  • 4
    override example `process.extract(query, choices, scorer=fuzzywuzzy.token_set_ratio)` – VISQL Jun 01 '20 at 10:19
  • 1
    Today you have to do `from fuzzywuzzy import fuzz` and then pass the scorer as `fuzz.token_set_ratio`. – igorkf Nov 18 '20 at 19:25
1

I've been asking myself the same question about WRatio, the default scorer of process.extract. For some reason its result is really weird: all the other scorers identify the correct match (Alphabet -> Alphabet), but with the default scorer, probably because of the lone A character in my query string, I get a higher score for names containing A/S than for Alphabet. If anyone can shed some light on why this is, that'd be awesome:

process.extract("ALPHABET- A",RIFT_IDS['EntityName'], scorer = fuzz.token_set_ratio, limit=3)
[('Alphabet Inc', 89, 4955), ('Haemato AG', 60, 9078), ('Vale SA', 59, 1894)]

process.extract("ALPHABET- A",RIFT_IDS['EntityName'], scorer = fuzz.token_sort_ratio, limit=3)
[('Alphabet Inc', 73, 4955), ('Haemato AG', 60, 9078), ('Vale SA', 59, 1894)]

process.extract("ALPHABET- A",RIFT_IDS['EntityName'], scorer = fuzz.partial_ratio, limit=3)
[('Alphabet Inc', 82, 4955), ('EQT AB', 73, 5838), ('BEL SA', 67, 2430)]

process.extract("ALPHABET- A",RIFT_IDS['EntityName'], scorer = fuzz.ratio, limit=3)
[('Alphabet Inc', 78, 4955), ('Alpha Bank SA', 67, 4720), ('Pharnext SA', 64, 9228)]

process.extract("ALPHABET- A",RIFT_IDS['EntityName'], limit=3)
[('Iss A/S', 86, 4), ('Vestas Wind Systems A/S', 86, 87), ('AP Moeller - Maersk A/S', 86, 126)]

EDIT: Well, from the GitHub code there appears to be another scorer called partial_token_set_ratio, which appears to be the culprit. Still, I'd like to understand which scorers are used in WRatio and what their respective weights are. It'd be great if it were possible to create our own custom WRatio and choose which scorers it uses.
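On the custom WRatio idea: since the scorer argument accepts a function, one way to approach it is to build your own combined scorer. Below is a rough sketch under stated assumptions. The helper make_weighted_scorer is hypothetical, the weights are made up for illustration and are not WRatio's, and difflib stands in for the fuzz scorers so the snippet runs without fuzzywuzzy installed.

```python
from difflib import SequenceMatcher


def plain_ratio(a, b):
    """Stand-in for a basic ratio scorer (case-insensitive, 0-100)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def token_sort_ratio(a, b):
    """Stand-in for a token-sort scorer: ignore word order before comparing."""
    def norm(s):
        return " ".join(sorted(s.lower().split()))
    return plain_ratio(norm(a), norm(b))


def make_weighted_scorer(weighted_scorers):
    """Build a scorer that returns the best weighted score among its components."""
    def scorer(a, b):
        return round(max(weight * fn(a, b) for fn, weight in weighted_scorers))
    return scorer


# Hypothetical weights: trust the plain ratio fully, discount the token-sort score.
custom_scorer = make_weighted_scorer([(plain_ratio, 1.0), (token_sort_ratio, 0.95)])

score_same = custom_scorer("Alphabet Inc", "alphabet inc")
score_swapped = custom_scorer("Inc Alphabet", "Alphabet Inc")
print(score_same, score_swapped)
```

With fuzzywuzzy itself, the same pattern should work by listing the fuzz functions you want (and leaving out the ones that cause trouble, such as partial_token_set_ratio here) and passing the resulting function as scorer= to process.extract.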

Edgar