How to speed up Fuzzy Matching using Fuzzywuzzy in Python

Question

I am using Fuzzywuzzy in Python to match people's names between two lists. However, the runtime is too long: one list contains 25,000 names and the other contains 39,000. The job has been running for 20 hours now.

Previously, I used the same code to match two lists of 6,000 and 3,000 names, and the runtime was 1 hour. Based on that, my current job would take more than 50 hours.
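
As a rough check of that figure, assuming the runtime scales with the number of name pairs compared (an assumption; the earlier run was not profiled), the arithmetic works out as follows:

```python
# Rough estimate, assuming runtime is proportional to the number of pairs compared.
previous_pairs = 6_000 * 3_000     # 18,000,000 comparisons took about 1 hour
current_pairs = 25_000 * 39_000    # 975,000,000 comparisons
previous_hours = 1.0

estimated_hours = previous_hours * current_pairs / previous_pairs
print(f"Estimated runtime: {estimated_hours:.1f} hours")  # about 54.2 hours
```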

Below is my code:

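import pandas as pd
from fuzzywuzzy import fuzz, process

def match_names(wrong_names, correct_names):
    # Keep the result lists local so repeated calls do not accumulate old results.
    names_array = []
    ratio_array = []
    for row in wrong_names:
        # For a list of choices, extractOne returns a (best_match, score) tuple.
        x = process.extractOne(row, correct_names, scorer=fuzz.token_set_ratio)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array

# Names to be corrected (rows with a missing name are dropped).
df = pd.read_csv("wrong-country-names.csv", encoding="ISO-8859-1")
wrong_names = df['name'].dropna().values

# Reference list of correct names.
choices_df = pd.read_csv("country-names.csv", encoding="ISO-8859-1")
correct_names = choices_df['name'].values

name_match, ratio_match = match_names(wrong_names, correct_names)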

I chose fuzz.token_set_ratio as a scorer to perform this many-to-many match based on the data I have.

Below is the sample data:

wrong_names = ['Alberto Steve', 'David Lee']
correct_names = ['Alberto Lee Steve', 'David Steve Lee']

Basically, the wrong names list does not contain middle names. To ignore this difference and still generate a reasonable match, I chose fuzz.token_set_ratio.

While researching online, I found a suggestion to install the python-Levenshtein package, which reportedly speeds up the runtime by 4-10 times. However, my job has already been running for more than 20 hours and I don't want to interrupt it, so I will try that afterwards.

I am wondering if there are other options to improve this.

Thanks in advance.

Can you make some assumption, like the first letters must match because it's always FIRST (MIDDLE) LAST? Then filter the second input so that you are only matching A* names against, let's say, 'Alberto Steve'. Alberto Steve is probably not Walter White, so why try matching all of the Ws? I'm not familiar with fuzzywuzzy; this was just a random thought. — Correy Koshnick, Jan 25 '22 at 17:21
Use `rapidfuzz` instead of `fuzzywuzzy`. rapidfuzz does some string operations before calculating the Levenshtein distance, which drastically reduces processing time. `pip install rapidfuzz`, then `from rapidfuzz import process, fuzz` and call `process.extractOne`. — Shreyesh Desai, May 05 '22 at 10:35
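
The first-letter filtering suggested in Correy Koshnick's comment could be sketched roughly as below. This is only an illustration under stated assumptions: the function names `simple_ratio` and `match_names_blocked` are hypothetical, and the default scorer is a standard-library stand-in built on `difflib`, not fuzzywuzzy's `token_set_ratio`. Any scorer that takes two strings and returns a number, such as `fuzz.token_set_ratio` from the question, could be passed in instead. The approach also assumes the first letter of each name is reliable, so a typo in the first character would miss its match.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def simple_ratio(a, b):
    """Stand-in scorer (0-100) based on difflib; NOT fuzzywuzzy's token_set_ratio."""
    return 100 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_names_blocked(wrong_names, correct_names, scorer=simple_ratio):
    """Match each wrong name only against correct names sharing its first letter."""
    # Group the correct names by their lowercased first letter ("blocking").
    blocks = defaultdict(list)
    for name in correct_names:
        cleaned = name.strip()
        if cleaned:
            blocks[cleaned[0].lower()].append(cleaned)

    matches = []
    for name in wrong_names:
        cleaned = name.strip()
        candidates = blocks.get(cleaned[0].lower(), []) if cleaned else []
        if not candidates:
            # No correct name starts with the same letter.
            matches.append((name, None, 0.0))
            continue
        # Score only the candidates in the matching block and keep the best one.
        scored = [(scorer(cleaned, candidate), candidate) for candidate in candidates]
        best_score, best_name = max(scored, key=lambda pair: pair[0])
        matches.append((name, best_name, best_score))
    return matches


wrong_names = ['Alberto Steve', 'David Lee']
correct_names = ['Alberto Lee Steve', 'David Steve Lee']
for wrong, best, score in match_names_blocked(wrong_names, correct_names):
    print(wrong, '->', best, round(score, 1))
```

With names spread across many starting letters, each wrong name is compared against only a fraction of the 39,000 candidates, so the number of scorer calls drops accordingly. The actual saving depends on how evenly the first letters are distributed.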
