
I am using fuzzywuzzy in Python to match people's names between two lists. However, the runtime is too long: one list contains 25,000 names and the other contains 39,000. The job has been running for 20 hours now.

Previously, I used the same code to match two lists of 6,000 and 3,000 names, and the runtime was 1 hour. Extrapolating from that, my current job would take more than 50 hours.
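
As a rough check of that estimate, here is the arithmetic, assuming the cost is proportional to the number of name pairs that get scored (every wrong name is compared against every correct name). The variable names are only for illustration.

```python
# Rough runtime extrapolation, assuming cost grows with the number of
# (wrong_name, correct_name) pairs that have to be scored.
small_pairs = 6_000 * 3_000      # earlier job: 18,000,000 pairs in about 1 hour
large_pairs = 25_000 * 39_000    # current job: 975,000,000 pairs

hours_per_pair = 1.0 / small_pairs
estimated_hours = large_pairs * hours_per_pair

print(f"{large_pairs / small_pairs:.1f}x more pairs")
print(f"estimated runtime: {estimated_hours:.1f} hours")
```

Under that assumption the current job is about 54 times larger than the earlier one, which is where the "more than 50 hours" figure comes from.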

Below is my code:

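import pandas as pd
from fuzzywuzzy import fuzz, process

def match_names(wrong_names, correct_names):
    # Keep the result lists local so repeated calls do not accumulate results.
    names_array = []
    ratio_array = []
    for row in wrong_names:
        # extractOne returns the best (match, score) pair among correct_names.
        x = process.extractOne(row, correct_names, scorer=fuzz.token_set_ratio)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array

# Names to be corrected; rows with a missing name are dropped.
df = pd.read_csv("wrong-country-names.csv", encoding="ISO-8859-1")
wrong_names = df['name'].dropna().values

# Reference list of correct names.
choices_df = pd.read_csv("country-names.csv", encoding="ISO-8859-1")
correct_names = choices_df['name'].values

name_match, ratio_match = match_names(wrong_names, correct_names)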

I chose fuzz.token_set_ratio as a scorer to perform this many-to-many match based on the data I have.

Below is the sample data:

wrong_names = ['Alberto Steve', 'David Lee']
correct_names = ['Alberto Lee Steve', 'David Steve Lee'] 

Basically, the wrong names list does not contain middle names. To ignore the missing middle names and still get a reasonable match, I chose fuzz.token_set_ratio.
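
To illustrate why a token-set comparison tolerates a missing middle name, here is a simplified sketch of the idea using only the standard library. It is not fuzzywuzzy's actual implementation: `simple_ratio` and `token_set_score` are hypothetical helpers, and `difflib.SequenceMatcher` stands in for the real similarity ratio.

```python
from difflib import SequenceMatcher


def simple_ratio(a: str, b: str) -> float:
    """Stand-in similarity on a 0-100 scale (assumption: difflib, not Levenshtein)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def token_set_score(a: str, b: str) -> float:
    """Simplified token-set comparison: shared tokens versus each side's extras."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    common = " ".join(sorted(tokens_a & tokens_b))
    only_a = " ".join(sorted(tokens_a - tokens_b))
    only_b = " ".join(sorted(tokens_b - tokens_a))
    combined_a = f"{common} {only_a}".strip()
    combined_b = f"{common} {only_b}".strip()
    # Take the best of the three pairings, so extra tokens on one side
    # (such as a middle name) do not lower the score.
    return max(
        simple_ratio(common, combined_a),
        simple_ratio(common, combined_b),
        simple_ratio(combined_a, combined_b),
    )


good = token_set_score("Alberto Steve", "Alberto Lee Steve")
bad = token_set_score("Alberto Steve", "David Steve Lee")
print(good, bad)
```

With the sample data, 'Alberto Steve' scores 100.0 against 'Alberto Lee Steve' because every token of the shorter name appears in the longer one, while it scores lower against 'David Steve Lee', which shares only one token.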

While researching online, I found a suggestion to install the python-Levenshtein package, which is said to speed up the runtime by 4 to 10 times. However, my job has been running for more than 20 hours now and I don't want to interrupt it, so I will try that afterwards.

I am wondering if there are other options to improve this.

Thanks in advance.

MMAASS
  • Did you discover any other performance improvements? – Life is complex Jul 19 '19 at 16:12
  • Have you found some speed up trick? – A2N15 Jan 30 '20 at 08:32
  • What about levenshtein distance? – amirouche Sep 11 '20 at 10:26
  • Can you make some assumption, like the first letters must match because it's always FIRST (MIDDLE) LAST? Then filter the second input so that you are only matching, let's say, 'Alberto Steve' against names starting with A. Alberto Steve is probably not Walter White, so why try matching all of the Ws? I'm not familiar with fuzzywuzzy, this was just a random thought. – Correy Koshnick Jan 25 '22 at 17:21
  • Try using `rapidfuzz` instead of `fuzzywuzzy`. rapidfuzz does some string operations before calculating the Levenshtein distance, which drastically reduces processing time. `pip install rapidfuzz` and `from rapidfuzz import process, fuzz`, then call `process.extractOne`. – Shreyesh Desai May 05 '22 at 10:35
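
The first-letter filtering suggested in the comments above could be sketched roughly as follows. This is a minimal illustration, assuming first names are reliable enough that the first character always matches. `score`, `build_blocks` and `match_with_blocking` are hypothetical helpers, and the stdlib `difflib` scorer is only a stand-in for whichever scorer is actually used (for example `fuzz.token_set_ratio`).

```python
from collections import defaultdict
from difflib import SequenceMatcher


def score(a: str, b: str) -> float:
    """Stand-in scorer on a 0-100 scale; swap in the scorer you actually use."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def build_blocks(correct_names):
    """Group the correct names by their lowercased first character."""
    blocks = defaultdict(list)
    for name in correct_names:
        if name:
            blocks[name[:1].lower()].append(name)
    return blocks


def match_with_blocking(wrong_names, correct_names):
    """Score each wrong name only against candidates sharing its first letter."""
    blocks = build_blocks(correct_names)
    results = []
    for name in wrong_names:
        candidates = blocks.get(name[:1].lower(), [])
        if not candidates:
            # No candidate shares the first letter, so report no match.
            results.append((name, None, 0.0))
            continue
        best = max(candidates, key=lambda c: score(name, c))
        results.append((name, best, score(name, best)))
    return results


wrong_names = ["Alberto Steve", "David Lee"]
correct_names = ["Alberto Lee Steve", "David Steve Lee"]
matches = match_with_blocking(wrong_names, correct_names)
for wrong, best, s in matches:
    print(wrong, "->", best)
```

The trade-off is that a name whose first character is itself misspelled will never be compared against its true match, so this only helps if that assumption holds for the data. If the names are spread across many initials, each wrong name is scored against a small fraction of the correct list instead of all of it.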
