I am using FuzzyWuzzy in Python to match people's names between two lists. However, the runtime is too long: one list contains 25,000 names and the other contains 39,000. The job has been running for 20 hours now.
Previously, I used the same code to match two lists of 6,000 and 3,000 names, and the runtime was 1 hour. Based on that, my current job would take more than 50 hours.
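The estimate above assumes the runtime grows with the number of name pairs compared (every wrong name is scored against every correct name). A rough check of that arithmetic, using only the list sizes and the 1-hour timing mentioned above:

```python
# Assumption: runtime is proportional to the number of (wrong, correct) pairs scored.
small_pairs = 6000 * 3000        # earlier job: 18,000,000 comparisons
small_hours = 1.0                # measured runtime of the earlier job

large_pairs = 25000 * 39000      # current job: 975,000,000 comparisons

# Scale the measured runtime by the ratio of pair counts.
estimated_hours = small_hours * large_pairs / small_pairs
print(f"Estimated runtime: {estimated_hours:.1f} hours")
```

This is only a back-of-the-envelope figure; name lengths and machine load are not accounted for.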
Below is my code:
import pandas as pd
from fuzzywuzzy import fuzz, process

def match_names(wrong_names, correct_names):
    # For each wrong name, keep the best-scoring correct name and its score.
    names_array = []
    ratio_array = []
    for row in wrong_names:
        x = process.extractOne(row, correct_names, scorer=fuzz.token_set_ratio)
        names_array.append(x[0])
        ratio_array.append(x[1])
    return names_array, ratio_array

df = pd.read_csv("wrong-country-names.csv", encoding="ISO-8859-1")
wrong_names = df['name'].dropna().values
choices_df = pd.read_csv("country-names.csv", encoding="ISO-8859-1")
correct_names = choices_df['name'].values
name_match, ratio_match = match_names(wrong_names, correct_names)
I chose fuzz.token_set_ratio as the scorer for this many-to-many match, based on the data I have.
Below is the sample data:
wrong_names = ['Alberto Steve', 'David Lee']
correct_names = ['Alberto Lee Steve', 'David Steve Lee']
The wrong names list does not contain middle names. To ignore that difference and still get a reasonable match, I chose fuzz.token_set_ratio.
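To illustrate why a token-set comparison ignores the missing middle name, here is a rough pure-Python sketch of the idea applied to the sample data. It uses difflib from the standard library and is only an approximation under my own assumptions, not FuzzyWuzzy's actual implementation; the helper names `simple_ratio` and `token_set_sketch` are hypothetical.

```python
from difflib import SequenceMatcher

def simple_ratio(a, b):
    # Similarity of two strings on a 0-100 scale (hypothetical helper).
    return round(100 * SequenceMatcher(None, a, b).ratio())

def token_set_sketch(s1, s2):
    # Split both names into lowercase token sets.
    t1, t2 = set(s1.lower().split()), set(s2.lower().split())
    common = " ".join(sorted(t1 & t2))   # tokens present in both names
    only1 = " ".join(sorted(t1 - t2))    # tokens only in the first name
    only2 = " ".join(sorted(t2 - t1))    # tokens only in the second name
    combined1 = (common + " " + only1).strip()
    combined2 = (common + " " + only2).strip()
    # Take the best of the three pairwise comparisons, so a name whose
    # tokens are a subset of the other name's tokens scores 100.
    return max(
        simple_ratio(common, combined1),
        simple_ratio(common, combined2),
        simple_ratio(combined1, combined2),
    )

wrong_names = ['Alberto Steve', 'David Lee']
correct_names = ['Alberto Lee Steve', 'David Steve Lee']
for wrong in wrong_names:
    for correct in correct_names:
        print(wrong, '|', correct, '|', token_set_sketch(wrong, correct))
```

In this sketch, each wrong name scores 100 against the correct name that contains all of its tokens, and lower against the other one.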
While researching online, I found a suggestion to install the python-Levenshtein package, which is said to speed up the runtime by 4-10 times. However, my job has already been running for more than 20 hours and I don't want to interrupt it, so I will try that afterwards.
I am wondering if there are other options to reduce the runtime.
Thanks in advance.