I am trying to join two dataframes using approximate (fuzzy) string matching. My general approach is:
- have a list of strings to be matched
- define a function using fuzzywuzzy's process.extract
- apply this function to every row of the 1st dataframe to get a match
- join the 1st DF with the 2nd DF on the matching key.
This is my code:
from fuzzywuzzy import fuzz, process

def closest_match(x):
    # Only compare against candidates that share the first three characters with x
    candidates = matchlist[matchlist.match_name.str.startswith(x[:3])].match_name
    matched = process.extract(x, candidates, limit=1, scorer=fuzz.token_sort_ratio)
    if matched:
        print(matched[0])
        return matched[0][0]
    else:
        return None

df1['key'] = df1.df1_name.apply(closest_match)
# merge with 2nd df
joined = df1.merge(df2, left_on='key', right_on='df2_name')
The problem is speed: applying closest_match row by row takes a very long time even for 10,000 rows, and I need to run it for about 100K matches. How can I speed this code up?
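
To make the steps listed at the top reproducible without my real data, here is a minimal self-contained sketch of the same flow on toy data. It is only an illustration under stated assumptions: the sample names and the city field are made up, plain lists and dicts stand in for df1, df2 and matchlist, and the standard library's difflib stands in for fuzz.token_sort_ratio (so the scores will not be identical to fuzzywuzzy's).

```python
from difflib import SequenceMatcher

# Hypothetical sample data standing in for df1.df1_name and for df2 keyed by df2_name.
df1_names = ["acme corp", "globex inc", "initech llc", "umbrella co"]
df2_rows = {
    "acme corporation": {"city": "Springfield"},
    "globex incorporated": {"city": "Cypress Creek"},
    "initech": {"city": "Austin"},
}


def token_sort_score(a, b):
    """Rough stand-in for fuzz.token_sort_ratio: sort the tokens, then compare (0-100)."""
    a_sorted = " ".join(sorted(a.lower().split()))
    b_sorted = " ".join(sorted(b.lower().split()))
    return round(100 * SequenceMatcher(None, a_sorted, b_sorted).ratio())


def closest_match(name, candidates):
    """Return the best-scoring candidate that shares the first three characters, or None."""
    pool = [c for c in candidates if c.startswith(name[:3])]
    if not pool:
        return None
    return max(pool, key=lambda c: token_sort_score(name, c))


# Find a key for every name in the first table (iterating a dict yields its keys).
keys = {name: closest_match(name, df2_rows) for name in df1_names}

# Inner join on that key: names without a match are dropped, as with the merge.
joined = [
    {"df1_name": name, "key": key, **df2_rows[key]}
    for name, key in keys.items()
    if key is not None
]
```

In this sketch every call to closest_match rescans the whole candidate list to apply the prefix filter, which mirrors what my real function does with matchlist on each row.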