
I am basically trying to join 2 dataframes using an approximate string match. My general approach is listed below:

  • have the list of strings to be matched
  • define a function using fuzzywuzzy's process.extract
  • apply this function across all rows in the 1st dataframe to get a match
  • join 1st DF with the 2nd DF based on matching key.

This is my code:

from fuzzywuzzy import fuzz, process

def closest_match(x):
    # Keep only candidates sharing the same 3-character prefix, then take
    # the single best match by token_sort_ratio.
    candidates = matchlist[matchlist.match_name.str.startswith(x[:3])].match_name
    matched = process.extract(x, candidates, limit=1, scorer=fuzz.token_sort_ratio)
    if matched:
        print(matched[0])
        return matched[0][0]
    else:
        return None


df1['key'] = df1.df1_name.apply(closest_match)
# merge with 2nd df
joined = df1.merge(df2, left_on='key', right_on='df2_name')
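
For reference, the code above assumes data roughly like this (the column names follow the question; the sample values are purely illustrative):

import pandas as pd

# Illustrative setup only -- the real data comes from the asker's dataframes.
df1 = pd.DataFrame({'df1_name': ['Acme Corp Ltd', 'Globex Inc']})
df2 = pd.DataFrame({'df2_name': ['Acme Corporation Ltd', 'Globex Incorporated']})

# matchlist holds the candidate names to match against (here, taken from df2).
matchlist = pd.DataFrame({'match_name': df2.df2_name})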

The problem here is speed. This code takes a very long time for 10,000 iterations, and I need it to run for 100K matches. How can I speed this code up?

addicted
  • Have you tried nested for loops? For me they work a bit faster on larger record sets. – Rahul Agarwal Oct 04 '18 at 08:27
  • @RahulAgarwal you mean using a `for` loop instead of `apply`? I have not tried that because I expect a for loop will take longer. I am looking for a `vectorization`-style method. – addicted Oct 10 '18 at 10:15
  • See this link: https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column. If it is still slow, try nested for loops; for me they were faster. – Rahul Agarwal Oct 10 '18 at 10:47
  • @RahulAgarwal let me try that first. – addicted Oct 10 '18 at 11:10
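
Following up on the vectorization link in the comments, here is one possible speed-up as a sketch (untested against the asker's data): the per-row `startswith` filter rescans `matchlist` on every call, so grouping the candidate names by their 3-character prefix once up front avoids that repeated work while keeping the same scorer. `closest_match_fast` and `candidates_by_prefix` are hypothetical names introduced here, not from the question.

from fuzzywuzzy import fuzz, process

# Build one candidate list per 3-character prefix, computed once instead of
# re-filtering matchlist inside every call. Assumes names are at least
# 3 characters long.
candidates_by_prefix = {
    prefix: grp.match_name.tolist()
    for prefix, grp in matchlist.groupby(matchlist.match_name.str[:3])
}

def closest_match_fast(x):
    candidates = candidates_by_prefix.get(x[:3])
    if not candidates:
        return None
    # extractOne returns the best (choice, score) pair, or None if nothing scores.
    matched = process.extractOne(x, candidates, scorer=fuzz.token_sort_ratio)
    return matched[0] if matched else None

df1['key'] = df1.df1_name.apply(closest_match_fast)

Another commonly cited option is swapping fuzzywuzzy for the largely API-compatible rapidfuzz package, whose scorers are implemented in C++.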
