
I am basically trying to join 2 dataframes using an approximate string match. My general approach is listed below:

  • have the list of strings to be matched
  • define a function using fuzzywuzzy's process.extract
  • apply this function across all rows in the 1st dataframe to get a match
  • join 1st DF with the 2nd DF based on matching key.

This is my code:

from fuzzywuzzy import fuzz, process

def closest_match(x):
    # Keep only candidates sharing the same 3-character prefix, then take
    # the single best match by token_sort_ratio.
    candidates = matchlist[matchlist.match_name.str.startswith(x[:3])].match_name
    matched = process.extract(x, candidates, limit=1, scorer=fuzz.token_sort_ratio)
    if matched:
        print(matched[0])
        return matched[0][0]
    else:
        return None


df1['key'] = df1.df1_name.apply(closest_match)
# merge with 2nd df
joined = df1.merge(df2, left_on='key', right_on='df2_name')
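
For reference, the code above assumes data roughly like this (the column names follow the question; the sample values are purely illustrative):

import pandas as pd

# Illustrative setup only -- the real data comes from the asker's dataframes.
df1 = pd.DataFrame({'df1_name': ['Acme Corp Ltd', 'Globex Inc']})
df2 = pd.DataFrame({'df2_name': ['Acme Corporation Ltd', 'Globex Incorporated']})

# matchlist holds the candidate names to match against (here, taken from df2).
matchlist = pd.DataFrame({'match_name': df2.df2_name})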

The problem here is speed. This code takes a very long time for 10,000 iterations, and I need it to run for 100K matches. How can I speed this code up?

addicted
  • Have you tried nested for loops? For me they work a bit faster on larger record sets. – Rahul Agarwal Oct 04 '18 at 08:27
  • @RahulAgarwal you mean using a `for` loop instead of `apply`? I have not tried that because I expect a for loop will take longer. I am looking for a `vectorization`-style method. – addicted Oct 10 '18 at 10:15
  • See this link: https://stackoverflow.com/questions/52631291/vectorizing-or-speeding-up-fuzzywuzzy-string-matching-on-pandas-column. If it is still slow, try nested for loops; for me they were faster. – Rahul Agarwal Oct 10 '18 at 10:47
  • @RahulAgarwal let me try that first. – addicted Oct 10 '18 at 11:10
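
Following up on the vectorization link in the comments, here is one possible speed-up as a sketch (untested against the asker's data): the per-row `startswith` filter rescans `matchlist` on every call, so grouping the candidate names by their 3-character prefix once up front avoids that repeated work while keeping the same scorer. `closest_match_fast` and `candidates_by_prefix` are hypothetical names introduced here, not from the question.

from fuzzywuzzy import fuzz, process

# Build one candidate list per 3-character prefix, computed once instead of
# re-filtering matchlist inside every call. Assumes names are at least
# 3 characters long.
candidates_by_prefix = {
    prefix: grp.match_name.tolist()
    for prefix, grp in matchlist.groupby(matchlist.match_name.str[:3])
}

def closest_match_fast(x):
    candidates = candidates_by_prefix.get(x[:3])
    if not candidates:
        return None
    # extractOne returns the best (choice, score) pair, or None if nothing scores.
    matched = process.extractOne(x, candidates, scorer=fuzz.token_sort_ratio)
    return matched[0] if matched else None

df1['key'] = df1.df1_name.apply(closest_match_fast)

Another commonly cited option is swapping fuzzywuzzy for the largely API-compatible rapidfuzz package, whose scorers are implemented in C++.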
