I am trying to join a couple of datasets by fuzzy matching with the fuzzywuzzy package. The function is taken from this question:
Is it possible to do fuzzy match merge with python pandas?
Here is the code that I have:
import pandas as pd
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
blanks = pd.read_csv("names_blank_type.csv")
mapping = pd.read_csv("TYPE-MAP.csv")
blanks = pd.DataFrame(blanks)
blanks = blanks.drop(blanks.columns[[0,1]], axis=1)
mapping = pd.DataFrame(mapping)
mapping = mapping.drop(mapping.columns[[2]], axis=1)
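# Take explicit copies so that adding a column later does not trigger SettingWithCopyWarning
blanks_smaller = blanks.head(100).copy()
mapping_smaller = mapping.head(100).copy()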
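def fuzzy_merge(df_1, df_2, key1, key2, threshold=90, limit=2):
    # Candidate strings to match against
    s = df_2[key2].tolist()
    # For each value in df_1[key1], get the top `limit` (match, score) pairs
    m = df_1[key1].apply(lambda x: process.extract(x, s, limit=limit))
    df_1['matches'] = m
    # Keep only the matches scoring at or above the threshold, joined into one string
    m2 = df_1['matches'].apply(lambda x: ', '.join([i[0] for i in x if i[1] >= threshold]))
    df_1['matches'] = m2
    return df_1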
fuzzy_merge(mapping_smaller, blanks_smaller, 'company_name', 'company_name', threshold=80)
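Since the CSV files are not attached, here is a minimal sketch of the same matching step on plain lists that can be run anywhere. It is only an approximation: it swaps in `difflib.SequenceMatcher` from the standard library for fuzzywuzzy's scorer, so the scores will differ from what `process.extract` returns. The helper names `score` and `top_matches` and the sample company names are made up for illustration.

```python
from difflib import SequenceMatcher


def score(a, b):
    # Stand-in scorer on a 0-100 scale (NOT fuzzywuzzy's scorer)
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def top_matches(query, choices, threshold=80, limit=2):
    # Score the query against every candidate, keep the best `limit`,
    # then drop anything below the threshold
    scored = sorted(((c, score(query, c)) for c in choices),
                    key=lambda pair: pair[1], reverse=True)
    return ', '.join(c for c, s in scored[:limit] if s >= threshold)


# Hypothetical sample data standing in for the two company_name columns
mapping_names = ["Acme Corp", "Globex Inc"]
blank_names = ["Acme Corp.", "ACME Corporation", "Initech"]

matches = [top_matches(name, blank_names) for name in mapping_names]
print(matches)
```

As in fuzzy_merge, every name on the left is scored against every name on the right, and names with no match above the threshold end up with an empty string.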
The fuzzy_merge function actually works if I'm just using mapping_smaller and blanks_smaller, but if I try to use the full dataset then I get the following error:
The full blanks dataset has just under 350,000 rows and the full mapping set has just under 100,000 rows. Any thoughts on why it works with the small subsets but fails with the above error on the full data?
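For a sense of scale, here is a rough count of the work the full run implies. It assumes that `process.extract` scores each query string against every candidate in the list, and it uses the rounded row counts above as upper bounds.

```python
# Rounded upper bounds taken from the row counts quoted above
mapping_rows = 100_000   # rows in df_1 (one query string each)
blanks_rows = 350_000    # rows in df_2 (candidates scored per query)

# One score per (query, candidate) pair
comparisons = mapping_rows * blanks_rows
print(f"{comparisons:,} string comparisons")
```

The 100-row subsets need only 100 * 100 = 10,000 comparisons, so the full run is several million times larger.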
Thanks!!