I have two dataframes df1 and df2.
df1 = pd.DataFramE({'Name': ['Zebra system','Lion healthcare'], 'Type': ['S','A']})
df2 = pd.DataFrame({'AltName': ['Zebra system llc','abra inc. 54','Lions corp health care','Zebra sys co','lions system atl'], 'Adr': ['45 main st','23 zoo ave', '12 zoo blvd.','56 veg st','23 peach st']})
Example above. df2 has about 300k records and df1 has about 10k records. I want to match df1 Name with df2 AltName and get a new dataframe with df1 and potential matched rows from df2 along with the score. I want to be able to have a score threshold that i can adjust, for instance add matches from df2 that are above 80.
This is what I have right now:
matched=pd.DataFrame({'df1-name':[],'df1-type':[],'df2-name':[],'df2-adr':[]})
for row in df1.index:
first = df1.loc[row,"Name"]
type1 = df1.loc[row,"Type"]
for row2 in df2.index:
second = df2.loc[row2,"AltName"]
adr1 = df2.loc[row2,"Adr"]
matched_token = fuzz.partial_ratio(first,second)
if matched_token>60:
matched.loc[row2,"df1-name"]=first
matched.loc[row2,"df1-type"]=type1
matched.loc[row2,"df2-name"]=second
matched.loc[row2,"df2-adr"]=adr1
however this is very slow, code has been running for over 16hrs and going...