******Edited with Solution Below*******
I have carefully read the guidelines, hope the question is acceptable.
I have two pandas dataframes, I need to apply a fuzzy matching function on the target and reference columns and merge the data based on the similarity score preserving the original data.
i have checked similar questions, e.g. see:
is it possible to do fuzzy match merge with python pandas?
but I am not able to use this solution.
So far I have:
df1 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'Type': ['Person','Person','Person'],'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
import distance
fulldf=[]
for name1 in df1['RefName']:
for name2 in df2['TarName']:
if distance.jaccard(name1, name2)<0.6:
fulldf.append({'RefName':name1 ,'Score':distance.jaccard(name1, name2),'TarName':name2 })
pd_fulldf= pd.DataFrame(fulldf)
How can I include the 'NameId' and 'Type' (and eventual other columns) in the final output e.g.:
df1_NameId RefName df1_Type df1_NewColumn Score df2_NameId TarName df2_Type df2_NewColumn 1 robert johnes Person … 0.0000 1 roberto johnes Person …
Is there a way to code this so that is easily scalable, and can be performed on datasets with hundred thousands of rows?
I have solved the original problem by unpacking the dataframes in the loop:
import distance
import pandas as pd
#Create test Dataframes
df1 = pd.DataFrame({'NameId': [1,2,3], 'RefName': ['robert johnes','lew malinsky','gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1,2,3], 'TarName': ['roberto johnes','lew malinosky','andreatta della blatta']})
results=[]
#Create two generators objects to loop through each dataframe row one at the time
#Call each dataframe element that you want to have in the final output in the loop
#Append results to the empty list you created
for a,b,c in df1.itertuples():
for d,e,f in df2.itertuples():
results.append((a,b,c,distance.jaccard(c, f),e,d,f))
result_df=pd.DataFrame(results)
print(result_df)