
*** Edited with solution below ***

I have carefully read the guidelines and hope the question is acceptable.

I have two pandas DataFrames. I need to apply a fuzzy-matching function to the target and reference columns and merge the data based on the similarity score, preserving the original data.

I have checked similar questions, e.g. "is it possible to do fuzzy match merge with python pandas?", but I am not able to use that solution. So far I have:

import distance
import pandas as pd

df1 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person', 'Person', 'Person'], 'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'Type': ['Person', 'Person', 'Person'], 'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

fulldf = []
for name1 in df1['RefName']:
    for name2 in df2['TarName']:
        score = distance.jaccard(name1, name2)  # compute the score once instead of twice
        if score < 0.6:  # keep only sufficiently similar pairs
            fulldf.append({'RefName': name1, 'Score': score, 'TarName': name2})
pd_fulldf = pd.DataFrame(fulldf)
  1. How can I include 'NameId' and 'Type' (and any other columns) in the final output, e.g.:

        df1_NameId  RefName        df1_Type  df1_NewColumn  Score   df2_NameId  TarName         df2_Type  df2_NewColumn
        1           robert johnes  Person    …              0.0000  1           roberto johnes  Person    …

  2. Is there a way to code this so that it is easily scalable and can be performed on datasets with hundreds of thousands of rows?


I have solved the original problem by unpacking the DataFrame rows inside the loop:

import distance
import pandas as pd

# Create test DataFrames
df1 = pd.DataFrame({'NameId': [1, 2, 3], 'RefName': ['robert johnes', 'lew malinsky', 'gioberto delle lanterne']})
df2 = pd.DataFrame({'NameId': [1, 2, 3], 'TarName': ['roberto johnes', 'lew malinosky', 'andreatta della blatta']})

results = []
# itertuples() yields one row at a time as (index, NameId, name), so every
# DataFrame element needed in the final output can be unpacked in the loop
# and appended to the results list together with the similarity score
for a, b, c in df1.itertuples():
    for d, e, f in df2.itertuples():
        results.append((a, b, c, distance.jaccard(c, f), e, d, f))

result_df = pd.DataFrame(results)
print(result_df)
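
For readability, here is a variant of the same loop (my own sketch, not part of the original solution; the column labels are illustrative choices) that restores the 0.6 threshold from the question and names the output columns:

results = []
for r in df1.itertuples():  # r.Index, r.NameId, r.RefName
    for t in df2.itertuples():  # t.Index, t.NameId, t.TarName
        score = distance.jaccard(r.RefName, t.TarName)
        if score < 0.6:  # same threshold as in the question
            results.append((r.NameId, r.RefName, score, t.NameId, t.TarName))

result_df = pd.DataFrame(results, columns=['df1_NameId', 'RefName', 'Score', 'df2_NameId', 'TarName'])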

1 Answer


I believe what you need is the Cartesian product of TarName and RefName. Applying the distance function to that product gives the result you require.

df1["mergekey"] = 0
df2["mergekey"] = 0
df_merged = pd.merge(df1, df2, on = "mergekey")
df_merged["Distance"] = df_merged.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis = 1)

Result:

   NameId_x                  RefName  Type_x  mergekey  NameId_y                 TarName  Type_y  Distance
0         1            robert johnes  Person         0         1          roberto johnes  Person  0.000000
1         1            robert johnes  Person         0         2           lew malinosky  Person  0.705882
2         1            robert johnes  Person         0         3  andreatta della blatta  Person  0.538462
3         2             lew malinsky  Person         0         1          roberto johnes  Person  0.764706
4         2             lew malinsky  Person         0         2           lew malinosky  Person  0.083333
5         2             lew malinsky  Person         0         3  andreatta della blatta  Person  0.666667
6         3  gioberto delle lanterne  Person         0         1          roberto johnes  Person  0.533333
7         3  gioberto delle lanterne  Person         0         2           lew malinosky  Person  0.588235
8         3  gioberto delle lanterne  Person         0         3  andreatta della blatta  Person  0.250000
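
If memory is a concern (see the comments below), one possible refinement (my sketch, not part of the answer) is to build the Cartesian product in chunks and keep only the rows under the question's 0.6 threshold before concatenating:

# Process df1 in chunks so only one block of the cross product is
# materialized at a time; the chunk size of 1000 is an arbitrary choice
chunks = []
for start in range(0, len(df1), 1000):
    block = pd.merge(df1.iloc[start:start + 1000], df2, on="mergekey")
    block["Distance"] = block.apply(lambda x: distance.jaccard(x.RefName, x.TarName), axis=1)
    chunks.append(block[block["Distance"] < 0.6])
result = pd.concat(chunks, ignore_index=True)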
  • Hi @TurtleIzzy, thanks a lot. I have a doubt: the code, with two relatively small lists, produced a memory error. The 0.6 threshold I used (the if statement in my code) gives me a way to reduce the size of the final DF, whereas here, by merging first, it creates a DF holding data that will need to be removed later. Is there a way to store as little information as possible, or at least to split the process into smaller chunks, as I work with huge lists? – sdarg Oct 26 '16 at 14:14
  • Since for-loops in Python are far slower than the native C implementations behind `DataFrame` methods, I generated and saved the intermediate result so as to use the native `apply` method for better time performance at the expense of space. If your DataFrame is so large that memory cannot hold the Cartesian product, I would suggest using the `numba` package to compile your for-loop into C-speed loops. Since `numba` supports only native objects, plus a few `numpy` objects, at best performance, you might need to implement `distance.jaccard` yourself. – TurtleIzzy Oct 26 '16 at 15:00
  • Hi @TurtleIzzy, thanks again. I have created a for loop with df.itertuples() that does the job (added to the original post). I am running the code now and it is indeed helpful. I have upvoted your first reply as it solves the problem for smaller datasets, but due to my low reputation it does not show. I will try numba to make the code quicker, as you suggested. – sdarg Oct 28 '16 at 10:40
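
For illustration, a minimal sketch of the numba route suggested above (my own example, not from the thread; `jaccard_codes` and `encode` are hypothetical helpers, and names are assumed to be pre-encoded as sorted arrays of unique character codes):

import numpy as np
from numba import njit

@njit
def jaccard_codes(a, b):
    # Jaccard distance over character sets; a and b are sorted arrays of
    # unique character codes, so set intersection becomes a merge scan
    i = 0
    j = 0
    inter = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            inter += 1
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    union = len(a) + len(b) - inter
    return 1.0 - inter / union if union > 0 else 0.0

def encode(name):
    # Pre-encode outside numba: string -> sorted array of unique uint32 codes
    return np.unique(np.array([ord(ch) for ch in name], dtype=np.uint32))

# e.g. jaccard_codes(encode('robert johnes'), encode('roberto johnes'))
# should agree with distance.jaccard on the same pair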