7

I have two dataframe df1 and df2.

df1 = pd.DataFrame ({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                     'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame ({'Name': ['adam Smith', 'Annie Kim', 'John  Weber', 'Ian Ford'],
                     'gender': ['M', 'F', 'M', 'M']})

I need to join these two dataframe with pandas.merge on the column Name. However, as you notice, there are some slight difference between column Name from the two dataframe. Let's assume they are the same person. If I simply do:

pd.merge(df1, df2, how='inner', on='Name')

I only got a dataframe back with only one row, which is 'Ian Ford'.

Does anyone know how to merge these two dataframe ? I guess this is pretty common situation if we join two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.

zesla
  • 11,155
  • 16
  • 82
  • 147

2 Answers2

11

I am using fuzzywuzzy here

from fuzzywuzzy import fuzz
from fuzzywuzzy import process



df2['key']=df2.Name.apply(lambda x : [process.extract(x, df1.Name, limit=1)][0][0][0])

df2.merge(df1,left_on='key',right_on='Name')
Out[1238]: 
        Name_x gender         key  Age      Name_y
0   adam Smith      M  Adam Smith   43  Adam Smith
1    Annie Kim      F    Anne Kim   21    Anne Kim
2  John  Weber      M  John Weber   55  John Weber
3     Ian Ford      M    Ian Ford   24    Ian Ford
BENY
  • 317,841
  • 20
  • 164
  • 234
  • Thank you ! this is what I need. what does 'process.extract(x, df1.Name, limit=1)][0][0][0]' do here? – zesla Mar 05 '18 at 22:30
  • @zesla getting the first matching Item , and that result is odd, it is tuple of tuple, that is why we need [0][0][0] here .. do not worry about it – BENY Mar 05 '18 at 22:35
  • a follow-u question, for process.extract, there used to be a argument score_cutoff, which I think is pretty useful. It seems that it's not there anymore. Do you know how should I define the score cutoff? Thank you so much. @Wen – zesla Mar 06 '18 at 04:58
  • @zesla [score for found, score, matchrow in process.extract(x, df1.Name, limit=1)], you can find the score, that is why We need [0][0][0], cause the function return multiple result :-) – BENY Mar 06 '18 at 05:02
  • What I mean was the argument score_cutoff in the process.extract function. It used to be there. I googled and found some of its usage. Basically if the score is less than the cutoff, it returns nan. However, when I check the function argument, it seems that it's not there any more....@Wen – zesla Mar 06 '18 at 05:07
  • @zesla I found this https://stackoverflow.com/questions/32055817/python-fuzzy-matching-fuzzywuzzy-keep-only-best-match, you can setting up your own limit score – BENY Mar 06 '18 at 05:10
  • thank you very much for the link. I think it's soooo weird that there is no such argument anymore, isn't it? Just don't understand why... do we miss anything? @Wen – zesla Mar 06 '18 at 05:14
  • I figure it out. process.extractOne has score_cutoff, not process.extract... Sorry about the confusion. @Wen – zesla Mar 06 '18 at 05:16
  • So I think in this case,(or in general), it's probably better to use process.extractOne and also set a score_cutoff, so that it does not always produce a match, if two strings are too different.., what do you think? @Wen – zesla Mar 06 '18 at 05:22
  • This is incredible and saved me hours of manual work. Thank you very much. – campo Dec 05 '19 at 21:43
  • @campo yw :-) happy coding – BENY Dec 05 '19 at 22:00
0

Not sure if fuzzy match is what you are looking for. Maybe make every name a proper name?

df1.Name = df1.Name.apply(lambda x: x.title())
df2.Name = df2.Name.apply(lambda x: x.title())

pd.merge(df1, df2, how='inner', on='Name')
A.Kot
  • 7,615
  • 2
  • 22
  • 24