how to 'fuzzy' match strings when merge two dataframe in pandas

Question

I have two dataframe df1 and df2.

df1 = pd.DataFrame ({'Name': ['Adam Smith', 'Anne Kim', 'John Weber', 'Ian Ford'],
                     'Age': [43, 21, 55, 24]})
df2 = pd.DataFrame ({'Name': ['adam Smith', 'Annie Kim', 'John  Weber', 'Ian Ford'],
                     'gender': ['M', 'F', 'M', 'M']})

I need to join these two dataframe with pandas.merge on the column Name. However, as you notice, there are some slight difference between column Name from the two dataframe. Let's assume they are the same person. If I simply do:

pd.merge(df1, df2, how='inner', on='Name')

I only got a dataframe back with only one row, which is 'Ian Ford'.

Does anyone know how to merge these two dataframe ? I guess this is pretty common situation if we join two tables on a string column. I have absolutely no idea how to handle this. Thanks a lot in advance.

score 11 · Accepted Answer · answered Mar 05 '18 at 22:18

11

I am using fuzzywuzzy here

from fuzzywuzzy import fuzz
from fuzzywuzzy import process



df2['key']=df2.Name.apply(lambda x : [process.extract(x, df1.Name, limit=1)][0][0][0])

df2.merge(df1,left_on='key',right_on='Name')
Out[1238]: 
        Name_x gender         key  Age      Name_y
0   adam Smith      M  Adam Smith   43  Adam Smith
1    Annie Kim      F    Anne Kim   21    Anne Kim
2  John  Weber      M  John Weber   55  John Weber
3     Ian Ford      M    Ian Ford   24    Ian Ford

answered Mar 05 '18 at 22:18

BENY

317,841
20
164
234

Thank you ! this is what I need. what does 'process.extract(x, df1.Name, limit=1)][0][0][0]' do here? – zesla Mar 05 '18 at 22:30
@zesla getting the first matching Item , and that result is odd, it is tuple of tuple, that is why we need [0][0][0] here .. do not worry about it – BENY Mar 05 '18 at 22:35
a follow-u question, for process.extract, there used to be a argument score_cutoff, which I think is pretty useful. It seems that it's not there anymore. Do you know how should I define the score cutoff? Thank you so much. @Wen – zesla Mar 06 '18 at 04:58
@zesla [score for found, score, matchrow in process.extract(x, df1.Name, limit=1)], you can find the score, that is why We need [0][0][0], cause the function return multiple result :-) – BENY Mar 06 '18 at 05:02
What I mean was the argument score_cutoff in the process.extract function. It used to be there. I googled and found some of its usage. Basically if the score is less than the cutoff, it returns nan. However, when I check the function argument, it seems that it's not there any more....@Wen – zesla Mar 06 '18 at 05:07
@zesla I found this https://stackoverflow.com/questions/32055817/python-fuzzy-matching-fuzzywuzzy-keep-only-best-match, you can setting up your own limit score – BENY Mar 06 '18 at 05:10
thank you very much for the link. I think it's soooo weird that there is no such argument anymore, isn't it? Just don't understand why... do we miss anything? @Wen – zesla Mar 06 '18 at 05:14
I figure it out. process.extractOne has score_cutoff, not process.extract... Sorry about the confusion. @Wen – zesla Mar 06 '18 at 05:16
So I think in this case,(or in general), it's probably better to use process.extractOne and also set a score_cutoff, so that it does not always produce a match, if two strings are too different.., what do you think? @Wen – zesla Mar 06 '18 at 05:22
This is incredible and saved me hours of manual work. Thank you very much. – campo Dec 05 '19 at 21:43
@campo yw :-) happy coding – BENY Dec 05 '19 at 22:00

score 0 · Answer 2 · answered Mar 05 '18 at 22:20

0

Not sure if fuzzy match is what you are looking for. Maybe make every name a proper name?

df1.Name = df1.Name.apply(lambda x: x.title())
df2.Name = df2.Name.apply(lambda x: x.title())

pd.merge(df1, df2, how='inner', on='Name')

answered Mar 05 '18 at 22:20

A.Kot

7,615
2
22
24

1

what about 'Anne' vs 'Annie', 'John Weber' vs 'John Weber'? – zesla Mar 05 '18 at 22:28
How can you assume that Anne is the same as Annie? – A.Kot Mar 06 '18 at 14:52

how to 'fuzzy' match strings when merge two dataframe in pandas

2 Answers2

Linked

Related