How to merge on approximate strings?

Question

I would like to merge 2 Data Frames on approximate country name with merge as of but i am getting the following error:

TypeError:'NoneType' object is not callable

Please see the illustrative code below:

cl =  {'Country' : ["Brazil", "US", "Russia"], 'BL?':['No', 'No','Yes']}
clist = pd.DataFrame.from_dict(cl)

cd = {'Country' : ["Braizl", "us", "Rusia"]}
cdata  = pd.DataFrame.from_dict(cd)

clist = clist.sort_values('Country')
cdata = cdata.sort_values('Country')


cdata = pd.merge_asof(cdata,clist,on='Country')

The expected result would merge the two dfs and the cdata df would have the 'BL?' column with YES/NO values.

Thank you in advance!

`inplace=True` makes sort_values returns `None`, because you are specifying the method to do it inplace — Dani Mesejo, Feb 02 '19 at 15:01
True, I get a nonetype object is not callable now after removing the inplace — TPguru, Feb 02 '19 at 15:03
I think `merge_asof` only works with integers. If you wan't to find the closest a string is to another you could use the levenshtein distance, but I am not aware of any implementation of it in the Pandas library. — ldoe, Feb 02 '19 at 15:10
see this: https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas — Dani Mesejo, Feb 02 '19 at 15:22

It_is_Chris · Accepted Answer · 2019-02-02T16:39:32.637

This should get you close but it will not be 100% accurate. You can use fuzzywuzzy. fuzzywuzzy uses Levenshtein distance to calculate the difference between two strings:

from fuzzywuzzy import process

# create a choice list
choices = clist['Country'].values.tolist()

# apply fuzzywuzzy to each row using lambda expression
cdata['Close Country'] = cdata['Country'].apply(lambda x: process.extractOne(x, choices)[0])

# merge
cdata.merge(clist, left_on='Close Country', right_on='Country')


  Country_x Close Country Country_y  BL?
0    Braizl        Brazil    Brazil   No
1     Rusia        Russia    Russia  Yes
2        us            US        US   No

You can even return the percent match and only keep values > n if you want to only keep matches say greater than 85%

add percent match

from fuzzywuzzy import process

# create a choice list
choices = clist['Country'].values.tolist()

# apply fuzzywuzzy to each row using lambda expression
cdata['Close Country'] = cdata['Country'].apply(lambda x: process.extractOne(x, choices))

# add percent match wiht apply
cdata[['Close Country', 'Percent Match']] = cdata['Close Country'].apply(pd.Series)

# merge
cdata.merge(clist, left_on='Close Country', right_on='Country')

  Country_x Close Country  Percent Match Country_y  BL?
0    Braizl        Brazil             83    Brazil   No
1     Rusia        Russia             91    Russia  Yes
2        us            US            100        US   No

you can do boolean indexing either before the merge to remove bad match then merge:

cdata[['Close Country', 'Percent Match']] = cdata['Close Country'].apply(pd.Series)
cdata = cdata[cdata['Percent Match']>85]

or you can do it after the merge:

merge = cdata.merge(clist, left_on='Close Country', right_on='Country')
merge[merge['Percent Match'] > 85]

fuzzywuzzy returns the percent match as part of the process function. In the first example I removed it by calling the first element of the tuple: process.extractOne(x, choices)[0]

Try replacing ```from fuzzywuzzy import process``` to ```from rapidfuzz import process```. You would see drastic execution time improvements with same results — Shreyesh Desai, Apr 13 '21 at 15:52

score 1 · Answer 2 · answered Feb 02 '19 at 15:41

Given your example I came to a solution. This is not very pythonic but it works! (assuming you have a matching country name in clist for every cdata mispelled country)

def get_closest(x, column):
    tmp = 1000
    for i2, r2 in clist.iterrows():
        levenshtein = editdistance.eval(x,r2['Country'])
        if levenshtein <= tmp:
            tmp = levenshtein
            res = r2

    return res['BL?']

cdata['BL'] = cdata['Country'].apply(lambda x: get_closest(x, clist))

Output :

   Country   BL
0  Braizl   No
1      us   No
2   Rusia  Yes

I am using the editdistance library to compute the levenshtein distance. You can install it with pip :

pip install editdistance

Well there should be two. Can you post what your code looks like ? — ldoe, Feb 02 '19 at 16:01

How to merge on approximate strings?

2 Answers2

add percent match

Linked