4

I have 2 dataframes of restaurant information to merge.

df1 = pd.DataFrame ({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple','apple1'],
                     'Postal Code': [12345, 12345, 54321, 54321,1111]})
df2 = pd.DataFrame ({'Restaurant_Name': ['apple', 'apple', 'Banana'],
                     'Postal Code': [12345, 54321, 12345],
                    'Phone':[100,200,300]})

d1:
df1


df2:
df2

  • Each restaurant has a postal code (not unique, 2 restaurants can be located in the same place). So I cannot merge the dataframes based on postal code.
  • But restaurants sharing the same postal code can be distinguished by their names.
  • Restaurant names are spelt slightly different so I cannot merge based on restaurant names too

Ideally I want to produce a table that looks like this: df3

I tried to match the restaurant names based on fuzzy matching followed by a match of postal code, but was not able to get a very accurate result. I also tried to concatenate the restaurant name with postal code for each of the dataframe and do a fuzzy matching of the concatenated result but I don't think this is the best way.

Is there any way to achieve 100% accuracy in matching the two dataframes?

user9826192
  • 75
  • 3
  • 8
  • _I tried to match the restaurant names based on fuzzy matching followed by a match of postal code, but was not able to get a very accurate result._ Can you elaborate? Could you share the code from that attempt? Also, please include code/data as text in the post itself, no as images. – AMC Dec 12 '19 at 07:03

2 Answers2

4

Check difflib.get_close_matches().

I tried this using your sample dataframe. Does it help?

import pandas as pd
import difflib

df1 = pd.DataFrame ({'Restaurant_Name': ['Apple', 'Banana', 'Orange', 'apple','apple1'],
                     'Postal Code': [12345, 12345, 54321, 54321,1111]})
df2 = pd.DataFrame ({'Restaurant_Name': ['apple', 'apple', 'Banana'],
                     'Postal Code': [12345, 54321, 12345],
                    'Phone':[100,200,300]})

df1['key'] = df1['Restaurant_Name']+df1['Postal Code'].astype(str)
df2['key'] = df2['Restaurant_Name']+df2['Postal Code'].astype(str)
df2['key'] = df2['key'].apply(lambda x: difflib.get_close_matches(x, df1['key'])[0])

df1.merge(df2, on='key', how='outer')[['Restaurant_Name_x','Restaurant_Name_y','Postal Code_x','Phone']]

Output:

  Restaurant_Name_x Restaurant_Name_y  Postal Code_x  Phone
0             Apple             apple          12345  100.0
1            Banana            Banana          12345  300.0
2            Orange               NaN          54321    NaN
3             apple             apple          54321  200.0
4            apple1               NaN           1111    NaN

As you said, I did concatenate the restaurant name with postal code to get a unique combination.

davidbilla
  • 2,120
  • 1
  • 15
  • 26
  • Hello, I used difflib.get_close_matches as well and set, the cut off at 0.8. It gives a relatively accurate match, but not 100% accuracy, for example "Apple Restaurant" will be matched with "Pineapple Restaurant" due to high similarity (if they are located at the same place). – user9826192 Dec 13 '19 at 09:48
  • Yes, that is how `difflib.get_close_matches()` works. You can use `difflib.SequenceMatcher()` to get the score. But still there are a other advanced libraries like `fuzzywuzzy`. – davidbilla Dec 13 '19 at 13:59
  • i tried fuzzywuzzy as well, not much difference from difflib. I guess it's not very possible to obtain 100% accuracy. Thank you for your help anyway! – user9826192 Dec 16 '19 at 04:13
  • Guess this has a possible AI/ML solution – davidbilla Dec 16 '19 at 04:20
0

One option is to use some Fuzzy String Matching module like fuzzywuzzy.

install required librarys

pip install fuzzywuzzy
pip install python-Levenshtein

now find name matches like below

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
match_level = 90
def find_details(row):
    sub_df = df2[df2['Postal Code'] == row['Postal Code']].copy()
    sub_df['match'] = sub_df['Restaurant_Name'].apply(lambda x: fuzz.token_sort_ratio(row['Restaurant_Name'], x))
    sub_df = sub_df[sub_df['match'] >= match_level].sort_values(['match'], ascending=[False])
    phone = ''
    if sub_df.shape[0] > 0:
        phone = sub_df['Phone'].values[0]

    ret = {
        'phone' : phone
    }
    return pd.Series(ret)
df1.merge(df1.apply(lambda row: find_details(row), axis=1),  left_index=True, right_index=True)

which gives you the following output

    Restaurant_Name     Postal Code     phone
0   Apple   12345   100
1   Banana  12345   300
2   Orange  54321   
3   apple   54321   200
4   apple1  1111    
Prince Francis
  • 2,995
  • 1
  • 14
  • 22