1

Looking out to find the highest accuracy percentage between 2 column values by using Fuzzy string matching.

I have 2 dataframes where i am trying to use fuzzy match between an specific columns values from both the dataframes.

Lets say df1 has 5 rows and df2 has 4 rows, i want to pick the values of each row of df1 and match with every row of df2 and find the highest accuracy. Suppose Row1 from df1 has been compared with all the rows from df2 so whichever the row from df2 has highest accuracy we will consider it as output. same to be considered for each row from df1.

Input Data :

Dataframe1

id_number  company_name        match_acc

IN2231D    AXN pvt Ltd
UK654IN    Aviva Intl Ltd
SL1432H    Ship Incorporations
LK0678G    Oppo Mobiles pvt ltd
NG5678J    Nokia Inc

Dataframe2

identity_no   Pincode   company_name

 IN2231        110030    AXN pvt Ltd
 UK654IN       897653    Aviva Intl Ltd
 SL1432        07658     Ship Incorporations
 LK0678G       120988    Oppo Mobiles Pvt Ltd

Want to find the highest accuracy percentage and submit the values in match_acc column.

Code i have been using currently:

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')


from fuzzywuzzy import fuzz 
for index, row in df1.iterrows():
   df1['match_acc']= fuzz.partial_ratio(df1['id_number'], df2['identity_no'])

print(df1['match_acc'])

I have been using Fuzzywuzzy, If there is any other method as well pls suggest.

Any Suggestion.

NKJ
  • 457
  • 1
  • 4
  • 11
  • https://stackoverflow.com/questions/54494167/how-to-merge-on-approximate-strings this might help – Ezer K Apr 11 '21 at 19:45
  • @EzerK - Thanks, but this couldn't helped me out as i am looking to find out the accuracy percentage – NKJ Apr 11 '21 at 19:53

2 Answers2

3

TL;DR

  1. Cross-merge df1.id_number with df2.identity_no
  2. Compute each pair's thefuzz.fuzz.ratio or the faster rapidfuzz.fuzz.ratio
  3. map the groupby.max ratios back to df1
cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())

#   id_number          company_name  match_acc
# 0   IN2231D           AXN pvt Ltd         92
# 1   UK654IN        Aviva Intl Ltd        100
# 2   SL1432H   Ship Incorporations         92
# 3   LK0678G  Oppo Mobiles pvt ltd        100
# 4   NG5678J             Nokia Inc         43

Details

  1. The cross method of merge yields the cartesian product of df1.id_number and df2.identity_no:

    cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')
    
    #    id_number identity_no
    # 0    IN2231D      IN2231
    # 1    IN2231D     UK654IN
    # 2    IN2231D      SL1432
    # ...
    # 17   NG5678J     UK654IN
    # 18   NG5678J      SL1432
    # 19   NG5678J     LK0678G
    

    For pandas < 1.2, how='cross' isn't available, so use how='outer' on a temporary key:

    cross = df1[['id_number']].assign(tmp=0).merge(df2[['identity_no']].assign(tmp=0), how='outer', on='tmp').drop(columns='tmp')
    
  2. apply the fuzzy calculator pair-wise:

    cross['match_acc'] = cross.apply(lambda x: fuzz.ratio(x.id_number, x.identity_no), axis=1)
    
    #    id_number identity_no  match_acc
    # 0    IN2231D      IN2231         92
    # 1    IN2231D     UK654IN         29
    # 2    IN2231D      SL1432         15
    # ...
    # 17   NG5678J     UK654IN         14
    # 18   NG5678J      SL1432          0
    # 19   NG5678J     LK0678G         43
    
  3. Use groupby.max to get the max scores per id_number and map them into df1.match_acc:

    df1['match_acc'] = df1.id_number.map(cross.groupby('id_number').match_acc.max())
    
    #   id_number          company_name  match_acc
    # 0   IN2231D           AXN pvt Ltd         92
    # 1   UK654IN        Aviva Intl Ltd        100
    # 2   SL1432H   Ship Incorporations         92
    # 3   LK0678G  Oppo Mobiles pvt ltd        100
    # 4   NG5678J             Nokia Inc         43
    
tdy
  • 36,675
  • 19
  • 86
  • 83
  • Didnt, Understood, How top 4 rows from df1 have accuracy 100% as an output, as if we can see the input dataframe only 2 row values from df1 are having 100% match - 'UK654IN', 'LK0678G' when comparing with df2. – NKJ Apr 11 '21 at 20:52
  • I just used the fuzzy function you posted. For example the output of `fuzz.partial_ratio('IN2231D', 'IN2231')` is 100%. – tdy Apr 11 '21 at 20:55
  • 1
    Thanks, In your opinion is there any other function apart from fuzzy.partial_ratio() . which can suit better in this case. – NKJ Apr 11 '21 at 20:59
  • 1
    @NKJ I updated the answer to change `partial_ratio()` to `ratio()` which gives you the Levenshtein scores `[92,100,92,100,43]`. That should be what you're looking for. – tdy Apr 11 '21 at 21:04
  • - Getting the error while executing this line of script "cross = df1[['id_number']].merge(df2[['identity_no']], how='cross')" ERROR: "pandas.errors.MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False", Please Suggest. – NKJ Apr 13 '21 at 15:11
  • I have tried using , left_on=None, right_on=None but still facing the same issue, Please suggest what needs to be updated – NKJ Apr 13 '21 at 18:44
  • 1
    @NKJ Not sure, maybe you have an older pandas version. Try this: `cross = df1[['id_number']].assign(tmp=0).merge(df2[['identity_no']].assign(tmp=0), how='outer', on='tmp').drop(columns='tmp')` – tdy Apr 13 '21 at 19:59
  • How to execute it only when column "id_number" is not blank – NKJ May 04 '21 at 04:15
  • 1
    @NKJ i would `dropna` from `id_number` column before computing, e.g. `cross = cross.dropna(subset=['id_number'])` – tdy May 04 '21 at 06:00
1

You can use the process function of fuzzywuzzy for one-to-many operations. Also, use rapidfuzz instead of fuzzywuzzy which has the same functionality but it performs some pre-processing based on string algorithms to provide faster results.

pip install rapidfuzz

# from fuzzywuzzy import fuzz, process
from rapidfuzz import fuzz, process # --> Use this for drastic exponential execution time improvements

df1 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet2')


for index, row in df1.iterrows():
    #extractOne will automatically extract the best one from the list of choices
    # you can provide which fuzzywuzzy scorer to use as well

    df1['match_acc']= process.extractOne(query=row['id_number'], choices=df2['identity_no'].tolist(), scorer=fuzz.partial_ratio)
print(df1['match_acc'])
Shreyesh Desai
  • 569
  • 4
  • 19