0

I have 4 columns which are BuisnessID, Name, BuisnessID_y, Name_y and I want to match Name with Name_y with a 90% similarity score, and if not 90% then drop those rows. Sample input

df

BusinessID      NAME        BusinessID_y  NAME_y

1013120869  MANOJ WANKHADE  1013404164    SLIMI
1013120869  MANOJ WANKHADE  1013831688    AMOL SHAHAKAR
1013120869  MANOJ WANKHADE  1013376009    PRATHMESH AGRAWAL
1013120869  MANOJ WANKHADE  1013376009    PRATHMESH AGRAWAL
1013120869  MANOJ WANKHADE  1013478922    AMBRISH PANDRIKAR

I am new to python and am not sure how to do this. Also, I have 500k records so any another approach other rapid-fuzz would be great

James Z
  • 12,209
  • 10
  • 24
  • 44
  • If you provide examples of the rapid-fuzz code you'd like to implement I can help further. But in short you need the pandas apply function. df['score'] = df[['NAME', 'NAME_y']].apply(... some function here...) – Kelvin Ducray Dec 10 '21 at 09:33
  • Rapid-fuzz is a library though I am open to using any similarity score ratio which is fast as I have some 5 lakh records @KelvinDucray – Sarthak Gupta Dec 10 '21 at 09:46

1 Answers1

2
>>> import pandas as pd
>>> import rapidfuzz
>>> df['matching_ratio'] = df.apply(lambda x:rapidfuzz.fuzz.ratio(x.NAME, x.NAME_y), axis=1).to_list()
>>> df

   BusinessID            NAME  BusinessID_y             NAME_y  matching_ratio
0  1013120869  MANOJ WANKHADE    1013404164              SLIMI       10.526316
1  1013120869  MANOJ WANKHADE    1013831688      AMOL SHAHAKAR       44.444444
2  1013120869  MANOJ WANKHADE    1013376009  PRATHMESH AGRAWAL       25.806452
3  1013120869  MANOJ WANKHADE    1013376009  PRATHMESH AGRAWAL       25.806452
4  1013120869  MANOJ WANKHADE    1013478922  AMBRISH PANDRIKAR       38.709677


>>> df[df.matching_ratio > 26] # change this '26' value to '90' as your requirmetn

   BusinessID            NAME  BusinessID_y             NAME_y  matching_ratio
1  1013120869  MANOJ WANKHADE    1013831688      AMOL SHAHAKAR       44.444444
4  1013120869  MANOJ WANKHADE    1013478922  AMBRISH PANDRIKAR       38.709677
Amir saleem
  • 1,404
  • 1
  • 8
  • 11