0

I have been working on an Excel file and looking out to find out accuracy percentage by using Fuzzy match / any other method from another sheet using an dataframe. Matching other column values using one column having unique values.

Input values:

Sheet 1

identity_no  address            Pincode   company_name

 IN2231      Delhi, Indi        110030    AXN pvt Ltd
 UK654       London, Uk         897653    Aviva Intl Ltd
 SL1432      Colombo, Srilanka  07658     Ship Incorporations
 LK0678      Libya, Sns         5674332   Oppo Mobiles pvt ltd

Master Data Sheet 2

identity_no  address            Pincode   company_name

 IN2231      Delhi, India       110030    AXN pvt Ltd
 UK654       London, Uk         897653    Aviva Intl Ltd
 SL1432      Colombo, Srilanka  07658     Ship Incorporations

Looking out to map the column values from Master Data Sheet 2 and update the values into Sheet 1 only when matching accuracy percentage is more than 75% . based on identity_no. Also if any identity_no is not present in Sheet 2, We also need to update into sheet 2.

I have tried to use Fuzzy match bot couldn't worked

Script i have been trying to use so far is

df = pd.read_excel(open(r'input.xlsx', 'rb'), sheet_name='sheet1')
df2 = pd.read_excel(open(r'master_data.xlsx', 'rb'), sheet_name='sheet2')

from difflib import SequenceMatcher
lst = df2[['identity_no','address', 'Pincode', 'company_name']]

def closest(s):
    highest = 0
    result = ''
    for i in lst:
        temp = SequenceMatcher(None, s, i).ratio() #Similarity Ratio
        if temp > highest:
            highest = temp
            result = i
    return result

How to check for the matching accuracy percentage and only execute the further process when it is more than 75%

df[['identity_no','address', 'Pincode', 'company_name']] = df[['identity_no','address', 'Pincode', 'company_name']] .apply(lambda x: closest(x))
Manz
  • 593
  • 5
  • 23
  • Why don't you just merge on `'identity_no', 'Pincode', 'company_name'` and then drop whichever `address` column is less clean? – m13op22 Mar 31 '21 at 18:54
  • @HS-nebula - tried it but didnt worked for me in this case as i have to map from sheet 2 to sheet 1 only when matching percentage is more than 75%. – Manz Mar 31 '21 at 19:06
  • You can check out this [question](https://stackoverflow.com/questions/13636848/is-it-possible-to-do-fuzzy-match-merge-with-python-pandas) and this [question](https://stackoverflow.com/questions/31734484/using-difflib-sequencematcher-ratio-to-merge-in-pandas) then. – m13op22 Mar 31 '21 at 19:25
  • I'd also point out that you could do the merge on the three columns, then create a 5th column that contains the ratio between the address columns, and drop rows that are greater than 0.75 - might be easier than fuzzy merging – m13op22 Mar 31 '21 at 19:29

0 Answers0