0

I have df1 and df2. I want to use fuzzywuzzy to string match column A in df1 to column A in df2, and return an ID in column B of df2 based on a certain ratio match.

For example:

df1 looks like this:


Name

Sally sells Seashells


df2 looks like this:


Name | ID

Sally slls sshells | 28904


What I'm trying to do is compare everything in column A in df1 to find a match in column A in df2 and return the ID from column B in df2.

I would like to be able to set the criteria of the fuzzy ratio. For example: I only want it to return an ID if the ratio is above 50.

My current code:

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
df1=pd.read_csv('C:\\Users\\nkurdob\\Desktop\\Sheet1.csv')
df2=pd.read_csv('C:\\Users\\nkurdob\\Desktop\\Sheet2.csv')


for i in range(len(df1)):
    em = df1['A'][i]
    test = fuzz.partial_ratio(em, df2['A']) 
    if test > 50:
        print df1['A'][i]==df2['B']
Jim Dennis
  • 17,054
  • 13
  • 68
  • 116
Window
  • 87
  • 1
  • 8

1 Answers1

0

Firstly thanks for the question, I have never used fuzzywuzzy before...

This is my take on your question.

Here I am trying to match the name column in 2 data frames, and I will only show results which have a greater than 50 score.

As I would then concat these results (or replace a column) I add blank values where there are no matches.... obviously you may or may not want to do this.

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

d1={1:'Tim','2':'Ted',3:'Sally',4:'Dick',5:'Ethel'}
d2={1:'Tam','2':'Tid',3:'Sally',4:'Dicky',5:'Aardvark'}

df1=pd.DataFrame.from_dict(d1,orient='index')
df2=pd.DataFrame.from_dict(d2,orient='index')

df1.columns=['Name']
df2.columns=['Name']

def match(Col1,Col2):
    overall=[]
    for n in Col1:
        result=[(fuzz.partial_ratio(n, n2),n2) 
                for n2 in Col2 if fuzz.partial_ratio(n, n2)>50
               ]
        if len(result):
            result.sort()    
            print('result {}'.format(result))
            print("Best M={}".format(result[-1][1]))
            overall.append(result[-1][1])
        else:
            overall.append(" ")
    return overall

print(match(df1.Name,df2.Name))

When this is run you should see output like this.

result [(67, 'Tam'), (67, 'Tid')]
Best M=Tid
result [(67, 'Tid')]
Best M=Tid
result [(100, 'Sally')]
Best M=Sally
result [(100, 'Dicky')]
Best M=Dicky
['Tid', 'Tid', 'Sally', 'Dicky', ' ']

I am obviously only showing the intermediate results so I can demonstrate the value matching clause is working.

I then sort the list of tuples (as they were stored with score-then-value order), take the last one (you can reverse the sort and take the top value up to you), I then take the 2nd element ([1]) from the tuple.

This should work for any 2 string Columns, but I have not tested this.

Tim Seed
  • 5,119
  • 2
  • 30
  • 26
  • Hey can you please try to answer this thread ? https://stackoverflow.com/questions/68413792/how-to-sort-dataframe2-according-to-dataframe1-with-fuzzywuzzy – Titan Jul 17 '21 at 17:04