11

I'm trying to fuzzy match two csv files, each containing one column of names, that are similar but not the same.

My code so far is as follows:

import pandas as pd
from pandas import DataFrame
from fuzzywuzzy import process
import csv

save_file = open('fuzzy_match_results.csv', 'w')
writer = csv.writer(save_file, lineterminator = '\n')

def parse_csv(path):

with open(path,'r') as f:
    reader = csv.reader(f, delimiter=',')
    for row in reader:
        yield row


if __name__ == "__main__":
## Create lookup dictionary by parsing the products csv
data = {}
for row in parse_csv('names_1.csv'):
    data[row[0]] = row[0]

## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):
    #print(process.extract(row,data, limit = 100))
    for found, score, matchrow in process.extract(row, data, limit=100):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)


save_file.close()

The output is as follows:

Name11 , 90 , Name25 
Name11 , 85 , Name24 
Name11 , 65 , Name29

The script works fine. The output is as expected. But what I am looking for is only the best match.

Name11 , 90 , Name25
Name12 , 95 , Name21
Name13 , 98 , Name22

So I need to somehow drop the duplicated names in column 1, based on the highest value in column 2. It should be fairly straightforward, but I can't seem to figure it out. Any help would be appreciated.

Michael Ohlrogge
  • 10,559
  • 5
  • 48
  • 76
Kvothe
  • 1,341
  • 7
  • 20
  • 33
  • 1
    A straight-forward way is to have two temp value to record current highest match and its score. Then only write to you file once after inner loop finished – Jerry Meng Aug 17 '15 at 17:10

3 Answers3

11

fuzzywuzzy's process.extract() returns the list in reverse sorted order , with the best match coming first.

so to find just the best match, you can set the limit argument as 1 , so that it only returns the best match, and if that is greater than 60 , you can write it to the csv, like you are doing now.

Example -

from fuzzywuzzy import process
## For each row in the lookup compute the partial ratio
for row in parse_csv("names_2.csv"):

    for found, score, matchrow in process.extract(row, data, limit=1):
        if score >= 60:
            print('%d%% partial match: "%s" with "%s" ' % (score, row, found))
            Digi_Results = [row, score, found]
            writer.writerow(Digi_Results)
StevenWernerCS
  • 839
  • 9
  • 15
Anand S Kumar
  • 88,551
  • 18
  • 188
  • 176
  • that for loop on `process.extract` yields `ValueError: not enough values to unpack (expected 3, got 2)`? I believe `found` should be removed. I would edit the answer but not sure if its a version issue? – StevenWernerCS Jun 26 '19 at 16:26
7

Several pieces of your code can be greatly simplified by using process.extractOne() from FuzzyWuzzy. Not only does it just return the top match, you can set a score threshold for it within the function call, rather than needing to perform a separate logical step, e.g.:

process.extractOne(row, data, score_cutoff = 60)

This function will return a tuple of the highest match plus the accompanying score if it finds a match satisfying the condition. It will return None otherwise.

Michael Ohlrogge
  • 10,559
  • 5
  • 48
  • 76
5

I just wrote the same thing for myself but in pandas....

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

d1={1:'Tim','2':'Ted',3:'Sally',4:'Dick',5:'Ethel'}
d2={1:'Tam','2':'Tid',3:'Sally',4:'Dicky',5:'Aardvark'}

df1=pd.DataFrame.from_dict(d1,orient='index')
df2=pd.DataFrame.from_dict(d2,orient='index')

df1.columns=['Name']
df2.columns=['Name']

def match(Col1,Col2):
    overall=[]
    for n in Col1:
        result=[(fuzz.partial_ratio(n, n2),n2) 
                for n2 in Col2 if fuzz.partial_ratio(n, n2)>50
               ]
        if len(result):
            result.sort()    
            print('result {}'.format(result))
            print("Best M={}".format(result[-1][1]))
            overall.append(result[-1][1])
        else:
            overall.append(" ")
    return overall

print(match(df1.Name,df2.Name))

I have used a threshold of 50 in this - but it is configurable.

Dataframe1 looks like

    Name
1   Tim
2   Ted
3   Sally
4   Dick
5   Ethel

And Dataframe2 looks like

Name
1   Tam
2   Tid
3   Sally
4   Dicky
5   Aardvark

So running it produces the matches of

['Tid', 'Tid', 'Sally', 'Dicky', ' ']

Hope this helps.

Tim Seed
  • 5,119
  • 2
  • 30
  • 26