0

My data frame is

enter image description here

Matcher = df2['Account Name']

match = if df1['Billing Country'] == df2['Billing Country'] (process.extractOne(df1['Account Name'], Matcher))

The above code is not working but I want to do the fuzzy match of account name only when the country is matching.

Maneet Giri
  • 185
  • 3
  • 18
  • What you should probably do a full outer join on both dataframes, compute the FuzzyRatio for each combination, and filter the result to those that only have 100% match. – Sam May 03 '16 at 16:55
  • Sam- Are you suggesting is that I do an outer merge on Billing country and then look for the fuzzy match for each combination? – Maneet Giri May 03 '16 at 18:11

2 Answers2

1

Here's what I am suggesting. First, a full cartesian join on the two dfs:

df1.loc[:, 'MergeKey'] = 1 #create a mergekey
df2.loc[:, 'MergeKey'] = 1 #it is the same for both so that when you merge you get the cartesian product
#merge them to get the cartesian product (all possible combos)
merged = df1.merge(df2, on = 'MergeKey', suffixes = ['_1', '_2'])

Then, calculate the fuzz ratio for each combo:

def fuzzratio(row):
    try: #avoid errors for example on NaN's
        return fuzz.ratio(row['Billing Country_1'], row['Billing Country_2'])
    except:
        return 0. #you'll want to expiriment w/o the try/except too
merged.loc[:, 'Ratio'] = merged.apply(fuzzratio, axis = 1) #create ratio column by applying function

Now you should have a df with the ratio between all possible combinations of df1['Billing Country'] and df2['Billing Country']. Once there, simply filter to get the ones where the ratio is 100%:

result = merged[merged.Ratio ==1]
Sam
  • 4,000
  • 20
  • 27
  • I am sorry. It is harder for me to understand. Could you please explain what does df1.loc[:, 'MergeKey'] = 1 does. and is 'MergeKey' is 'Account Name' in this code – Maneet Giri May 04 '16 at 18:42
  • My code is this `merged_file = pd.merge(df2, df1, on='Billing Country', how = 'outer')` – Maneet Giri May 04 '16 at 18:57
  • I have 3,467,624 possible matches but I can't save it in excel now. `merged_file.to_excel('merged_file.xlsx')` – Maneet Giri May 04 '16 at 19:04
  • What Im doing with the merge key is creating a column of 1's in each df. Then I join on these columns. This means, if you have 10 rows in df1 and 10 rows in df2, you end up with 100 rows in "merged". This will give you every possible combo of Country_1 and Country_2. Then you can caluclate the fuzzratio for every possible combo. – Sam May 04 '16 at 19:13
  • so sam what if you wanted to see the ratio of another pair of columns? would you just add another return statement? or would you have to rewrite out the whole def statement? – Cannon Jun 14 '17 at 19:32
  • @Cannon you could return a tuple of ratios and unpack them into multiple columns. See for example the answer here about how to return two new columns from an apply: https://stackoverflow.com/questions/36599697/dataframe-apply-method-to-return-multiple-elements-series/36600318#36600318 – Sam Jun 15 '17 at 11:16
0

I figured it out in slightly different way.

first I merged using

merged_file = pd.merge(df2, df1, on='Billing Country', how = 'left')

and when I had all the possible matches.

I applies fuzzywuzzy's

`Reference_data= df2['Account Name']`

`Result = process.extractOne(df1, choices)`

As the above string gave me the closest possible match for each value I wanted to lookup for. Later I added one more string in order to calculate the ratio.

Result['ratio']= fuzz.ratio(Result['Account Name_x'],Result['Account Name_y'] )
Maneet Giri
  • 185
  • 3
  • 18