Python Pandas fuzzywuzzy 'join' of two datasets on string columns

Question

I am following the answer in this question that uses fuzzywuzzy to 'join' two data sets on string columns.

I am getting an error that i am having problems troubleshooting.

The error message seems to suggest key value problems. Assuming that was about null values, I filtered them out, but still get same error message.
The strings are company names that may have apostrophes, hyphens, periods, etc. I am assuming fuzzywuzzy can deal with those so not removing them first.

Any insight into what I should be looking for as next steps to troubleshoot this?

This is import of data from Excel files using Pandas:

import pandas as pd
from fuzzywuzzy import fuzz
import difflib 

vendor_file = "vendor.xlsx"
spr_file = "spr.xlsx"

xl_vendor = pd.ExcelFile(vendor_file)
xl_spr = pd.ExcelFile(spr_file)

vendor1 = xl_vendor.parse(xl_vendor.sheet_names[0])
spr1 = xl_spr.parse(xl_spr.sheet_names[0])

spr = spr1[pd.notnull(spr1['Contractor'])]
vendor = vendor1[pd.notnull(vendor1['Vendor Name'])]

And this is the part from other question that does matching and concatenation of matches to data set:

def get_spr(row):
    d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*2
    else:
        v = spr.ix[d.idxmax(), ['Vendor Name', 'Pass/Fail']].values
    return pd.Series(v, index=['Vendor Name', 'Pass/Fail'])

# Must be unindented from function indent
pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)

The error is:

---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-21-41973cb5c3d7> in <module>()
----> 1 pd.concat((vendor, vendor.apply(get_spr, axis=1)), axis=1)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3716                     if reduce is None:
   3717                         reduce = True
-> 3718                     return self._apply_standard(f, axis, reduce=reduce)
   3719             else:
   3720                 return self._apply_broadcast(f, axis)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3806             try:
   3807                 for i, v in enumerate(series_gen):
-> 3808                     results[i] = func(v)
   3809                     keys.append(v.name)
   3810             except Exception as e:

<ipython-input-19-62cc0c6c6daf> in get_spr(row)
      1 def get_spr(row):
----> 2     d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
      3     d = d[d >= 75]
      4     if len(d) == 0:
      5         v = ['']*2

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in apply(self, func, axis, broadcast, raw, reduce, args, **kwds)
   3716                     if reduce is None:
   3717                         reduce = True
-> 3718                     return self._apply_standard(f, axis, reduce=reduce)
   3719             else:
   3720                 return self._apply_broadcast(f, axis)

C:\Anaconda\lib\site-packages\pandas\core\frame.pyc in _apply_standard(self, func, axis, ignore_failures, reduce)
   3806             try:
   3807                 for i, v in enumerate(series_gen):
-> 3808                     results[i] = func(v)
   3809                     keys.append(v.name)
   3810             except Exception as e:

<ipython-input-19-62cc0c6c6daf> in <lambda>(x)
      1 def get_spr(row):
----> 2     d = spr.apply(lambda x: fuzz.ratio(x['Vendor Name'], row['Contractor']) * 2 if row['Contractor'] == x['Vendor Name'] else 1, axis=1)
      3     d = d[d >= 75]
      4     if len(d) == 0:
      5         v = ['']*2

C:\Anaconda\lib\site-packages\pandas\core\series.pyc in __getitem__(self, key)
    519     def __getitem__(self, key):
    520         try:
--> 521             result = self.index.get_value(self, key)
    522 
    523             if not np.isscalar(result):

C:\Anaconda\lib\site-packages\pandas\core\index.pyc in get_value(self, series, key)
   1607                     raise InvalidIndexError(key)
   1608                 else:
-> 1609                     raise e1
   1610             except Exception:  # pragma: no cover
   1611                 raise e1

KeyError: ('Contractor', u'occurred at index 3', u'occurred at index 0')

Edited to add dataframe cols:

spr: 'Contractor', 'Pass/Fail'
vendor: 'Vendor Name'

Edited to add corrected revision of matching based on davidshinn answer:

def get_spr(row):
    d = spr.apply(lambda x: fuzz.ratio(x['Contractor'], row['Vendor Name']) * 2 if row['Vendor Name'] == x['Contractor'] else 1, axis=1)
    d = d[d >= 75]
    if len(d) == 0:
        v = ['']*2
    else:
        v = spr.ix[d.idxmax(), ['Contractor', 'Pass/Fail']].values
    return pd.Series(v, index=['Contractor', 'Pass/Fail'])

Could you please provide the column names for the vendor and spr dataframes. Are you certain that `Contractor` is a valid column in the `vendor` dataframe, because that is the dataframe `row['Contractor']` is attempting access. — davidshinn, Nov 20 '15 at 15:24
Ok that helped. I had "Vendor Name" and "Contractor" reversed in matching part. (The original answer referenced in other question has exact same column names in both datasets so i was just guessing which was which.) I had to modify as noted in my edit. Can you make your comment to answer i can accept. thx! — curtisp, Nov 20 '15 at 16:06
I put it as an answer, not really answering the posts question, but more of a bug fix. Good luck. — davidshinn, Nov 20 '15 at 16:08

score 0 · Accepted Answer · answered Nov 20 '15 at 16:07

0

Could you please provide the column names for the vendor and spr dataframes. Are you certain that Contractor is a valid column in the vendor dataframe, because that is the dataframe row['Contractor'] is attempting access.

answered Nov 20 '15 at 16:07

davidshinn

1,916
1
16
17

Python Pandas fuzzywuzzy 'join' of two datasets on string columns

1 Answers1