I have a pandas DataFrame with 500k rows that contains paid-out expenses. It looks like so:
As you can see, the 'LastName' column contains entries that should be identical but in practice differ slightly. My ultimate goal is to see how much was paid to each entity with a simple groupby and .sum(). However, for that to work, the entries under 'LastName' must be uniform.
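For context, this is the aggregation I'm ultimately after. A minimal sketch, assuming the paid amounts live in a column I'll call 'Amount' (the real column name may differ):

    # Once 'LastName' is uniform, the total paid per entity is a one-liner.
    totals = expenditures_df.groupby('LastName')['Amount'].sum()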
I'm attempting to solve this problem using fuzzywuzzy.
First I take the unique values from 'LastName' and save them to an array for comparison:
choices = expenditures_df['LastName'].astype('str').unique()
This leaves me with 50k unique entries from 'LastName' that I now need to compare the full 500k against.
Then I run through every line in the dataframe and look at its similarity to each choice. If the similarity is high enough, I overwrite the value in the dataframe with the entity name from choices.
from fuzzywuzzy import fuzz

for choice in choices:
    word = str(choice)
    for i in expenditures_df.index:
        # Overwrite near-matches (similarity above 79) with the canonical name.
        if fuzz.ratio(word, str(expenditures_df.loc[i, 'LastName'])) > 79:
            expenditures_df.loc[i, 'LastName'] = word
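For reference, this is roughly how that threshold of 79 behaves (the names below are made up for illustration):

    from fuzzywuzzy import fuzz

    # Near-duplicates score high, clearly different names score low.
    print(fuzz.ratio('Smith Plumbing Inc', 'Smith Plumbing Inc.'))   # well above 79
    print(fuzz.ratio('Smith Plumbing Inc', 'Jones Electrical LLC'))  # well below 79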
The problem, of course, is that this is incredibly slow. So I'd love some thoughts on accomplishing the same thing in a more efficient manner.