I have a pandas DataFrame with 500k rows that contains paid-out expenses. It looks like so:
As you can see, the 'LastName' column contains entries that should be identical but in practice differ slightly. My ultimate goal is to see how much was paid to each entity with a simple groupby and .sum(). However, for that to work, the entries under 'LastName' must be uniform.
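For context, this is the aggregation I'm ultimately after. A minimal sketch, assuming the paid amounts live in a column I'll call 'Amount' (the real column name may differ):

    # Once 'LastName' is uniform, the total paid per entity is a one-liner.
    totals = expenditures_df.groupby('LastName')['Amount'].sum()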
I'm attempting to solve this problem using fuzzywuzzy.
First I take the unique values from 'LastName' and save them to an array for comparison:
choices = expenditures_df['LastName'].astype('str').unique()
This leaves me with 50k unique entries from 'LastName' that I now need to compare the full 500k against.
Then I run through every line in the dataframe and look at its similarity to each choice. If the similarity is high enough, I overwrite the value in the dataframe with the entity name from choices.
from fuzzywuzzy import fuzz

for choice in choices:
    word = str(choice)
    for i in expenditures_df.index:
        # Overwrite near-matches (similarity above 79) with the canonical name.
        if fuzz.ratio(word, str(expenditures_df.loc[i, 'LastName'])) > 79:
            expenditures_df.loc[i, 'LastName'] = word
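For reference, this is roughly how that threshold of 79 behaves (the names below are made up for illustration):

    from fuzzywuzzy import fuzz

    # Near-duplicates score high, clearly different names score low.
    print(fuzz.ratio('Smith Plumbing Inc', 'Smith Plumbing Inc.'))   # well above 79
    print(fuzz.ratio('Smith Plumbing Inc', 'Jones Electrical LLC'))  # well below 79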
The problem, of course, is that this is incredibly slow. So I'd love some thoughts on accomplishing the same thing in a more efficient manner.