I am trying to look for potential matches in a PANDAS column full of organization names. I am currently using iterrows() but it is extremely slow on a dataframe with ~70,000 rows. After having looked through StackOverflow I have tried implementing a lambda row (apply) method but that seems to barely speed things up, if at all.
The first four rows of the dataframe look like this:
index org_name
0 cliftonlarsonallen llp minneapolis MN
1 loeb and troper llp newyork NY
2 dauby o'connor and zaleski llc carmel IN
3 wegner cpas llp madison WI
The following code block works but took around five days to process:
org_list = df['org_name']
from fuzzywuzzy import process
for index, row in df.iterrows():
x = process.extract(row['org_name'], org_list, limit=2)[1]
if x[1]>93:
df.loc[index, 'fuzzy_match'] = x[0]
df.loc[index, 'fuzzy_match_score'] = x[1]
In effect, for each row I am comparing the organization name against the list of all organization names, taking the top two matches, then selecting the second-best match (because the top match will be the identical name), and then setting a condition that the score must be higher than 93 in order to create the new columns. The reason I'm creating additional columns is that I do not want to simply replace values -- I'd like to double-check the results first.
Is there a way to speed this up? I read several blog posts and StackOverflow questions that talked about 'vectorizing' this code but my attempts at that failed. I also considered simply creating a 70,000 x 70,000 Levenshtein distance matrix and then extracting information from there. Is there a quicker way to generate the best match for each element in a list or PANDAS column?