I have a dataset of 200k rows with two columns: (1) a unique customer ID and address combination and (2) revenue. The table is sorted by revenue, and the goal is to clean up column 1 by fuzzy matching it against itself: wherever a combination is close enough to one with higher revenue, the lower-revenue combination (which most likely differs only by spelling) should be replaced with the higher-revenue one.
Example:
In the above example, the third row is very similar to the first row, so I want it to take the value of the first row.
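To make the intended transformation concrete, here is a toy sketch with made-up rows (the IDs and revenues below are purely hypothetical, not from my actual data):

import pandas as pd

# Hypothetical input, sorted by revenue (highest first)
df = pd.DataFrame({
    'UNIQUE_ID': ['ACME CORP 12 MAIN ST', 'ZETA LLC 9 OAK AVE', 'ACME CORP. 12 MAIN STREET'],
    'REVENUE': [50000, 20000, 1500],
})

# Desired result: the third ID is a near-duplicate of the first, higher-revenue ID,
# so after cleaning it should carry the first row's spelling:
# UPDATED_ID = ['ACME CORP 12 MAIN ST', 'ZETA LLC 9 OAK AVE', 'ACME CORP 12 MAIN ST']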
I have working Python code, but it is too slow:
import pandas as pd
import datetime
import time
import numpy as np
from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance, normalized_damerau_levenshtein_distance_ndarray

data = pd.read_csv("CustomerMaster.csv", encoding="ISO-8859-1")

# Create the lookup column from the dataframe itself (same order as data, i.e. sorted by revenue):
lookup_data = data['UNIQUE_ID'].to_frame()

# Iterate row by row over the lookup data, find the first (highest-revenue) close fuzzy match
# and write it back into the dataframe:
start = time.time()
for index, row in data.iterrows():
    if index % 5000 == 0:
        print(index, time.time() - start)
    for index2, row2 in lookup_data.iterrows():
        ratio_val = normalized_damerau_levenshtein_distance(row['UNIQUE_ID'], row2['UNIQUE_ID'])
        if ratio_val < 0.15:
            # DataFrame.set_value is deprecated; .at does the same scalar assignment
            data.at[index, 'UPDATED_ID'] = row2['UNIQUE_ID']
            data.at[index, 'Ratio_Val'] = ratio_val
            break
Currently this fuzzy-matching block of code is taking far too long to run: about 8 hours for the first 15k rows, with the running time growing roughly quadratically, as one would expect from an all-pairs comparison. Any suggestions on how to write this code more efficiently?
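For concreteness, here is a minimal sketch of one direction, not a verified solution: replacing the inner Python loop with a single call to normalized_damerau_levenshtein_distance_ndarray (already imported above), which scores one string against an entire NumPy array of candidates at once. It reuses data, np and time from the block above, I am assuming the _ndarray helper accepts a NumPy array of strings, and I have not benchmarked it on the full 200k rows:

# Candidate IDs as a NumPy array of strings
lookup_array = np.array(data['UNIQUE_ID'], dtype=str)

start = time.time()
for index, row in data.iterrows():
    if index % 5000 == 0:
        print(index, time.time() - start)
    # Distances from this ID to every candidate, computed in one call instead of a Python loop
    distances = normalized_damerau_levenshtein_distance_ndarray(row['UNIQUE_ID'], lookup_array)
    close = np.where(distances < 0.15)[0]
    if close.size:
        first = close[0]  # first hit = highest-revenue close combination, as in the loop above
        data.at[index, 'UPDATED_ID'] = lookup_array[first]
        data.at[index, 'Ratio_Val'] = distances[first]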