This is code for removing partial duplicates within a single column of a dataset. Because every row is scored against every other row, it takes a long time to run even on a dataset of only 2000 rows. Is there any way to reduce the run time?
Here's the code:
from fuzzywuzzy import fuzz, process

rows = [
    "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.",
    "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?",
    "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.",
    "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box.",
]

clean = []
threshold = 80  # this is arbitrary

for row in rows:
    # score this sentence against every sentence, including itself
    # returns [('string', score), ...] sorted by score
    scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
    # scores[0] is the row matched against itself; if the runner-up is a
    # close match, keep the longer of the two
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]], key=len))
    else:
        clean.append(scores[0][0])

# remove exact duplicates
clean = set(clean)
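
For reference, here is a rough sketch of the same keep-the-longer logic written against rapidfuzz (a faster reimplementation of fuzzywuzzy) and its process.cdist, which scores all pairs in one call. This is only a sketch of the idea, not verified to give identical output, and it assumes rapidfuzz and numpy are installed:

import numpy as np
from rapidfuzz import fuzz, process

threshold = 80

# score every row against every other row in a single vectorised call;
# workers=-1 uses all available CPU cores
scores = process.cdist(rows, rows, scorer=fuzz.token_set_ratio, workers=-1)
np.fill_diagonal(scores, 0)  # ignore each row's perfect match with itself

clean = set()
for i, row in enumerate(rows):
    j = int(scores[i].argmax())  # best match among the *other* rows
    if scores[i, j] > threshold:
        clean.add(max(row, rows[j], key=len))  # keep the longer of the pair
    else:
        clean.add(row)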