This is code for removing partial duplicates within a single column of a dataset. Because every row is scored against every other row, it takes a long time to run even on a dataset of only 2000 rows. Is there any way to reduce the run time?
Here's the code:
from fuzzywuzzy import fuzz, process

rows = [
    "I have your Body Wash and I wonder if it contains animal ingredients. Also, which animal ingredients? I prefer not to use product with animal ingredients.",
    "This also doesn't have the ADA on there. Is this a fake toothpaste an imitation of yours?",
    "I have your Body Wash and I wonder if it contains animal ingredients. I prefer not to use product with animal ingredients.",
    "I didn't see the ADA stamp on this box. I just want to make sure it was still safe to use?",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the packaging",
    "Hello, I was just wondering if the new toothpaste is ADA approved? It doesn’t say on the box.",
]

clean = []
threshold = 80  # this is arbitrary

for row in rows:
    # score this sentence against every sentence, including itself
    # returns [('string', score), ...] sorted by score
    scores = process.extract(row, rows, scorer=fuzz.token_set_ratio)
    # scores[0] is the row matched against itself; if the runner-up is a
    # close match, keep the longer of the two
    if scores[1][1] > threshold:
        clean.append(max([x[0] for x in scores[:2]], key=len))
    else:
        clean.append(scores[0][0])

# remove exact duplicates
clean = set(clean)
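
For reference, here is a rough sketch of the same keep-the-longer logic written against rapidfuzz (a faster reimplementation of fuzzywuzzy) and its process.cdist, which scores all pairs in one call. This is only a sketch of the idea, not verified to give identical output, and it assumes rapidfuzz and numpy are installed:

import numpy as np
from rapidfuzz import fuzz, process

threshold = 80

# score every row against every other row in a single vectorised call;
# workers=-1 uses all available CPU cores
scores = process.cdist(rows, rows, scorer=fuzz.token_set_ratio, workers=-1)
np.fill_diagonal(scores, 0)  # ignore each row's perfect match with itself

clean = set()
for i, row in enumerate(rows):
    j = int(scores[i].argmax())  # best match among the *other* rows
    if scores[i, j] > threshold:
        clean.add(max(row, rows[j], key=len))  # keep the longer of the pair
    else:
        clean.add(row)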