6

I'm checking for similar results (fuzzy matches) across 4 identical dataframe columns, using the following code as an example. When I apply it to the real 40,000 rows x 4 columns dataset, it runs seemingly forever. The issue is that the code is too slow: if I limit the dataset to 10 users, it takes 8 minutes to compute, and for 20 users, 19 minutes. Is there anything I am missing? I do not know why this takes so long. I expect to have all results in 2 hours or less. Any hint or help would be greatly appreciated.

from fuzzywuzzy import process

dataframecolumn = ["apple", "tb"]
compare = ["adfad", "apple", "asple", "tab"]

# Score every entry of dataframecolumn against every entry of compare
Ratios = [process.extract(x, compare) for x in dataframecolumn]

# Keep the first match per entry whose score is not a perfect 100
result = []
for ratio in Ratios:
    for match in ratio:
        if match[1] != 100:
            result.append(match)
            break
print(result)

Output: [('asple', 80), ('tab', 80)]

ecp
  • Try implementing Levenshtein distance on your own and see the difference in execution times; maybe the library you're using has time-efficiency problems – jennifer lawrence May 08 '19 at 12:49
  • @CristianIacob I am using fuzzywuzzy – ecp May 08 '19 at 13:03
  • I've solved this problem in past but would need more info here. Are you looking for string similarity using ratio or partial ratio or both ? Are you using extract function just to speed up the process ? – Atendra Gautam May 10 '19 at 12:09
  • @Atendra using ratio (not partial), and I am using the extract function to speed it up, yes. As far as I can achieve the expected result, I am open to any new code. – ecp May 10 '19 at 15:08
  • I hope below answer solves your problem. – Atendra Gautam May 16 '19 at 09:32
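Following the suggestion in the comments, a hand-rolled Levenshtein distance could serve as a timing baseline against the library. This is a minimal sketch (the function name `levenshtein` is illustrative, not from any library), using the classic dynamic-programming recurrence:

```python
def levenshtein(a, b):
    # Classic dynamic-programming edit distance, O(len(a) * len(b)) time,
    # keeping only the previous row to use O(len(b)) memory.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                 # deletion
                            curr[j - 1] + 1,             # insertion
                            prev[j - 1] + (ca != cb)))   # substitution
        prev = curr
    return prev[-1]

print(levenshtein("apple", "asple"))  # 1 (one substitution)
print(levenshtein("tb", "tab"))       # 1 (one insertion)
```

Timing this against `fuzz.ratio` on a sample of the real data would show whether the library overhead, rather than the algorithm itself, dominates the runtime.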

1 Answer

12

Major speed improvements come from writing vectorized operations and avoiding explicit Python loops.

Import the necessary packages

from fuzzywuzzy import fuzz
import pandas as pd
import numpy as np

Creating dataframe from first list

dataframecolumn = pd.DataFrame(["apple","tb"])
dataframecolumn.columns = ['Match']

Creating dataframe from second list

compare = pd.DataFrame(["adfad","apple","asple","tab"])
compare.columns = ['compare']

Merge - Cartesian product by introducing a key (cross join)

dataframecolumn['Key'] = 1
compare['Key'] = 1
combined_dataframe = dataframecolumn.merge(compare,on="Key",how="left")
combined_dataframe = combined_dataframe[~(combined_dataframe.Match==combined_dataframe.compare)]

Vectorization (note that np.vectorize is essentially a convenience loop under the hood, not a true NumPy kernel; the speed-up here comes mainly from scoring each pair exactly once instead of calling process.extract repeatedly)

def partial_match(x, y):
    # plain ratio, despite the function's name
    return fuzz.ratio(x, y)
partial_match_vector = np.vectorize(partial_match)

Using the vectorized scorer and getting the desired result by putting a threshold on the score

combined_dataframe['score']=partial_match_vector(combined_dataframe['Match'],combined_dataframe['compare'])
combined_dataframe = combined_dataframe[combined_dataframe.score>=80]

Results

+--------+-----+---------+-------+
| Match  | Key | compare | score |
+--------+-----+---------+-------+
| apple  | 1   | asple   |    80 |
| tb     | 1   | tab     |    80 |
+--------+-----+---------+-------+
Atendra Gautam
  • Executing your script, the result differs from the one you show. It returns ```Match Key compare score 1 apple 1 apple 100``` – ecp May 20 '19 at 07:50
  • I forgot to include the negation; corrected it now. Try it and let me know. It's now giving the same output. – Atendra Gautam May 20 '19 at 10:08
  • This doesn't work with very large datasets. I have 50,000 strings in a column to compare to another column of 400,000 strings. There is a `Memoryerror` when merging these with your code `Merge - Cartesian product by introducing key(self join)` – SCool Jan 20 '20 at 12:29
  • @SCool even though each column is moderate on its own, the number of combinations grows significantly. Can you reduce these combinations through some filter or logic? Do you really need to match all 50K strings against 400K strings? If yes, you would need more RAM. – Atendra Gautam Jan 20 '20 at 14:10
  • Very useful thanks! Just to add I had to try different fuzzy methods with my data, `token_set_ratio` worked best. https://towardsdatascience.com/natural-language-processing-for-fuzzy-string-matching-with-python-6632b7824c49 – Mike Honey May 03 '20 at 04:52