I want to find similarities in a long list of strings. That is, for every string in the list, I need all similar strings from the same list. Earlier I used FuzzyWuzzy, which gave the accuracy I wanted using fuzz.partial_token_sort_ratio. The only problem is the time it takes, since the list contains ~50k entries of up to 40 characters each; a full run took up to 36 hours for the 50k strings.
To improve the runtime I tried the rapidfuzz library, which reduced the time to around 12 hours while giving the same output as FuzzyWuzzy, inspired by an answer here. Later I tried TF-IDF with cosine similarity using the string-grouper library, inspired by this blog, which gave a fantastic speed-up. On closer inspection of the results, though, the string-grouper method missed matches like 'DARTH VADER' and 'VADER', which fuzzywuzzy and rapidfuzz did catch. This is understandable given how TF-IDF works; it seems to miss short strings altogether. Is there any workaround to improve the matching of string-grouper in this case, or to improve the time taken by rapidfuzz? Any faster iteration methods, or any other way to approach the problem?
The data is preprocessed and contains all strings in CAPS without special characters or numbers.
Time taken per iteration is ~1s. Here is the code for rapidfuzz:
from rapidfuzz import process, fuzz
matches = []
for index, row in df.iterrows():
    # Compare each name against the whole column, keeping only scores >= 80
    matches.append(process.extract(row['names'], df['names'], scorer=fuzz.partial_token_set_ratio, score_cutoff=80))
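One possible speed-up for the rapidfuzz side (a sketch, not something I have benchmarked on this data): if your rapidfuzz version ships process.cdist, you can compute the whole pairwise score matrix in a single multithreaded call instead of looping over rows in Python. The column name, cutoff and scorer below simply mirror my snippet above; the dtype and chunking notes are assumptions you would need to adapt to your memory budget.
from rapidfuzz import process, fuzz
import numpy as np

names = df['names'].tolist()

# One multithreaded call computes the full pairwise score matrix in C.
# Note: a 50k x 50k uint8 matrix is ~2.5 GB, so you may need to process
# the rows in chunks if memory is tight.
scores = process.cdist(
    names, names,
    scorer=fuzz.partial_token_set_ratio,
    score_cutoff=80,   # scores below 80 come back as 0
    dtype=np.uint8,
    workers=-1,        # use all CPU cores
)

# For each string, keep the indices of its similar strings (dropping the self-match).
similar = {}
for i in range(len(names)):
    idx = np.flatnonzero(scores[i])
    similar[i] = idx[idx != i].tolist()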
The string-grouper solution is super fast; here is the code:
from string_grouper import match_strings
matches = match_strings(df['names'])
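For the missed 'DARTH VADER' / 'VADER' pairs, one possible workaround (a sketch, assuming your string_grouper version accepts these config keyword arguments) is to shrink the n-gram size and lower the similarity threshold, so a short string shares proportionally more of its n-grams with a longer one. Whether this recovers enough matches without too many false positives is something to verify on the actual data.
from string_grouper import match_strings

# Smaller n-grams plus a lower similarity threshold let a short string like
# 'VADER' share proportionally more of its n-grams with 'DARTH VADER'.
# Both values here are guesses to tune, not library defaults.
matches = match_strings(
    df['names'],
    ngram_size=2,        # default is 3
    min_similarity=0.6,  # default is 0.8
)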
Some similar problems with fuzzywuzzy are discussed here: (Fuzzy string matching in Python)
Also, in general, are there any other programming languages I could shift to, like R, that might speed this up? Just curious... Thanks for your help!