I want to find similarities in a long list of strings. That is, for every string in the list, I need all similar strings from the same list. Earlier I used FuzzyWuzzy, which gave the accuracy I wanted using fuzz.partial_token_sort_ratio. The only problem is the time it takes, since the list contains ~50k entries of up to 40 characters each; a full run took up to 36 hours for the 50k strings.
To improve the runtime I tried the rapidfuzz library, which reduced the time to around 12 hours while giving the same output as FuzzyWuzzy, inspired by an answer here. Later I tried TF-IDF with cosine similarity using the string-grouper library, inspired by this blog, which gave a fantastic speed-up. On closer inspection of the results, though, the string-grouper method missed matches like 'DARTH VADER' and 'VADER', which fuzzywuzzy and rapidfuzz did catch. This is understandable given how TF-IDF works; it seems to miss short strings altogether. Is there any workaround to improve the matching of string-grouper in this case, or to improve the time taken by rapidfuzz? Any faster iteration methods, or any other way to approach the problem?
The data is preprocessed and contains all strings in CAPS without special characters or numbers.
Time taken per iteration is ~1s. Here is the code for rapidfuzz:
from rapidfuzz import process, fuzz
matches = []
for index, row in df.iterrows():
    # Compare each name against the whole column, keeping only scores >= 80
    matches.append(process.extract(row['names'], df['names'], scorer=fuzz.partial_token_set_ratio, score_cutoff=80))
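One possible speed-up for the rapidfuzz side (a sketch, not something I have benchmarked on this data): if your rapidfuzz version ships process.cdist, you can compute the whole pairwise score matrix in a single multithreaded call instead of looping over rows in Python. The column name, cutoff and scorer below simply mirror my snippet above; the dtype and chunking notes are assumptions you would need to adapt to your memory budget.
from rapidfuzz import process, fuzz
import numpy as np

names = df['names'].tolist()

# One multithreaded call computes the full pairwise score matrix in C.
# Note: a 50k x 50k uint8 matrix is ~2.5 GB, so you may need to process
# the rows in chunks if memory is tight.
scores = process.cdist(
    names, names,
    scorer=fuzz.partial_token_set_ratio,
    score_cutoff=80,   # scores below 80 come back as 0
    dtype=np.uint8,
    workers=-1,        # use all CPU cores
)

# For each string, keep the indices of its similar strings (dropping the self-match).
similar = {}
for i in range(len(names)):
    idx = np.flatnonzero(scores[i])
    similar[i] = idx[idx != i].tolist()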
The string-grouper solution is super fast; here is the code:
from string_grouper import match_strings
matches = match_strings(df['names'])
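For the missed 'DARTH VADER' / 'VADER' pairs, one possible workaround (a sketch, assuming your string_grouper version accepts these config keyword arguments) is to shrink the n-gram size and lower the similarity threshold, so a short string shares proportionally more of its n-grams with a longer one. Whether this recovers enough matches without too many false positives is something to verify on the actual data.
from string_grouper import match_strings

# Smaller n-grams plus a lower similarity threshold let a short string like
# 'VADER' share proportionally more of its n-grams with 'DARTH VADER'.
# Both values here are guesses to tune, not library defaults.
matches = match_strings(
    df['names'],
    ngram_size=2,        # default is 3
    min_similarity=0.6,  # default is 0.8
)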
Some similar problems with fuzzywuzzy are discussed here: (Fuzzy string matching in Python)
Also, in general, are there any other programming languages I could shift to, like R, that might speed this up? Just curious... Thanks for your help!