The purpose of the code is to cluster similar company names together so I can assign the similar names a shared code. I still need to make some adjustments for the desired output, but the problem I have right now is that it's too slow. For 20k rows it takes 2 minutes to complete; for 100k rows it takes 20 minutes (my laptop is very slow, so on a faster machine the timing would probably be better). The runtime grows much faster than linearly with the row count, and the program may eventually be used on 1+ million rows, which would take days to complete.
I tried multiprocessing, but I don't know if it would work for this problem. I think the logic behind name_category and get_most_similar_word doesn't lend itself well to multiprocessing, or I just wasn't able to implement it correctly.
Some of the other algorithms I had in mind:
1- TF-IDF and cosine similarity
2- Trie data structure
3- N-grams
I tried implementing them, but the output was poor, so I stopped; I'm not sure whether they could give faster results without the output quality suffering. (A rough sketch of what I mean by the TF-IDF approach is below.)
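For reference, this is roughly the TF-IDF + cosine similarity direction I mean. It is only a minimal sketch using scikit-learn; the example names, the character n-gram range, and the 0.8 threshold are placeholders, not values from my real attempt:

# Rough sketch of option 1 (TF-IDF + cosine similarity), not my actual attempt.
# Assumes scikit-learn; the n-gram range and 0.8 threshold are arbitrary choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["acme inc", "acme incorporated", "globex ltd", "globex limited"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(names)      # sparse (n_names x n_ngrams) matrix
similarities = cosine_similarity(tfidf)      # dense (n_names x n_names) matrix

# Greedy grouping against all earlier names (simplified; my real loop only
# compares against the category keys).
name_category = {}
for i, name in enumerate(names):
    best_j, best_sim = None, 0.0
    for j in range(i):
        if similarities[i, j] > best_sim:
            best_j, best_sim = j, similarities[i, j]
    if best_j is not None and best_sim > 0.8:
        name_category.setdefault(names[best_j], []).append(i)
    else:
        name_category[name] = [i]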
from rapidfuzz import fuzz  # or: from fuzzywuzzy import fuzz

def get_most_similar_word(word, wordlist):
    # Return the candidate in wordlist most similar to word, plus its score (0-100).
    top_similarity = 0.0
    most_similar_word = word
    for candidate in wordlist:
        if candidate == word:
            # Exact match: no need to keep scanning.
            top_similarity = 100
            most_similar_word = candidate
            break
        similarity = fuzz.token_sort_ratio(word, candidate)
        if similarity > top_similarity:
            top_similarity = similarity
            most_similar_word = candidate
    return most_similar_word, top_similarity
# String cleaning: df['name'] -> df['clean_name']. I strip() and lower() each name and remove
# common words like INC and LTD, but only when they are surrounded by one of the characters ./-_,
# and afterwards those characters themselves are removed too. Before all of that, the names are
# sorted by frequency, so that when name_category is built the most frequent spelling becomes the
# category key, which improves accuracy.
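# Roughly what that cleaning step looks like (a sketch, not my exact code; the regex
# and the clean_company_name helper are only here for illustration):
import re

def clean_company_name(name):
    name = name.strip().lower()
    # Drop "inc"/"ltd" only when wrapped by one of ./-_, on both sides, e.g. ",inc." or "-ltd,".
    name = re.sub(r"[./\-_,]\s*(?:inc|ltd)\s*[./\-_,]", " ", name)
    # Then remove the remaining ./-_, characters themselves.
    name = re.sub(r"[./\-_,]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

df['clean_name'] = df['name'].apply(clean_company_name)
# Sort so the most frequent cleaned names come first and become the category keys.
df['freq'] = df.groupby('clean_name')['clean_name'].transform('count')
df = df.sort_values('freq', ascending=False)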
name_category = {}
for index, name in df['clean_name'].items():
    most_similar_categorized_name, top_similarity = get_most_similar_word(name, name_category.keys())
    if top_similarity > 90:
        name_category.setdefault(most_similar_categorized_name, []).append(index)
    else:
        name_category[name] = [index]
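For comparison, here is the same categorization loop written with rapidfuzz's process.extractOne, which scans the candidate keys in compiled code with an early score cutoff. This is only a sketch and assumes the fuzz I use is (or can be swapped for) rapidfuzz; note that score_cutoff=90 keeps scores >= 90, whereas my loop above requires > 90.

from rapidfuzz import fuzz, process

name_category = {}
for index, name in df['clean_name'].items():
    # extractOne returns (match, score, index) or None if no candidate reaches score_cutoff.
    match = process.extractOne(name, list(name_category.keys()),
                               scorer=fuzz.token_sort_ratio, score_cutoff=90)
    if match is not None:
        name_category.setdefault(match[0], []).append(index)
    else:
        name_category[name] = [index]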
I need a faster and preferably more accurate name categorization algorithm for my project. Ideally it should be at least twice as fast as this, but even a small improvement would be a great help.