The purpose of the code is to cluster similar company names together so I can assign the similar names a shared code. I still need to make some adjustments for the desired output, but the problem I have right now is that it's too slow. For 20k rows it takes 2 minutes to complete; for 100k rows it takes 20 minutes (my laptop is very slow, so on a faster machine the timing would probably be better). The runtime grows much faster than linearly with the row count, and the program may eventually be used on 1+ million rows, which would take days to complete.
I tried multiprocessing, but I don't know if it would work for this problem. I think the logic behind name_category and get_most_similar_word doesn't lend itself well to multiprocessing, or I just wasn't able to implement it correctly.
Some of the other algorithms I had in mind:
1- TF-IDF and cosine similarity
2- Trie data structure
3- N-grams
I tried implementing them, but the output was poor, so I stopped; I'm not sure whether they could give faster results without the output quality suffering. (A rough sketch of what I mean by the TF-IDF approach is below.)
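For reference, this is roughly the TF-IDF + cosine similarity direction I mean. It is only a minimal sketch using scikit-learn; the example names, the character n-gram range, and the 0.8 threshold are placeholders, not values from my real attempt:

# Rough sketch of option 1 (TF-IDF + cosine similarity), not my actual attempt.
# Assumes scikit-learn; the n-gram range and 0.8 threshold are arbitrary choices.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

names = ["acme inc", "acme incorporated", "globex ltd", "globex limited"]

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 3))
tfidf = vectorizer.fit_transform(names)      # sparse (n_names x n_ngrams) matrix
similarities = cosine_similarity(tfidf)      # dense (n_names x n_names) matrix

# Greedy grouping against all earlier names (simplified; my real loop only
# compares against the category keys).
name_category = {}
for i, name in enumerate(names):
    best_j, best_sim = None, 0.0
    for j in range(i):
        if similarities[i, j] > best_sim:
            best_j, best_sim = j, similarities[i, j]
    if best_j is not None and best_sim > 0.8:
        name_category.setdefault(names[best_j], []).append(i)
    else:
        name_category[name] = [i]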
from rapidfuzz import fuzz  # or: from fuzzywuzzy import fuzz

def get_most_similar_word(word, wordlist):
    # Return the candidate in wordlist most similar to word, plus its score (0-100).
    top_similarity = 0.0
    most_similar_word = word
    for candidate in wordlist:
        if candidate == word:
            # Exact match: no need to keep scanning.
            top_similarity = 100
            most_similar_word = candidate
            break
        similarity = fuzz.token_sort_ratio(word, candidate)
        if similarity > top_similarity:
            top_similarity = similarity
            most_similar_word = candidate
    return most_similar_word, top_similarity
# String cleaning: df['name'] -> df['clean_name']. I strip() and lower() each name and remove
# common words like INC and LTD, but only when they are surrounded by one of the characters ./-_,
# and afterwards those characters themselves are removed too. Before all of that, the names are
# sorted by frequency, so that when name_category is built the most frequent spelling becomes the
# category key, which improves accuracy.
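# Roughly what that cleaning step looks like (a sketch, not my exact code; the regex
# and the clean_company_name helper are only here for illustration):
import re

def clean_company_name(name):
    name = name.strip().lower()
    # Drop "inc"/"ltd" only when wrapped by one of ./-_, on both sides, e.g. ",inc." or "-ltd,".
    name = re.sub(r"[./\-_,]\s*(?:inc|ltd)\s*[./\-_,]", " ", name)
    # Then remove the remaining ./-_, characters themselves.
    name = re.sub(r"[./\-_,]", " ", name)
    return re.sub(r"\s+", " ", name).strip()

df['clean_name'] = df['name'].apply(clean_company_name)
# Sort so the most frequent cleaned names come first and become the category keys.
df['freq'] = df.groupby('clean_name')['clean_name'].transform('count')
df = df.sort_values('freq', ascending=False)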
name_category = {}
for index, name in df['clean_name'].items():
    most_similar_categorized_name, top_similarity = get_most_similar_word(name, name_category.keys())
    if top_similarity > 90:
        name_category.setdefault(most_similar_categorized_name, []).append(index)
    else:
        name_category[name] = [index]
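For comparison, here is the same categorization loop written with rapidfuzz's process.extractOne, which scans the candidate keys in compiled code with an early score cutoff. This is only a sketch and assumes the fuzz I use is (or can be swapped for) rapidfuzz; note that score_cutoff=90 keeps scores >= 90, whereas my loop above requires > 90.

from rapidfuzz import fuzz, process

name_category = {}
for index, name in df['clean_name'].items():
    # extractOne returns (match, score, index) or None if no candidate reaches score_cutoff.
    match = process.extractOne(name, list(name_category.keys()),
                               scorer=fuzz.token_sort_ratio, score_cutoff=90)
    if match is not None:
        name_category.setdefault(match[0], []).append(index)
    else:
        name_category[name] = [index]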
I need a faster and preferably more accurate name categorization algorithm for my project. Ideally it should be at least twice as fast as this, but even a small improvement would be a great help.