I have 2 excels files. In the first one there is a catalog of items and the category they belong, in the second one, it is a massive recollection of data items.
The problems starts in my excel with data, the strings items compared to the catalog excel file may not be 100% identical one to another, my intention is to try a "string metric" to be able to find the best match, because even when it could not be the 100% identical there is a match between the 2 excels. (With different name)
This may not be the best approach but it was my best try at the moment, I kindly accept suggestions.
My approximation was using Levenshtein distance included in the fuzzywuzzy lib, and kinda work it but take so long so I changed the code to use multiprocessing but now it's taking the same time and doing some weird bugs.
this is my code:
from fuzzywuzzy import fuzz
import pandas as pd
import os
import multiprocessing
data = pd.read_excel("...\\data_items.xlsx")
catalog = pd.read_excel("...\\catalog.xlsx")
def find_best_match(string, catalog_tokens):
# Compute the set of tokens for the input string
string_tokens = set(string.split())
# Compute the similarity score between the input string and each string in the catalog DataFrame
scores = catalog_tokens.apply(lambda x: fuzz.token_set_ratio(string_tokens, x))
# Find the index of the string in the catalog DataFrame with the highest similarity score
index = scores.idxmax()
# Return the best match and category for the input string
return catalog.loc[index, "Item name"], catalog.loc[index, "Category name"]
if __name__ == "__main__":
multiprocessing.freeze_support()
# Precompute the set of tokens for each string in the catalog DataFrame
catalog_tokens = catalog["Item name"].apply(lambda x: set(x.split()))
# Use multiprocessing to parallelize the computation across multiple cores
pool = multiprocessing.Pool()
results = pool.starmap(find_best_match, [(string, catalog_tokens) for string in data["item"]])
pool.close()
pool.join()
# Unpack the results into separate lists
best_matches, categories = zip(*results)
print(f"Best_matches: {best_matches} | Categories: {categories}")
# Add the best_matches and categories columns to the data DataFrame
data["Item_name"] = best_matches
data["Category"] = categories
data.to_excel("...\\New_data_items.xlsx", index=False)