How to get string similarity between elements from 2 dataframes

Question

I have 2 excels files. In the first one there is a catalog of items and the category they belong, in the second one, it is a massive recollection of data items.

The problems starts in my excel with data, the strings items compared to the catalog excel file may not be 100% identical one to another, my intention is to try a "string metric" to be able to find the best match, because even when it could not be the 100% identical there is a match between the 2 excels. (With different name)

This may not be the best approach but it was my best try at the moment, I kindly accept suggestions.

My approximation was using Levenshtein distance included in the fuzzywuzzy lib, and kinda work it but take so long so I changed the code to use multiprocessing but now it's taking the same time and doing some weird bugs.

this is my code:

from fuzzywuzzy import fuzz
import pandas as pd
import os
import multiprocessing

data = pd.read_excel("...\\data_items.xlsx")
catalog = pd.read_excel("...\\catalog.xlsx")


def find_best_match(string, catalog_tokens):
    # Compute the set of tokens for the input string
    string_tokens = set(string.split())
    # Compute the similarity score between the input string and each string in the catalog DataFrame
    scores = catalog_tokens.apply(lambda x: fuzz.token_set_ratio(string_tokens, x))
    # Find the index of the string in the catalog DataFrame with the highest similarity score
    index = scores.idxmax()
    # Return the best match and category for the input string
    return catalog.loc[index, "Item name"], catalog.loc[index, "Category name"]

if __name__ == "__main__":
    multiprocessing.freeze_support()

    # Precompute the set of tokens for each string in the catalog DataFrame
    catalog_tokens = catalog["Item name"].apply(lambda x: set(x.split()))

    # Use multiprocessing to parallelize the computation across multiple cores
    pool = multiprocessing.Pool()
    results = pool.starmap(find_best_match, [(string, catalog_tokens) for string in data["item"]])
    pool.close()
    pool.join()

    # Unpack the results into separate lists
    best_matches, categories = zip(*results)
    print(f"Best_matches: {best_matches}   |  Categories: {categories}")

    # Add the best_matches and categories columns to the data DataFrame
    data["Item_name"] = best_matches
    data["Category"] = categories

data.to_excel("...\\New_data_items.xlsx", index=False)

One optimization you can make is to precompute the set of tokens for each string in the catalog DataFrame outside the find_best_match() function. You can then pass this precomputed data to the function, rather than computing it repeatedly for each string in the data DataFrame. — Abdulmajeed, Mar 03 '23 at 18:10
https://stackoverflow.com/questions/73972481/is-there-a-way-to-speed-up-matching-addresses-and-level-of-confidence-per-match/73973651#73973651 seems to be a relatively similar problem — maxbachmann, Mar 04 '23 at 18:48

How to get string similarity between elements from 2 dataframes

0 Answers0