0

I am trying to group the similar names of companies basis fuzzy matching ( within same column ). But neither they are grouping correctly nor do I have the same number of rows in the resulting dataset. As a result of one to many match, the number of rows are more than what is there in original data.

Input File sample with more records

enter image description here

Input File/Column

  • **Code **
df.loc[:,'Account Name Copy'] = df['Account Name']

compare = pd.MultiIndex.from_product([df['Account Name'],
                                      df['Account Name Copy']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)

Current Output

- Current Output

P.S. The number of rows should remain the same in final output as it is in original data with similar company names being grouped.

Desired Output

Desired Output

Referred to below topics, but didn't get the desired output

https://stackoverflow.com/questions/54865890/fuzzy-match-strings-in-one-column-and-create-new-dataframe-using-fuzzywuzzy

https://stackoverflow.com/questions/71427827/fuzzy-matching-and-grouping

https://stackoverflow.com/questions/60987641/check-if-there-is-a-similar-string-in-the-same-column

https://stackoverflow.com/questions/62085777/fuzzy-match-within-the-same-column-python

Please help !!

ss_0708
  • 183
  • 1
  • 10

1 Answers1

0

Here is a naïve approach with partial_ratio that (may?) lead to your expected output.

from itertools import product
from fuzzywuzzy import fuzz

df = pd.read_excel("file.xlsx")

RATIO = 80 # <-- adjust the ratio here

tups = list(product(df["Account Name"].unique(),
                    df["Account Name"].str.split(r"[-\s]").str[0].unique()))

matches = [(pair[1].title(), pair[0]) for pair in tuples_list
           if fuzz.partial_ratio(pair[1].lower(), pair[0].lower()) >= RATIO]
    
out = pd.DataFrame(index=pd.MultiIndex.from_tuples(set(matches),
                   names=["Grouped", "Account Name"])).sort_index()

Output :

enter image description here

Timeless
  • 22,580
  • 4
  • 12
  • 30
  • Thanks for your help, but it doesn't seem to produce the desired output as there are many rows where it's grouping even on the basis of first letter, example AB China life insurance, ABC China Fin-tech, ABC China finance etc. are also grouped as "A", I am open to any other platform or tool like alteryx, VBA, Visual studio, if that helps in such cases – ss_0708 Apr 20 '23 at 15:16
  • Oops! But as you may know, I made my code based on the given example and in this one I can't find the names you mention in your comment. Anyways, I wish you good luck ;) – Timeless Apr 20 '23 at 15:29
  • Sorry - but that's a huge data, I have edited the question to include few such records. Please review.. – ss_0708 Apr 20 '23 at 16:20