
I have two data frames that I'm trying to merge, based on a primary & foreign key of company name. One data set has ~50,000 unique company names, the other one has about 5,000. Duplicate company names are possible within each list.

To that end, I've tried to follow along with the first solution from Figure out if a business name is very similar to another one - Python. Here's an MWE:

import pandas as pd

mwe1 = pd.DataFrame({'company_name': ['Deloitte',
                                      'PriceWaterhouseCoopers', 
                                      'KPMG',
                                      'Ernst & Young',
                                      'intentionall typo company XYZ'
                                     ],
                    'revenue': [100, 200, 300, 250, 400]
                   }
                  )

mwe2 = pd.DataFrame({'salesforce_name': ['Deloite',
                                         'PriceWaterhouseCooper'
                                        ],
                     'CEO': ['John', 'Jane']
                    }
                   )

I am trying to get the following code from Figure out if a business name is very similar to another one - Python to work:

# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
    return sum(1/token2frequency[t]**0.5 for t in seq)

def name_similarity(a, b, token2frequency):
    a_tokens = set(a.split())
    b_tokens = set(b.split())
    a_uniq = sequence_uniqueness(a_tokens, token2frequency)
    b_uniq = sequence_uniqueness(b_tokens, token2frequency)
    return sequence_uniqueness(a_tokens.intersection(b_tokens), token2frequency) / (a_uniq * b_uniq) ** 0.5
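
If I'm reading the comment right, token2frequency is just a word counter over all the names in both frames. Here's my assumption of how it would be built (using collections.Counter):

from collections import Counter

# Count every whitespace-separated token across both name columns
# (my reading of "word counter of all words in all names in the dataset")
all_names = pd.concat([mwe1['company_name'], mwe2['salesforce_name']])
token2frequency = Counter(token for name in all_names for token in name.split())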

How do I apply those two functions to produce a similarity score between each possible combination of mwe1 and mwe2, then filter down to the most probable matches?

For example, I'm looking for something like this (I'm just making up the scores in the similarity_score column):

company_name                   revenue    salesforce_name         CEO     similarity_score
Deloitte                       100        Deloite                 John    98
PriceWaterhouseCoopers         200        Deloite                 John    0
KPMG                           300        Deloite                 John    15
Ernst & Young                  250        Deloite                 John    10
intentionall typo company XYZ  400        Deloite                 John    2
Deloitte                       100        PriceWaterhouseCooper   Jane    20
PriceWaterhouseCoopers         200        PriceWaterhouseCooper   Jane    97
KPMG                           300        PriceWaterhouseCooper   Jane    5
Ernst & Young                  250        PriceWaterhouseCooper   Jane    7
intentionall typo company XYZ  400        PriceWaterhouseCooper   Jane    3

I'm also open to better end-states, if you can think of any. Then, I'd filter that table above to get something like:

company_name                   revenue    salesforce_name         CEO     similarity_score
Deloitte                       100        Deloite                 John    98
PriceWaterhouseCoopers         200        PriceWaterhouseCooper   Jane    97

Here's what I've tried:

name_similarity(a = mwe1['company_name'], b = mwe2['salesforce_name'], token2frequency = 10)
AttributeError: 'Series' object has no attribute 'split'

I'm familiar with using lambda functions but not sure how to make it work when iterating through two columns in two Pandas data frames.
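
For what it's worth, here's my guess at what the pairwise application might look like: a cross join followed by a row-wise apply, assuming token2frequency is built as sketched above and the functions are fixed to pass it through (cross joins need pandas >= 1.2). I haven't verified this is the right approach:

# Score every (company_name, salesforce_name) pair, then keep the best
# company_name for each salesforce_name.
pairs = mwe1.merge(mwe2, how='cross')
pairs['similarity_score'] = pairs.apply(
    lambda row: name_similarity(row['company_name'], row['salesforce_name'], token2frequency),
    axis=1
)
best = pairs.loc[pairs.groupby('salesforce_name')['similarity_score'].idxmax()]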

user2205916

1 Answer


Here is a class I wrote using difflib that should be close to what you need.

import difflib
from typing import Optional

import pandas as pd


class FuzzyMerge:
    """
    Works like pandas merge except it merges on approximate matches.
    """
    def __init__(self, **kwargs):
        self.left = kwargs.get("left")
        self.right = kwargs.get("right")
        self.left_on = kwargs.get("left_on")
        self.right_on = kwargs.get("right_on")
        self.how = kwargs.get("how", "inner")
        self.cutoff = kwargs.get("cutoff", 0.8)

    def merge(self) -> pd.DataFrame:
        # Replace each right-hand name with its closest match from the left
        # frame, merge on that column, then score each matched pair.
        temp = self.right.copy()
        temp[self.left_on] = [
            self.get_closest_match(x, self.left[self.left_on]) for x in temp[self.right_on]
        ]

        df = self.left.merge(temp, on=self.left_on, how=self.how)
        df["similarity_percent"] = df.apply(
            lambda x: self.similarity_score(x[self.left_on], x[self.right_on]), axis=1
        )

        return df

    def get_closest_match(self, left: str, right: pd.Series) -> Optional[str]:
        matches = difflib.get_close_matches(left, right, cutoff=self.cutoff)

        return matches[0] if matches else None

    @staticmethod
    def similarity_score(left: str, right: str) -> int:
        return int(round(difflib.SequenceMatcher(a=left, b=right).ratio(), 2) * 100)

Call it with:

df = FuzzyMerge(left=df1, right=df2, left_on="column from df1", right_on="column from df2", how="inner", cutoff=0.8).merge()
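
For example, with the MWE frames from the question (and the default 0.8 cutoff), I'd expect something like:

# Deloite should pair with Deloitte, and PriceWaterhouseCooper with
# PriceWaterhouseCoopers, each row carrying a similarity_percent column.
merged = FuzzyMerge(
    left=mwe1, right=mwe2,
    left_on="company_name", right_on="salesforce_name",
    how="inner", cutoff=0.8,
).merge()
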
Jason Baker
  • The code worked pretty nicely in that it gave no errors, produced a Pandas dataframe, and included a similarity score. But right now it gives a similarity score of 100 for: `ACADIA Pharmaceuticals Inc.` and `Shanghai Pharmaceuticals`; `Zuger Kantonalbank` and `Basler Kantonalbank`; `Zuger Kantonalbank` and `Luzerner Kantonalbank` – user2205916 Nov 30 '22 at 22:23
  • Another weird one: `Ynnovate` and `Novatek` have a similarity score of 100 – user2205916 Nov 30 '22 at 22:30
  • 1
    odd Ynnovate and Novatek have a similarity score of 67 in a small test I just ran. When I find time I'll try to use your data. – Jason Baker Nov 30 '22 at 22:36
  • Hm, I got your code to work with the MWE dataframes in my original post. Interesting to note that the similarity score for all pairs/combinations is 100 on my actual dataset. I immediately checked the data type, but it is indeed an "object" dtype in my actual dataset, like the MWE, so it's not that. – user2205916 Nov 30 '22 at 23:26