I have two data frames that I'm trying to merge, based on a primary & foreign key of company name. One data set has ~50,000 unique company names, the other one has about 5,000. Duplicate company names are possible within each list.
To that end, I've tried to follow along the first solution from Figure out if a business name is very similar to another one - Python. Here's an MWE:
mwe1 = pd.DataFrame({'company_name': ['Deloitte',
'PriceWaterhouseCoopers',
'KPMG',
'Ernst & Young',
'intentionall typo company XYZ'
],
'revenue': [100, 200, 300, 250, 400]
}
)
mwe2 = pd.DataFrame({'salesforce_name': ['Deloite',
'PriceWaterhouseCooper'
],
'CEO': ['John', 'Jane']
}
)
I am trying to get the following code from Figure out if a business name is very similar to another one - Python to work:
# token2frequency is just a word counter of all words in all names
# in the dataset
def sequence_uniqueness(seq, token2frequency):
return sum(1/token2frequency(t)**0.5 for t in seq)
def name_similarity(a, b, token2frequency):
a_tokens = set(a.split())
b_tokens = set(b.split())
a_uniq = sequence_uniqueness(a_tokens)
b_uniq = sequence_uniqueness(b_tokens)
return sequence_uniqueness(a.intersection(b))/(a_uniq * b_uniq) ** 0.5
How do I apply those two functions to produce a similarity score between each possible combination of mwe1
and mwe2
, then filter such that to the most probable matches?
For example, I'm looking for something like this (I'm just making up the scores in the similarity_score
column:
company_name revenue salesforce_name CEO similarity_score
Deloitte 100 Deloite John 98
PriceWaterhouseCoopers 200 Deloite John 0
KPMG 300 Deloite John 15
Ernst & Young 250 Deloite John 10
intentionall typo company XYZ 400 Deloite John 2
Deloitte 100 PriceWaterhouseCooper Jane 20
PriceWaterhouseCoopers 200 PriceWaterhouseCooper Jane 97
KPMG 300 PriceWaterhouseCooper Jane 5
Ernst & Young 250 PriceWaterhouseCooper Jane 7
intentionall typo company XYZ 400 PriceWaterhouseCooper Jane 3
I'm also open to better end-states, if you can think of one. Then, I'd filter that table above to get something like:
company_name revenue salesforce_name CEO similarity_score
Deloitte 100 Deloite John 98
PriceWaterhouseCoopers 200 PriceWaterhouseCooper Jane 97
Here's what I've tried:
name_similarity(a = mwe1['company_name'], b = mwe2['salesforce_name'], token2frequency = 10)
AttributeError: 'Series' object has no attribute 'split'
I'm familiar with using lambda functions but not sure how to make it work when iterating through two columns in two Pandas data frames.