
I have two example dataframes as follows:

import pandas as pd

df1 = pd.DataFrame({'Name': {0: 'John', 1: 'Bob', 2: 'Shiela'},
                    'Degree': {0: 'Masters', 1: 'Graduate', 2: 'Graduate'},
                    'Age': {0: 27, 1: 23, 2: 21}})

df2 = pd.DataFrame({'Name': {0: 'John S.', 1: 'Bob K.', 2: 'Frank'},
                    'Degree': {0: 'Master', 1: 'Graduated', 2: 'Graduated'},
                    'GPA': {0: 3, 1: 3.5, 2: 4}})

I want to merge them based on the two columns Name and Degree, using fuzzy matching to catch possible duplicates. This is what I have so far, with help from the reference here: Apply fuzzy matching across a dataframe column and save results in a new column

from fuzzywuzzy import fuzz
from fuzzywuzzy import process

# all cross pairs of names from df1 and df2
compare = pd.MultiIndex.from_product([df1['Name'],
                                      df2['Name']]).to_series()

def metrics(tup):
    # score each (df1 name, df2 name) pair with two fuzzy metrics
    return pd.Series([fuzz.ratio(*tup),
                      fuzz.token_sort_ratio(*tup)],
                     ['ratio', 'token'])

compare.apply(metrics)

# for each df2 name, the best-matching df1 name under each metric
compare.apply(metrics).unstack().idxmax().unstack(0)

# for each df1 name, the best-matching df2 name under each metric
compare.apply(metrics).unstack(0).idxmax().unstack(0)

Let's say that if the fuzz.ratio scores of both Name and Degree are higher than 80, we consider the two rows to be the same person, and take the Name and Degree from df1 by default. How can I get the following expected result? Thanks.

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')

     Name     Degree   Age  GPA  duplicatedName  duplicatedDegree
0    John    Masters  27.0  3.0         John S.            Master
1     Bob   Graduate  23.0  3.5          Bob K.         Graduated
2  Shiela   Graduate  21.0  NaN             NaN         Graduated
3   Frank  Graduated   NaN  4.0             NaN          Graduate
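
For reference, the 80-point rule described above can be spelled out by cross-joining the two frames and scoring both columns. This is only a sketch (the `key` helper column and the `_1`/`_2` suffixes are illustrative, not from the original code); note that on this sample data the Name scores land in the 60s–70s, which is why the answer below lowers the threshold.

import pandas as pd
from fuzzywuzzy import fuzz

# cross join via a constant helper key (works on older pandas versions)
pairs = (df1.assign(key=1)
            .merge(df2.assign(key=1), on='key', suffixes=('_1', '_2'))
            .drop(columns='key'))

# fuzzy-score Name and Degree for every (df1 row, df2 row) combination
pairs['name_score'] = pairs.apply(
    lambda r: fuzz.ratio(r['Name_1'], r['Name_2']), axis=1)
pairs['degree_score'] = pairs.apply(
    lambda r: fuzz.ratio(r['Degree_1'], r['Degree_2']), axis=1)

# the stated criterion: both scores above 80 -> treat as the same person
matches = pairs[(pairs['name_score'] > 80) & (pairs['degree_score'] > 80)]
print(matches[['Name_1', 'Name_2', 'Degree_1', 'Degree_2']])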
ah bon

1 Answer


I think the ratio threshold should be lower; 60 works for me here. Score all pairs in a dict comprehension, build a Series, filter it by N, and keep the best match per value. Then map with fillna, and finally merge:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process
from itertools import product

N = 60
# score every (df1 name, df2 name) pair
names = {tup: fuzz.ratio(*tup) for tup in 
           product(df1['Name'].tolist(), df2['Name'].tolist())}

s1 = pd.Series(names)
s1 = s1[s1 > N]
# for each df2 name (index level 1) keep the best-scoring df1 name (level 0)
s1 = s1.groupby(level=1).idxmax().str[0]

print (s1)
Bob K.      Bob
John S.    John
dtype: object

# score every (df1 degree, df2 degree) pair
degrees = {tup: fuzz.ratio(*tup) for tup in 
           product(df1['Degree'].tolist(), df2['Degree'].tolist())}

s2 = pd.Series(degrees)
s2 = s2[s2 > N]
# for each df2 degree (index level 1) keep the best-scoring df1 degree (level 0)
s2 = s2.groupby(level=1).idxmax().str[0]
print (s2)
Graduated    Graduate
Master        Masters
dtype: object

df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])
# generally slower alternative
# df2['Name'] = df2['Name'].replace(s1)
# df2['Degree'] = df2['Degree'].replace(s2)

df = df1.merge(df2, on = ['Name', 'Degree'], how = 'outer')
print (df)
     Name    Degree   Age  GPA
0    John   Masters  27.0  3.0
1     Bob  Graduate  23.0  3.5
2  Shiela  Graduate  21.0  NaN
3   Frank  Graduate   NaN  4.0
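
The expected table in the question also keeps df2's original values as duplicatedName / duplicatedDegree columns. Below is a minimal sketch of that variant, assuming it is run on an unmodified df2 in place of the in-place mapping and merge above (the column names follow the question). For the unmatched rows (Shiela, Frank) these columns end up as NaN or df2's own values rather than the cross-filled values shown in the question's table.

# keep df2's original Name/Degree before mapping them to df1's spelling
df2 = df2.assign(duplicatedName=df2['Name'], duplicatedDegree=df2['Degree'])
df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])
df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])

df = df1.merge(df2, on=['Name', 'Degree'], how='outer')
print (df)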
jezrael
  • Awesome, thanks. But I get a ValueError when printing s1 on my real data: result = values.argmax(axis) ValueError: attempt to get argmax of an empty sequence – ah bon Jan 05 '19 at 09:24
  • @ahbon - hmm, give me some time. – jezrael Jan 05 '19 at 09:26
  • The error happens here: s1 = s1[s1 > N].unstack().idxmax(). My actual data in the Name and Degree columns is not English; maybe that's the reason it doesn't work? – ah bon Jan 05 '19 at 09:31
  • 1
    @ahbon - I think `unstack` create some `NaN`s here, so rewrite code by `s1 = s1[s1 > N] s1 = s1[s1.groupby(level=0).idxmax()]`, check edited answer. – jezrael Jan 05 '19 at 09:32
  • 1
    Still on computing. It looks take some time. – ah bon Jan 05 '19 at 10:07
  • It seems it didn't work very well with my actual data. May I ask: is it possible to append the common or duplicated rows of df2 by adding two new columns, duplicatedName and duplicatedDegree, to df1 for me to check? – ah bon Jan 05 '19 at 16:01
  • @ahbon - Not sure I understand; can you explain with data? – jezrael Jan 05 '19 at 16:03
  • @ahbon - Yes, so use `df2['duplicatedName'] = df2['Name'].map(s1).fillna(df2['Name'])` and `df2['duplicatedDegree'] = df2['Degree'].map(s2).fillna(df2['Degree'])`, and then `df = df1.merge(df2, left_on=['Name', 'Degree'], right_on=['duplicatedName', 'duplicatedDegree'], how='outer')` – jezrael Jan 05 '19 at 16:06
  • I reran the whole test code but can't get the final result df. I think maybe there is a problem here: `df2['Name'] = df2['Name'].map(s1).fillna(df2['Name'])` and `df2['Degree'] = df2['Degree'].map(s2).fillna(df2['Degree'])`. Could you please try again? @jezrael – ah bon Jan 06 '19 at 01:10
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/186261/discussion-between-ahbon-and-jezrael). – ah bon Jan 06 '19 at 14:49
  • Sorry, may I ask a new question? What if df1 and df2 have some duplicated names, but the pairs of Name and Degree are unique? I get an error as follows: raise InvalidIndexError('Reindexing only valid with uniquely' InvalidIndexError: Reindexing only valid with uniquely valued Index objects – ah bon Jan 07 '19 at 11:03
  • @ahbon - Could you create a new question, together with the data that returns this error? Thank you. – jezrael Jan 07 '19 at 11:13
  • I sent you my test data by email, please check. Thanks. @jezrael – ah bon Jan 07 '19 at 12:17