
I have 2 dataframes df1 and df2 like this:

df1:

Id    Name      
1    Tuy Hòa      
2    Kiến thụy 
3    Bình Tân

df2:

code    name      
A1     Tuy Hoà    
A2     Kiến Thụy  
A3     Tân Bình 

Now when I use merge:

out_df = pd.merge(df1, df2, left_on=['Name'], right_on=['name'], how='outer')

Of course the result contains nulls, because for example Tuy Hòa <> Tuy Hoà (the diacritic sits on a different character), etc. So I used a fuzzy-matching function with apply and a lambda, but the result is not correct at all :(

My function:

from fuzzywuzzy import fuzz, process

def checker(Name, names):
    correct_name = ''
    #correct_ratio = 0
    if Name in names:
        correct_name = Name  # exact match, nothing to correct
        #correct_ratio = 100
    else:
        # extractOne returns None when no candidate reaches score_cutoff
        x = process.extractOne(Name, names, scorer=fuzz.token_set_ratio, score_cutoff=80)
        if x is not None:
            correct_name = x[0]
            #correct_ratio = x[1]
    return correct_name

Then I convert:

names = df2['name'].tolist()

After that I use apply to loop through, compare, and write back the corrected names (df1 must be corrected based on df2):

df1['Name'] = df1['Name'].apply(lambda x: checker(x, names) or x)

The output now is:

df1

Id    Name      
1    Tuy Hòa      
2    Kiến Thụy 
3    Tân Bình

It seems Tuy Hòa and Kiến Thụy are corrected properly, but Bình Tân and Tân Bình are different names! Swapping the word positions here produces a genuinely different place, not a typo :( The expected output should be:

df1

Id    Name      
1    Tuy Hòa      
2    Kiến Thụy 
3    Bình Tân

I tried fuzz.ratio with score_cutoff = 55, but in some cases that misses pairs.

Is there a solution better than mine (maybe without a fuzzy library)? Please help me. Many thanks for your help <3

Konrad Rudolph
Tung Nguyen
  • The scorer you chose for fuzzywuzzy (token set) compares the strings as (as the name suggests) sets of tokens. Sets don't have an internal ordering, so this scorer will return "Foo Bar" as an exact match of "Bar Foo". – Andy Jun 23 '21 at 14:45
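As the comment above points out, a token-set scorer loses word order. A minimal pure-Python illustration of the effect (no fuzzy library involved):

```python
# A token-set comparison splits each string into a set of words.
# Sets are unordered, so "Bình Tân" and "Tân Bình" become identical,
# which is why fuzz.token_set_ratio scores them as a perfect match.
a = set("Bình Tân".split())
b = set("Tân Bình".split())
print(a == b)  # True
```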

2 Answers


Update: the strategy is the same as in my previous answer, but the algorithm has been changed from Levenshtein distance to Damerau–Levenshtein distance.

I slightly modified your input data for better understanding:

>>> df1
    Id        Name
11   1     Tuy Hòa  # match 18 'Tuy Hoà'
12   2   Kiến thụy  # match 16 'Kiến Thụy'
13   3    Bình Tân  # no match
14   4  Louis Paul  # no match

>>> df2
   code       name
16   A2  Kiến Thụy
17   A3   Tân Bình
18   A1    Tuy Hoà
19   A4   John Doe

Use the Damerau–Levenshtein distance (via the pyxDamerauLevenshtein package) to compute the minimum number of operations (insertions, deletions or substitutions of a single character, or transpositions of two adjacent characters) required to change one word into the other (src: Wikipedia).

# pip install pyxDamerauLevenshtein
from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs

seqs = df1['Name'].apply(lambda s: damerau_levenshtein_distance_seqs(s, df2['name']))
>>> seqs
11    [8, 6, 2, 7]  # 11 <-> 18 (2 operations)
12    [1, 7, 8, 7]  # 12 <-> 16 (1 operation)
13    [7, 6, 7, 6]  # 13 <-> 17 or 19 (6 operations)
14    [9, 9, 8, 8]  # 14 <-> 18 or 19 (8 operations)
Name: Name, dtype: object

Apply merge with a maximum operations allowed:

MAXOPS = 3

df1['other'] = seqs.apply(lambda x: df2.index[x.index(min(x))]
                              if min(x) <= MAXOPS else pd.NA)
out = pd.merge(df1, df2, left_on='other', right_index=True, how='outer')
>>> out
       Id        Name other code       name
11.0  1.0     Tuy Hòa    18   A1    Tuy Hoà
12.0  2.0   Kiến thụy    16   A2  Kiến Thụy
13.0  3.0    Bình Tân  <NA>  NaN        NaN
14.0  4.0  Louis Paul  <NA>  NaN        NaN
NaN   NaN         NaN    17   A3   Tân Bình
NaN   NaN         NaN    19   A4   John Doe

Now you are free to drop all columns that you don't need, keep one name, etc.
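For instance (a sketch; it rebuilds a small frame shaped like the matched rows of the output above), keeping df2's canonical spelling could look like:

```python
import pandas as pd

# Small frame shaped like the merge result above (matched rows only)
out = pd.DataFrame({
    'Id': [1, 2],
    'Name': ['Tuy Hòa', 'Kiến thụy'],
    'other': [18, 16],
    'code': ['A1', 'A2'],
    'name': ['Tuy Hoà', 'Kiến Thụy'],
})

# Drop the helper column and df1's spelling, keep df2's 'name' as 'Name'
cleaned = (out.drop(columns=['Name', 'other'])
              .rename(columns={'name': 'Name'}))
print(cleaned)
```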

Corralien
  • Thanks for your help, your code is really clean compared to my function ^^ but the problem is there are exceptional cases like the demo case "Tân Bình" <-> "Bình Tân" (they are different names, as I described above) – Tung Nguyen Jun 24 '21 at 04:04
  • @NLeeT, I update my answer. It should be right on your sample. – Corralien Jun 24 '21 at 08:02
  • Ah I forgot, do you know how to install it in Anaconda? I tried "conda install -c conda-forge pyxdameraulevenshtein" but it doesn't work :( – Tung Nguyen Jun 24 '21 at 17:08
  • pip install is compatible with conda. – Corralien Jun 24 '21 at 17:41

If your deviations between both DataFrames are exclusively a combination of

  • different accents
  • upper-/lowercase differences and
  • swapped first and last names,

I would go for creating a new column name_normalized in both DataFrames which contains the name with accents stripped, lowercased, and (if appropriate for your data) its words sorted, then merge on that column.

If you still do not get exact matches on merge then, you are probably best advised to look at existing fuzzy matching libraries, as outlined in the question is it possible to do fuzzy match merge with python pandas?
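A name_normalized builder along those lines could look like this (a sketch using only the standard library; note that sorting the words also equates genuinely different names like "Bình Tân" and "Tân Bình", so only keep that step if swapped words really are typos in your data):

```python
import unicodedata

def normalize_name(name: str) -> str:
    """Strip accents, lowercase, and sort the words of a name."""
    # Decompose accented characters (NFKD), then drop the combining marks
    no_accents = ''.join(
        c for c in unicodedata.normalize('NFKD', name)
        if not unicodedata.combining(c)
    )
    # Casefold for robust lowercasing; sort tokens to ignore word order
    return ' '.join(sorted(no_accents.casefold().split()))

# df1['name_normalized'] = df1['Name'].map(normalize_name)
# df2['name_normalized'] = df2['name'].map(normalize_name)
print(normalize_name('Tuy Hòa'))    # 'hoa tuy'
print(normalize_name('Tuy Hoà'))    # 'hoa tuy'
print(normalize_name('Kiến thụy'))  # 'kien thuy'
```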

ojdo