0

I would like to merge 2 dataframes, Problem is that the keys I am using do not contain the exact same values. So for example this is what df1 looks like

name                                    val3       
Wilder Deontay                           1
Fury Tyson                               2
Ortiz Luis                               3
Joshua Olaseni Oluwafemi Anthony         4

and df2

name1                        val       
Deontay Wilder               19
Tyson Fury                   20  
Luis Ortiz                   21
Anthony Joshua               10

The expected output is a merge of the two dataframes so

name1                      val          val3
Deontay Wilder             19             1
Tyson Fury                 20             2
Luis Ortiz                 21             3
Anthony Joshua             10             4
Emm
  • 2,367
  • 3
  • 24
  • 50

1 Answers1

1

Here is my solution,

>>> import pandas as pd
>>> from fuzzywuzzy import fuzz
>>> data = {
    'name': ['Wilder Deontay', 'Fury Tyson', 'Ortiz Luis', 'Joshua Olaseni Oluwafemi Anthony'],
    'val3': [1, 2, 3, 4]
}... ... ...
>>> df1 = pd.DataFrame(data)
>>> data2 = {
    'name1': ['Deontay Wilder', 'Tyson Fury', 'Luis Ortiz ', 'Anthony Joshua'],
    'val': [19, 20, 21, 10]
}... ... ...
>>> df2 = pd.DataFrame(data2)
>>> df1['key'] = 1
>>> df2['key'] = 1
>>> merged = df1.merge(df2, on='key')
>>> merged['similarity'] = merged.apply(lambda row: fuzz.token_set_ratio(row['name'], row['name1']), axis=1)
>>> merged[merged.similarity == 100][['name1', 'val', 'val3']]
             name1  val  val3
0   Deontay Wilder   19     1
5       Tyson Fury   20     2
10     Luis Ortiz    21     3
15  Anthony Joshua   10     4

First I make cross merge and then I look at the similarity. For detailed information about fuzzywuzzy and token_set_ratio: https://stackoverflow.com/a/31823872/8205554

Or you can use fuzzymatcher,

>>> from fuzzymatcher import fuzzy_left_join
>>> fuzzy_left_join(df1, df2, 'name', 'name1')[['name1', 'val', 'val3']]
            name1  val  val3
0  Deontay Wilder   19     1
1      Tyson Fury   20     2
2     Luis Ortiz    21     3
3  Anthony Joshua   10     4
E. Zeytinci
  • 2,642
  • 1
  • 20
  • 37