How to merge two pandas DataFrames based on a similarity function?

Question

Given dataset 1

name,x,y
st. peter,1,2
big university portland,3,4

and dataset 2

name,x,y
saint peter3,4
uni portland,5,6

The goal is to merge on

d1.merge(d2, on="name", how="left")

There are no exact matches on name though. So I'm looking to do a kind of fuzzy matching. The technique does not matter in this case, more how to incorporate it efficiently into pandas.

For example, st. peter might match saint peter in the other, but big university portland might be too much of a deviation that we wouldn't match it with uni portland.

One way to think of it is to allow joining with the lowest Levenshtein distance, but only if it is below 5 edits (st. --> saint is 4).

The resulting dataframe should only contain the row st. peter, and contain both "name" variations, and both x and y variables.

Is there a way to do this kind of merging using pandas?

I think you can create new columns `df1['new']` and `df2['new']` your custom function and then merge them by this column like `d1.merge(d2, on="new", how="left")` — jezrael, Feb 13 '16 at 14:13
The function is just deciding which is the best match, and deciding whether or not there is a match at all. Consider the later added Levenhstein distance example. — PascalVKooten, Feb 13 '16 at 14:15
Great question! Did you come up with a solution @PascalVKooten? — Andrew Anderson, Apr 08 '23 at 09:40

majr · Answer 1 · 2016-02-15T11:42:36.660

7

Did you look at fuzzywuzzy?

You might do something like:

import pandas as pd
import fuzzywuzzy.process as fwp

choices = list(df2.name)

def fmatch(row): 
    minscore=95 #or whatever score works for you
    choice,score = fwp.extractOne(row.name,choices)
    return choice if score > minscore else None

df1['df2_name'] = df1.apply(fmatch,axis=1)
merged = pd.merge(df1, 
                  df2,
                  left_on='df2_name',
                  right_on='name',
                  suffixes=['_df1','_df2'],
                  how = 'outer') # assuming you want to keep unmatched records

Caveat Emptor: I haven't tried to run this.

edited Feb 15 '16 at 11:42

answered Feb 15 '16 at 10:15

majr

263
1
6

I wonder how we can record the matched score along the matched choice? Inspecting the distribution of scores, can better guide a meaningful choice for the threshold. – Jariani Dec 26 '20 at 13:26

score 1 · Answer 2 · answered Feb 13 '16 at 15:34

Let's say you have that function which returns the best match if any, None otherwise:

def best_match(s, candidates):
    ''' Return the item in candidates that best matches s.

    Will return None if a good enough match is not found.
    '''
    # Some code here.

Then you can join on the values returned by it, but you can do it in different ways that would lead to different output (so I think, I did not look much at this issue):

(df1.assign(name=df1['name'].apply(lambda x: best_match(x, df2['name'])))
 .merge(df2, on='name', how='left'))

(df1.merge(df2.assign(name=df2['name'].apply(lambda x: best_match(x, df1['name'])))),
           on='name', how='left'))

score 1 · Answer 3 · answered Feb 13 '16 at 15:51

The simplest idea I can get now is to create special dataframe with distances between all names:

>>> from Levenshtein import distance
>>> df1['dummy'] = 1
>>> df2['dummy'] = 1
>>> merger = pd.merge(df1, df2, on=['dummy'], suffixes=['1','2'])[['name1','name2', 'x2', 'y2']]
>>> merger
                     name1         name2  x2  y2
0                st. peter   saint peter   3   4
1                st. peter  uni portland   5   6
2  big university portland   saint peter   3   4
3  big university portland  uni portland   5   6

>>> merger['res'] = merger.apply(lambda x: distance(x['name1'], x['name2']), axis=1)
>>> merger
                     name1         name2  x2  y2  res
0                st. peter   saint peter   3   4    4
1                st. peter  uni portland   5   6    9
2  big university portland   saint peter   3   4   18
3  big university portland  uni portland   5   6   11
>>> merger = merger[merger['res'] <= 5]
>>> merger
       name1        name2  x2  y2  res
0  st. peter  saint peter   3   4    4

>>> del df1['dummy']
>>> del merger['res']
>>> pd.merge(df1, merger, how='left', left_on='name', right_on='name1')
                      name  x  y      name1        name2  x2  y2
0                st. peter  1  2  st. peter  saint peter   3   4
1  big university portland  3  4        NaN          NaN NaN NaN

This wouldn't scale with larger frames. I have a mere 6000 by 6000 matching, but that would explode using this technique. — PascalVKooten, Feb 13 '16 at 20:35
well you can of course speed it up a bit - first join those which are exact match and then use levenshtein thing only on remaining ones — Roman Pekar, Feb 13 '16 at 21:21
The point is that you create at first a 2 x 2 x num_vars sized frame. That would mean a 6000 x 6000 x num_vars sized frame.... — PascalVKooten, Feb 13 '16 at 23:06

How to merge two pandas DataFrames based on a similarity function?

3 Answers3

Linked