0

I have two dataframes (~10k row each) and I want to find the best match for each entry in the other.

Specifically, here is an example: For each entry in df1 col user, I want to find the best fuzzy match in the corresponding col in df2. Then I want to include the location of the matching entry and the matched entry in the final dataframe.

import numpy as np
np.random.seed(123)
df1 = pd.DataFrame({'user': ["aparna", "pankaj", "sudhir", "Geeku"], 
                   'location': np.random.choice( [5,7,3], 4)})
df2 = pd.DataFrame({'user': ["aparn", "arup", "Pankaj", "sudhir c", "Geek", "abc"], 
                   'location': np.random.choice( [5,7,3], 6)})

Each dataframe look like this:

      user  location
0     aparn         5
1      arup         3
2    Pankaj         3
3  sudhir c         7
4      Geek         3
5       abc         7

And the final result look like this

      matching_user  location1 matched_user location
0     aparn         5     aparna           7
1    pankaj         3     Pankaj           5
2    sudhir         7     sudhir c         7
...
Kabocha Porter
  • 301
  • 3
  • 8

1 Answers1

1

You can use thefuzz.process.extractOne:

# pip install thefuzz
from thefuzz import process

df1.merge(df2, left_on=df1['user'].apply(lambda x: process.extractOne(x, df2['user'])[0]),
          right_on='user',
          suffixes=('_1', '_2')
         ).drop(columns='user')

output:

   user_1  location_1    user_2  location_2
0  aparna           3     aparn           5
1  pankaj           7    Pankaj           3
2  sudhir           3  sudhir c           7
3   Geeku           3      Geek           3
mozway
  • 194,879
  • 13
  • 39
  • 75
  • Thanks, one more question: this finds a match for each row on `df1`. But I want to set a threshold such that if the best match is way off then do not include them in the final result. How do you do that? That is to say, not every case in `df1` will be included in the final result. – Kabocha Porter Dec 07 '21 at 23:40
  • Check the raw output of `process.extractOne`, there is a threshold, you can use it to filter out what you want. – mozway Dec 08 '21 at 05:04