1

I have 2 pandas data frames that I would like to do a fuzzy match based on one of the columns. One data frame (the reference data frame) with ~ 5000 rows contains aliases of names with similar ID's that I would like to match up with the other data frame (external) of ~1500 rows with ID's as well.

So for each match, I will get the fuzzy scores and then decide which score I would like to use as the best match between both data frames. A subset of both data frames is used in the example below.

import pandas as pd    
from fuzzywuzzy import fuzz

#Reference data frame
aka_df = pd.DataFrame({"grp_id":['M-00353','M-00353','M-00353','M-00538','M-00538','M-00160','M-00160','M-00160','M-00509','M-00509','M-00509','M-00509'],
    "name": ['Buatan I Mill','Inti Indosawit Subur (PKS Buatan I)','Pt Inti Indosawit Subur Buatan I','Batang Kulim','Musim Mas - Batang Kulim','Hindoli (Sungai Lilin)','Hindoli (Pks Sei Lilin)','PT Hindoli',
'Agrowiratama Sukajadi Sawit Mekar 1','PKS PT. Sukajadi Sawit Mekar','Sukajadi Sawit Mekar','Sukajadi Sawit Mekar 1']})

# External data frame    
ext_df = pd.DataFrame({"uml_id": ['P021','P054','P058','P106'],
"mill_name": ['Pt Inti Indosawit Subur - Buatan I','Pt. Musim Mas - Pks Batang Kulim', 'Cargill - Hindoli', 'Pks Pt. Sukajadi Sawit Mekar']})


compare = pd.MultiIndex.from_product([aka_df['name'],ext_df['mill_name']]).to_series()

def metrics(tup):
    return pd.Series([fuzz.ratio(*tup)],['ratio'])

# Create df
compare_df = compare.apply(metrics)

# Reshaping data frame
merge_df = compare_df.reset_index()
unstack_df = merge_df.groupby(['level_0','level_1'])[ 'ratio'].mean().unstack('level_1')

My reference for the code to compare the 2 data frames is from this question. It does roughly give me what I need which is this:

enter image description here

However, the more critical column I need from the matching matrix is the ID's from both data frames as shown below.

enter image description here

I am still pretty new to python and I am sure this is pretty easy to do but some direction on this is very much much appreciated.

Funkeh-Monkeh
  • 649
  • 6
  • 17
  • Similar question [here](https://stackoverflow.com/questions/38969383/fuzzy-string-matching-in-python) that might help you. – BernardL Oct 25 '18 at 17:31

1 Answers1

1

You can use merge() along with a dictionary comprehension in your function passed to apply():

import pandas as pd    
from fuzzywuzzy import process,fuzz

def get_matches(x):

    matches = process.extract(x['name'], ext_df['mill_name'], scorer=fuzz.ratio)
    return pd.Series({ext_df.iloc[i[2]]['uml_id']: i[1] for i in matches})

aka_df.merge(aka_df.sort_values('name').apply(get_matches, axis=1), left_index=True, right_index=True)

Yields:

     grp_id                                 name  P021  P054  P058  P106
0   M-00353                        Buatan I Mill    34    36    27    29
1   M-00353  Inti Indosawit Subur (PKS Buatan I)    91    48    27    32
2   M-00353     Pt Inti Indosawit Subur Buatan I    97    50    29    40
3   M-00538                         Batang Kulim    30    55    28    15
4   M-00538             Musim Mas - Batang Kulim    38    86    34    23
5   M-00160               Hindoli (Sungai Lilin)    36    15    37    33
6   M-00160              Hindoli (Pks Sei Lilin)    32    37    36    32
7   M-00160                           PT Hindoli    36    33    59    26
8   M-00509  Agrowiratama Sukajadi Sawit Mekar 1    35    15    12    70
9   M-00509         PKS PT. Sukajadi Sawit Mekar    39    30    18   100
10  M-00509                 Sukajadi Sawit Mekar    33    19    11    83
11  M-00509               Sukajadi Sawit Mekar 1    36    22    15    80

Also, here is a helpful link that offers some insight into the different fuzzywuzzy scoring algorithms.

rahlf23
  • 8,869
  • 4
  • 24
  • 54
  • just a quick question @rahlf23 - do you know why quite a large number of values come out as `nan` once I apply it to be a bigger dataset? I am hoping to get an entire range of scores (0-100) – Funkeh-Monkeh Oct 26 '18 at 02:34
  • It's tough to say without seeing a sample dataframe that reproduces the "issue". You could troubleshoot the problem by adding print statements in the applied function `get_matches()` to see why you are getting `nan` values. – rahlf23 Oct 26 '18 at 04:34
  • Thanks @rahlf23, it was a matter of changing the`limit` in the `process extract` function – Funkeh-Monkeh Oct 26 '18 at 16:29