I have 2 pandas data frames that I would like to do a fuzzy match based on one of the columns. One data frame (the reference data frame) with ~ 5000 rows contains aliases of names with similar ID's that I would like to match up with the other data frame (external) of ~1500 rows with ID's as well.
So for each match, I will get the fuzzy scores and then decide which score I would like to use as the best match between both data frames. A subset of both data frames is used in the example below.
import pandas as pd
from fuzzywuzzy import fuzz
#Reference data frame
aka_df = pd.DataFrame({"grp_id":['M-00353','M-00353','M-00353','M-00538','M-00538','M-00160','M-00160','M-00160','M-00509','M-00509','M-00509','M-00509'],
"name": ['Buatan I Mill','Inti Indosawit Subur (PKS Buatan I)','Pt Inti Indosawit Subur Buatan I','Batang Kulim','Musim Mas - Batang Kulim','Hindoli (Sungai Lilin)','Hindoli (Pks Sei Lilin)','PT Hindoli',
'Agrowiratama Sukajadi Sawit Mekar 1','PKS PT. Sukajadi Sawit Mekar','Sukajadi Sawit Mekar','Sukajadi Sawit Mekar 1']})
# External data frame
ext_df = pd.DataFrame({"uml_id": ['P021','P054','P058','P106'],
"mill_name": ['Pt Inti Indosawit Subur - Buatan I','Pt. Musim Mas - Pks Batang Kulim', 'Cargill - Hindoli', 'Pks Pt. Sukajadi Sawit Mekar']})
compare = pd.MultiIndex.from_product([aka_df['name'],ext_df['mill_name']]).to_series()
def metrics(tup):
return pd.Series([fuzz.ratio(*tup)],['ratio'])
# Create df
compare_df = compare.apply(metrics)
# Reshaping data frame
merge_df = compare_df.reset_index()
unstack_df = merge_df.groupby(['level_0','level_1'])[ 'ratio'].mean().unstack('level_1')
My reference for the code to compare the 2 data frames is from this question. It does roughly give me what I need which is this:
However, the more critical column I need from the matching matrix is the ID's from both data frames as shown below.
I am still pretty new to python and I am sure this is pretty easy to do but some direction on this is very much much appreciated.