I have two very large tables df1 and df2 (multiple millions of rows each) of person-related data and each table has a column that contains the name of a person (column name: "Name"). The names of one and the same person can be written differently (e.g. "Jeff McGregor" or "Mr. J McGregor", etc.) among the two tables, which is why I want to apply fuzzy string matching with the fuzzywuzzy package in Python (this simply compares two strings and returns a similarity measure).
As an output (see df3 for the desired output table), I would like to fill the "Match_Flag" and the "Match_List" columns in the df1 according to the entries in df2. For every (unique) person in df1, I want to check if there are (fuzzy string) matches in the df2. If there is a string, the column "Match_Flag" should contain a "yes" and if not, a "no". The "Match_list" column should contain for every name a list of matches. If there is one match, the list would contain one entry and if there are e.g. three matches, the list would contain 3 matches. If there is no match, the list should be just empty.
This is the data:
df1
data_df1 = {'ID':[56382, 34732, 12423, 29574, 76532],
'Name':['Tom Hilley', 'Andreas Puthz', 'Jeff McGregor', 'Jack Ebbstein', 'Lisa Norwat'],
'Match_Flag':["", "", "", "", ""],
'Match_List':["", "", "", "", ""]}
df1 = pd.DataFrame(data_df1)
print(df1)
ID Name Match_Flag Match_List
0 56382 Tom Hilley
1 34732 Andreas Puthz
2 12423 Jeff McGregor
3 29574 Jack Ebbstein
4 76532 Lisa Norwat
df2
data_df2 = {'Name':['Tom Hilley', 'Madalina Peter', 'Russel Cross', 'Jenni Pey', 'Kanush Hawks', 'Mr. J McGregor', 'Ebbstein Jack', 'Mr. Jack Ebbstein'],
'Age':[16, 56, 33, 44, 24, 26, 86, 32]}
df2 = pd.DataFrame(data_df2)
print(df2)
Name Age
0 Tom Hilley 16
1 Madalina Peter 56
2 Russel Cross 33
3 Jenni Pey 44
4 Kanush Hawks 24
5 Mr. J McGregor 26
6 Ebbstein Jack 86
7 Mr. Jack Ebbstein 32
df3
data_df3 = {'ID':[56382, 34732, 12423, 29574, 76532],
'Name':['Tom Hilley', 'Andreas Puthz', 'Jeff McGregor', 'Jack Ebbstein', 'Lisa Norwat'],
'Match_Flag':["yes", "no", "yes", "yes", "no"],
'Match_List':[["Tom Hilley"], [], ["Mr. J McGregor"], ["Ebbstein Jack","Mr. Jack Ebbstein"], []]}
df3 = pd.DataFrame(data_df3)
print(df3)
ID Name Match_Flag Match_List
0 56382 Tom Hilley yes [Tom Hilley]
1 34732 Andreas Puthz no []
2 12423 Jeff McGregor yes [Mr. J McGregor]
3 29574 Jack Ebbstein yes [Ebbstein Jack, Mr. Jack Ebbstein]
4 76532 Lisa Norwat no []
My approach:
# import libraries
import pandas as pd
from fuzzywuzzy import fuzz
# create matching
for i in df1["Name"].unique().tolist():
# initialize matching list
matching_list = []
for j in df2["Name"].unique().tolist():
# create matching score
if fuzz.token_set_ratio(i, j) >= 90:
matching_list.append(j)
# create red flags
if matching_list:
df1.loc[df1['Name'] == i,'Match_Flag'] = 'yes'
df1.loc[df1['Name'] == i,'Match_List'] = matching_list
else:
df1.loc[df1['Name'] == i,'Match_Flag'] = 'no'
df1.loc[df1['Name'] == i,'Match_List'] = ["-"]
Output of my approach:
line 611, in _setitem_with_indexer
raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable
Since my approach is 1. not working and 2. it will be way too slow for millions of rows, I ask you to help me and find a more efficient and working approach please.