
I have two very large tables df1 and df2 (several million rows each) of person-related data, and each table has a column that contains the name of a person (column name: "Name"). The name of one and the same person can be written differently (e.g. "Jeff McGregor" or "Mr. J McGregor", etc.) across the two tables, which is why I want to apply fuzzy string matching with the fuzzywuzzy package in Python (it simply compares two strings and returns a similarity measure).

As an output (see df3 for the desired output table), I would like to fill the "Match_Flag" and "Match_List" columns in df1 according to the entries in df2. For every (unique) person in df1, I want to check if there are (fuzzy string) matches in df2. If there is a match, the column "Match_Flag" should contain a "yes", and if not, a "no". The "Match_List" column should contain for every name a list of its matches. If there is one match, the list contains one entry; if there are e.g. three matches, it contains three entries; if there is no match, the list should just be empty.

This is the data:

df1

data_df1 = {'ID':[56382, 34732, 12423, 29574, 76532], 
           'Name':['Tom Hilley', 'Andreas Puthz', 'Jeff McGregor', 'Jack Ebbstein', 'Lisa Norwat'],
           'Match_Flag':["", "", "", "", ""],
           'Match_List':["", "", "", "", ""]} 

df1 = pd.DataFrame(data_df1) 

print(df1)

      ID           Name Match_Flag Match_List
0  56382     Tom Hilley                      
1  34732  Andreas Puthz                      
2  12423  Jeff McGregor                      
3  29574  Jack Ebbstein                      
4  76532    Lisa Norwat  

df2

data_df2 = {'Name':['Tom Hilley', 'Madalina Peter', 'Russel Cross', 'Jenni Pey', 'Kanush Hawks', 'Mr. J McGregor', 'Ebbstein Jack', 'Mr. Jack Ebbstein'],
           'Age':[16, 56, 33, 44, 24, 26, 86, 32]} 

df2 = pd.DataFrame(data_df2) 

print(df2)

                Name  Age
0         Tom Hilley   16
1     Madalina Peter   56
2       Russel Cross   33
3          Jenni Pey   44
4       Kanush Hawks   24
5     Mr. J McGregor   26
6      Ebbstein Jack   86
7  Mr. Jack Ebbstein   32

df3

data_df3 = {'ID':[56382, 34732, 12423, 29574, 76532], 
               'Name':['Tom Hilley', 'Andreas Puthz', 'Jeff McGregor', 'Jack Ebbstein', 'Lisa Norwat'],
               'Match_Flag':["yes", "no", "yes", "yes", "no"],
               'Match_List':[["Tom Hilley"], [], ["Mr. J McGregor"], ["Ebbstein Jack","Mr. Jack Ebbstein"], []]} 

df3 = pd.DataFrame(data_df3)

print(df3)

     ID           Name Match_Flag                          Match_List
0  56382     Tom Hilley        yes                        [Tom Hilley]
1  34732  Andreas Puthz         no                                  []
2  12423  Jeff McGregor        yes                    [Mr. J McGregor]
3  29574  Jack Ebbstein        yes  [Ebbstein Jack, Mr. Jack Ebbstein]
4  76532    Lisa Norwat         no                                  []

My approach:

# import libraries
import pandas as pd
from fuzzywuzzy import fuzz  

# create matching
for i in df1["Name"].unique().tolist():

    # initialize matching list
    matching_list = []

    for j in df2["Name"].unique().tolist():

        # create matching score
        if fuzz.token_set_ratio(i, j) >= 90:
            matching_list.append(j)

    # create red flags
    if matching_list:
        df1.loc[df1['Name'] == i,'Match_Flag'] = 'yes'
        df1.loc[df1['Name'] == i,'Match_List'] = matching_list
    else:
        df1.loc[df1['Name'] == i,'Match_Flag'] = 'no'
        df1.loc[df1['Name'] == i,'Match_List'] = ["-"]

Output of my approach:

line 611, in _setitem_with_indexer
    raise ValueError('Must have equal len keys and value '

ValueError: Must have equal len keys and value when setting with an iterable

Since my approach is (1) not working and (2) will be way too slow for millions of rows, I would appreciate help finding a working and more efficient approach.

2 Answers


Based on this topic, I believe merging the two dataframes is a lot more efficient than iterating through the whole data. Since you want matched names, you should use an inner join.
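The linked topic isn't reproduced here, but the idea is to pre-process the names into a normalized join key and then inner-join on that key. Below is a minimal sketch of that idea; the name_key helper and the title list are my own assumptions, not taken from the link. Note that this catches token reorderings and titles ("Ebbstein Jack", "Mr. Jack Ebbstein"), but not abbreviations like "J" vs. "Jeff":

import pandas as pd

# Hypothetical helper: reduce a name to a canonical key by lowercasing,
# dropping common titles, and sorting the remaining tokens.
TITLES = {'mr.', 'mrs.', 'ms.', 'dr.'}

def name_key(name):
    tokens = [t for t in name.lower().split() if t not in TITLES]
    return ' '.join(sorted(tokens))

# Inner join on the normalized key keeps only the matched names.
matches = pd.merge(
    df1.assign(key=df1['Name'].map(name_key)),
    df2.assign(key=df2['Name'].map(name_key)),
    on='key', how='inner', suffixes=('_df1', '_df2'),
)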

  • fuzzy match is giving an approximation of a match, not an _exact_ match, thus a join won't work. Normally I would probably just cross join the two tables (not matching on anything) and .apply the function, but multi-million row dataframes would create a trillion+ row dataframe, which wouldn't work. – Matt W. Oct 26 '19 at 13:38
  • this is a bit different; you can see in the topic that there is a bit of pre-processing to make this join fuzzy – prhmma Oct 26 '19 at 13:39
  • oooh interesting. I didn't see the link. That might work! – Matt W. Oct 26 '19 at 13:55

This answer might take a while to run, but should work.

I imported names to create larger dataframes with random names.

import pandas as pd
from fuzzywuzzy import fuzz 
import random
import os
import names



id_col = range(10000)
name_col = [names.get_full_name() for _ in range(10000)]
df1 = pd.DataFrame({'ID':id_col, 'name_col':name_col})

age = [random.randint(1, 95) for _ in range(10000)]
name_col2 = [names.get_full_name() for _ in range(10000)]
df2 = pd.DataFrame({'name_col2':name_col2, 'age':age})

Since we want to iterate through df1, I dropped duplicates of the name column. We're going to do a cross join to pair each row of the first dataframe with every row of the 2nd dataframe, so I assigned a dummy key v=1:

df1_deduped = df1.drop_duplicates('name_col')
df2 = df2.assign(v=1)

Define the fuzzy function to use in .apply:

def func(row):
    # similarity score between the df1 name and the df2 name in this row
    return fuzz.token_set_ratio(row['name_col'], row['name_col2'])

Here we're going to loop through the length of the first dataframe, and for every row (unique name) we join it to the 2nd dataframe. We then .apply the fuzzy function into a tokenthresh column and filter the dataframe down to scores above the threshold of 70. If there are any matches, we write them to a CSV. This way it's not all done in memory, which would most likely be an issue for you with multi-million row dataframes on both sides; this chunks the work into pieces. Alternatively, instead of going row by row through a million-row dataframe, you could do it in chunks of 5 or 10 rows; that might slow it down, I'm not sure.

for i in range(len(df1_deduped)):
    # pair the i-th unique name against every row of df2 via the dummy key
    df3 = pd.merge(df1_deduped.assign(v=1).iloc[[i], :], df2, on='v').drop(['v'], axis=1)
    df3['tokenthresh'] = df3.apply(func, axis=1)
    df3 = df3[df3.tokenthresh > 70]
    print('there are', len(df3), 'records that exceeded the threshold')
    if len(df3) > 0:
        df3.to_csv(str(i) + '.csv', index=False)

We can then read the created files back in (filtering to .csv files, so unrelated files in the directory don't break the parser):

files = []
for file in os.listdir():
    if file.endswith('.csv'):  # only read the CSVs written above
        files.append(pd.read_csv(file))
data = pd.concat(files)

and lastly concatenate the matched names per person:

data['concat_group'] = data.groupby(['ID', 'name_col'])['name_col2'].transform(lambda x: ', '.join(x))
data = data.drop_duplicates(['ID', 'name_col'])
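If you want the question's Match_Flag / Match_List layout instead of a comma-joined string, you could build lists per ID instead (a sketch; column names follow the generated dataframes above):

# collect the matched df2 names per ID as lists (run this before the
# drop_duplicates above, while every match is still its own row)
match_lists = data.groupby('ID')['name_col2'].apply(list)

# map the lists back onto df1; IDs with no match get an empty list
df1['Match_List'] = df1['ID'].map(match_lists)
df1['Match_List'] = df1['Match_List'].apply(lambda x: x if isinstance(x, list) else [])
df1['Match_Flag'] = df1['Match_List'].apply(lambda x: 'yes' if x else 'no')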
  • Thanks for your solution! Unfortunately, I get an error message: "ParserError: Error tokenizing data. C error: Expected 1 fields in line 7, saw 5". Also, it would be way too slow for the size of my data :/ – constiii Oct 26 '19 at 16:05
  • post the error? with the created datasets this does at least work. did you look into phalanx's link? – Matt W. Oct 27 '19 at 15:07