Find Multiple Column Combination in Another Pandas Dataframe with Wildcards

Question

I am trying to perform a daily reconciliation which will check a master dataset in a pandas dataframe, for the existence various combinations in another dataset (the reconciliation key), which contains wildcards.

Here is an illustrative representation of the issue (not using real information from my rec):

Master Dataset

FirstName	LastName	Occupation	Gender
Angela	Norris	Firefighter	Female
Angela	Thompson	Teacher	Female
Ben	Johnson	Police Officer	Male
Ben	Peterson	Solicitor	Male
Charlie	Davies	Nurse	Male
Debbie	Smith	Lawyer	Female

Reconciliation Key

FirstName	LastName	Occupation	Gender
Angela	*	*	*
Ben	Johnson	*	Male
Debbie	*	*	Female

So what I am trying to do here is find in the master dataset:

Any records where FirstName = Angela (with three wildcards for LastName, Occupation, Gender)
Any records where FirstName = Ben, LastName = Johnson and Gender = Male (with one wildcard on Occupation)
Any records where FirstName = Debbie and Gender = Female (with two wildcards for LastName and Occupation)

I cannot work out how to do this task... I would normally create a unique key with the combinations of FirstName, LastName, Occupation and Gender, and then merge the two datasets, but with wildcards, this isn't working for me.

I also tried a cartesian product (which I could later filter), but the size of the dataset I am working with resulted in a memory issue, so that wasn't working either.

Ideally, the output of the check would give this table (i.e. showing rows in the master dataset that meet the criteria in the reconciliation key):

FirstName	LastName	Occupation	Gender
Angela	Norris	Firefighter	Female
Angela	Thompson	Teacher	Female
Ben	Johnson	Police Officer	Male
Debbie	Smith	Lawyer	Female

sharmu1 · Answer 1 · 2023-03-10T11:32:07.880

Assuming FirstName is always present in the reconciliation key, this should do the trick:

Setup:

import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
    'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
    'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})

recon_df = pd.DataFrame({
    'FirstName': ['Angela', 'Ben', 'Debbie'],
    'LastName': ['*', 'J', '*'],
    'Occupation': ['*', '*', '*'],
    'Gender': ['*', 'Male', 'Female'],
})

Solution:

temp_df = master_df.merge(
    recon_df,
    how='left',
    on='FirstName',
)

update_cols = [col for col in temp_df.columns if col[-2:] == '_x']
for col in update_cols:
    y_col = col[:-2] + '_y'
    temp_df[col] = np.where(
        temp_df[y_col] == '*',
        temp_df[y_col],
        temp_df[col],
    )
temp_df = temp_df.rename(columns={col: col[:-2] for col in update_cols})
temp_df = temp_df.drop(columns=[col for col in temp_df.columns if col[-2:] == '_y'])

keys = list(recon_df.columns.values)
index1 = temp_df.set_index(keys).index
index2 = recon_df.set_index(keys).index

output_df = master_df.loc[index1.isin(index2)]

Output:

  FirstName LastName   Occupation  Gender
0    Angela        N  Firefighter  Female
1    Angela        T      Teacher  Female
2       Ben        J       Police    Male
5    Debbie        S       Lawyer  Female

(Built on solution here)

I am getting an "IndexError: Boolean index has wrong length" error // My index1 and index2 are different lengths, which is creating the error as the .loc method with the boolean index from the .isin method requires index1 and index2 to have the same length. — Laithan Morisco-Tarr, Mar 10 '23 at 17:34
Odd...in my example above index2 is smaller. So long as index2 isn't bigger it should be fine. Does it work for you with my dummy data? — sharmu1, Mar 11 '23 at 08:26
I think I have finally realised the issue. My recon_df table has multiple rows for the same FirstName (so we could have "Angela * Police *", and "Angela * Teacher *". When the merge happens, this creates the issue. Do you know whether the merge could be smarter to differentiate between two "Angela" rows in the recon_df? — Laithan Morisco-Tarr, Mar 13 '23 at 13:11

score 0 · Accepted Answer · answered Mar 13 '23 at 13:43

It seems as though I was able to resolve this by defining a function that checked whether the row contained LastName = Wildcard or LastName = LastName, and Occupation = Wildcard or Occupation = Occupation.

This is my code:

Setup:

import numpy as np
import pandas as pd

master_df = pd.DataFrame({
    'FirstName': ['Angela', 'Angela', 'Ben', 'Ben', 'Charlie', 'Debbie'],
    'LastName': ['N', 'T', 'J', 'P', 'D', 'S'],
    'Occupation': ['Firefighter', 'Teacher', 'Police', 'Solicitor', 'Nurse', 'Lawyer'],
    'Gender': ['Female', 'Female', 'Male', 'Male', 'Male', 'Female'],
})

recon_df = pd.DataFrame({
    'FirstName': ['Angela', 'Ben', 'Debbie'],
    'LastName': ['*', 'J', '*'],
    'Occupation': ['*', '*', '*'],
    'Gender': ['*', 'Male', 'Female'],
})

Solution:

def matches_wildcard(row):
    recon_row = recon_df.loc[(recon_df['FirstName'] == row['FirstName']) & 
                             ((recon_df['LastName'] == '*') | (recon_df['LastName'] == row['LastName'])) & 
                             ((recon_df['Occupation'] == '*') | (recon_df['Occupation'] == row['Occupation'])) & 
                             ((recon_df['Gender'] == '*') | (recon_df['Gender'] == row['Gender']))]
    return not recon_row.empty

matched_df = master_df[master_df.apply(matches_wildcard, axis=1)]

print(matched_df)

Output:

FirstName	LastName	Occupation	Gender
Angela	N	Firefighter	Female
Ben	J	Police	Male
Debbie	S	Lawyer	Female

Find Multiple Column Combination in Another Pandas Dataframe with Wildcards

2 Answers2