0

Given a DataFrame, whats the best way to find rows in the DataFrame that partially match a list of given values.

Currently I have a rows of given values in a DataFrame (df1), I iterate through those then apply a function to each row of another DataFrame (df2) that counts how many values in the row match the conditions, then return a subset of the second DataFrame where the count is above a certain value.

def partialMatch(row, conditions):
    count = 0
    if(row['ResidenceZip'] == conditions['ResidenceZip']):
        count+=1
    if(row['FirstName'] == conditions['FirstName']):
        count +=1
    if(row['LastName'] == conditions['LastName']):
        count +=1
    if(row['Birthday'] == conditions['Birthday']):
        count+=1
    return count

concat_all = []
for i, row in df1.iterrows():
    c = {'ResidenceZip': row['ResidenceZip'], 'FirstName':row['FirstName'], 
         'LastName': row['LastName'],'Birthday': row['Birthday']}
    df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis = 1)
    x1 = df2[df2['count']>=3]
    concat_all.append(x1)

This works, but is pretty slow. Any tips on speeding this process up?

For example, running the code on the two dataframes below, the first row of df1 would return the first three rows of df2 but not the last two.

df1
    FirstName|LastName | Birthday | ResidenceZip 
    John     |  Doe    | 1/1/2000 |  99999
    Rob      |  A      | 1/1/2010 |  19499

df2
    FirstName|LastName | Birthday | ResidenceZip | count
    John     |  Doe    | 1/1/2000 |  99999       | 3
    John     |  Doe    | 1/1/2000 |  99999       | 3
    John     |  Doex   | 1/1/2000 |  99999       | 3
    Joha     |  Doex   | 1/1/2000 |  99999       | 2
    Joha     |  Doex   | 9/9/2000 |  99999       | 1
    Rob      |  A      | 9/9/2009 |  19499       | 0

2 Answers2

1

Not sure if there's a way around looping over at least one DataFrame, but here's one option that might speed things up. It does allow for the accidental comparison of FirstName with LastName, though that can be avoided by adding a unique prefix to the values (like '@' for first name and '&' for last name)

import numpy as np

s1 = [set(x) for x in df1.values]
s2 = [set(x) for x in df2.values]
masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
concat_all = [df2[m] for m in masks]

Output concat_all

[  FirstName LastName  Birthday  ResidenceZip
 0      John      Doe  1/1/2000         99999
 1      John      Doe  1/1/2000         99999
 2      John     Doex  1/1/2000         99999,
   FirstName LastName  Birthday  ResidenceZip
 5       Rob        A  9/9/2009         19499]

Timings

def Alollz(df1, df2):
    s1 = [set(x) for x in df1.values]
    s2 = [set(x) for x in df2.values]
    masks = np.reshape([len(x & y) >= 3 for x in s1 for y in s2], (len(df1), -1))
    concat_all = [df2[m] for m in masks]
    return concat_all

def SharpObject(df1, df2):
    concat_all = []
    for i, row in df1.iterrows():
        c = {'ResidenceZip': row['ResidenceZip'], 'FirstName':row['FirstName'], 
             'LastName': row['LastName'],'Birthday': row['Birthday']}
        df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis = 1)
        x1 = df2[df2['count']>=3]
        concat_all.append(x1)
    return concat_all

%timeit Alollz(df1, df2)
#785 µs ± 5.26 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

%timeit SharpObject(df1, df2)
#3.56 ms ± 44.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

And larger:

# you should never append dfs like this in a loop
for i in range(7):
    df1 = df1.append(df1)
    df2 = df2.append(df2)

%timeit Alollz(df1, df2)
#132 ms ± 248 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit SharpObject(df1, df2)
#6.88 s ± 11.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
ALollz
  • 57,915
  • 7
  • 66
  • 89
1

Using the numpy isin function:

df1_vals = df1.values
df2_vals = df2.values
df1_rows = range(df1_vals.shape[0])

concat_all = \
    [df2[np.add.reduce(np.isin(df2_vals, df1_vals[row]), axis=1) >= 3] for row in df1_rows]

Here are the dataframes for setup:

df1 = pd.DataFrame({'FirstName': ['John', 'Rob'],
                    'LastName': ['Doe', 'A'],
                    'Birthday': ['1/1/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 19499]})

df2 = pd.DataFrame({'FirstName': ['John', 'John', 'John', 'Joha', 'Joha', 'Rob'],
                    'LastName': ['Doe', 'Doe', 'Doex', 'Doex', 'Doex', 'A'],
                    'Birthday': ['1/1/2000', '1/1/2000', '1/1/2000', '1/1/2000', '9/9/2000', '9/9/2009'],
                    'ResidenceZip': [99999, 99999, 99999, 99999, 99999, 19499]})
b2002
  • 914
  • 1
  • 6
  • 10