Given a DataFrame, whats the best way to find rows in the DataFrame that partially match a list of given values.
Currently I have a rows of given values in a DataFrame (df1), I iterate through those then apply a function to each row of another DataFrame (df2) that counts how many values in the row match the conditions, then return a subset of the second DataFrame where the count is above a certain value.
def partialMatch(row, conditions):
count = 0
if(row['ResidenceZip'] == conditions['ResidenceZip']):
count+=1
if(row['FirstName'] == conditions['FirstName']):
count +=1
if(row['LastName'] == conditions['LastName']):
count +=1
if(row['Birthday'] == conditions['Birthday']):
count+=1
return count
concat_all = []
for i, row in df1.iterrows():
c = {'ResidenceZip': row['ResidenceZip'], 'FirstName':row['FirstName'],
'LastName': row['LastName'],'Birthday': row['Birthday']}
df2['count'] = df2.apply(lambda x: partialMatch(x, c), axis = 1)
x1 = df2[df2['count']>=3]
concat_all.append(x1)
This works, but is pretty slow. Any tips on speeding this process up?
For example, running the code on the two dataframes below, the first row of df1 would return the first three rows of df2 but not the last two.
df1
FirstName|LastName | Birthday | ResidenceZip
John | Doe | 1/1/2000 | 99999
Rob | A | 1/1/2010 | 19499
df2
FirstName|LastName | Birthday | ResidenceZip | count
John | Doe | 1/1/2000 | 99999 | 3
John | Doe | 1/1/2000 | 99999 | 3
John | Doex | 1/1/2000 | 99999 | 3
Joha | Doex | 1/1/2000 | 99999 | 2
Joha | Doex | 9/9/2000 | 99999 | 1
Rob | A | 9/9/2009 | 19499 | 0