
This is a weird one: I have 3 dataframes. "prov_data" contains a provider id and counts per region and category (i.e. how many times that provider interacted with each region and category).

from pandas import DataFrame

prov_data = DataFrame({'aprov_id': [1122, 3344, 5566, 7788],
                       'prov_region_1': [0, 0, 4, 0], 'prov_region_2': [2, 0, 0, 0],
                       'prov_region_3': [0, 1, 0, 1], 'prov_cat_1': [0, 2, 0, 0],
                       'prov_cat_2': [1, 0, 3, 0], 'prov_cat_3': [0, 0, 0, 4],
                       'prov_cat_4': [0, 3, 0, 0]})

"tender_data" which contains the same but for tenders.

tender_data = DataFrame({'atender_id': ['AA12', 'BB33', 'CC45'],
                         'ten_region_1': [0, 0, 1], 'ten_region_2': [0, 1, 0],
                         'ten_region_3': [1, 1, 0], 'ten_cat_1': [1, 0, 0],
                         'ten_cat_2': [0, 1, 0], 'ten_cat_3': [0, 1, 0],
                         'ten_cat_4': [0, 0, 1]})

And finally a "no_match" DF which contains forbidden matches between providers and tenders.

no_match = DataFrame({'prov_id': [1122, 3344, 5566],
                      'tender_id': ['AA12', 'BB33', 'CC45']})

I need to do the following: create a new df that appends the rows of the prov_data & tender_data DataFrames if they (1) match on one or more categories (i.e. the same category is > 0 in both), AND (2) match on one or more regions, AND (3) are not on the no_match list.

So that would give me this DF:

df = DataFrame({'aprov_id': [1122, 3344, 7788],
                'prov_region_1': [0, 0, 0], 'prov_region_2': [2, 0, 0],
                'prov_region_3': [0, 1, 1], 'prov_cat_1': [0, 2, 0],
                'prov_cat_2': [1, 0, 0], 'prov_cat_3': [0, 0, 4],
                'prov_cat_4': [0, 3, 0],
                'atender_id': ['BB33', 'AA12', 'BB33'],
                'ten_region_1': [0, 0, 0], 'ten_region_2': [1, 0, 1],
                'ten_region_3': [1, 1, 1], 'ten_cat_1': [0, 1, 0],
                'ten_cat_2': [1, 0, 1], 'ten_cat_3': [1, 0, 1],
                'ten_cat_4': [0, 0, 0]})

2 Answers

code

import pandas as pd

# the first column of each dataframe is the id column;
# we'll use these arrays several times
tid = tender_data['atender_id'].values
pid = prov_data['aprov_id'].values
# select the cat and region columns by name with filter;
# positional slicing (values[:, 1:5] and values[:, 5:]) also works
# here, but only because the columns happen to sort alphabetically
pc = prov_data.filter(like='_cat_').values       # shape (4, 4)
tc = tender_data.filter(like='_cat_').values     # shape (3, 4)
pr = prov_data.filter(like='_region_').values    # shape (4, 3)
tr = tender_data.filter(like='_region_').values  # shape (3, 3)

# build an m x n mask, where m = number of rows in the prov df
# and n = number of rows in the tender df; 1 marks an allowed
# pair, 0 a forbidden one
nm = no_match.groupby(['prov_id', 'tender_id']).size().unstack()
nm = nm.reindex(index=pid, columns=tid)  # reindex_axis was removed from pandas
nm = (~nm.fillna(0).astype(bool).values) * 1
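
# for the example data this yields (1 = allowed pair):
# nm = [[0 1 1]    1122 may not match AA12
#       [1 0 1]    3344 may not match BB33
#       [1 1 0]    5566 may not match CC45
#       [1 1 1]]   7788 has no forbidden tender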

# the dot product of the cat arrays gives an m x n array that is
# positive wherever a provider and a tender share at least one
# category; the same holds for the region arrays. multiplying
# both with the no_match mask keeps only the pairs that satisfy
# all three conditions
a = pd.DataFrame(pc.dot(tc.T) * pr.dot(tr.T) * nm > 0, pid, tid)
# keep the True cells and pull out their (prov_id, tender_id) pairs
a = a.mask(~a).stack().index

fp = a.get_level_values(0)  # matched provider ids
ft = a.get_level_values(1)  # matched tender ids

# align the matched provider and tender rows side by side
pd.concat([
        prov_data.set_index('aprov_id').loc[fp].reset_index(),
        tender_data.set_index('atender_id').loc[ft].reset_index()
    ], axis=1)


   index  prov_cat_1  prov_cat_2  prov_cat_3  prov_cat_4  prov_region_1  \
0   1122           0           1           0           0              0   
1   3344           2           0           0           3              0   
2   7788           0           0           4           0              0   

   prov_region_2  prov_region_3 atender_id  ten_cat_1  ten_cat_2  ten_cat_3  \
0              2              0       BB33          0          1          1   
1              0              1       AA12          1          0          0   
2              0              1       BB33          0          1          1   

   ten_cat_4  ten_region_1  ten_region_2  ten_region_3  
0          0             0             1             1  
1          0             0             0             1  
2          0             0             1             1  

explanation

  • use dot products to determine matches (see the sketch below)
  • many other things I'll try to explain more later
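
The dot-product trick in a nutshell, as a minimal standalone sketch (the toy vectors here are made up for illustration, not taken from the question): for non-negative count vectors, the dot product is positive exactly when at least one position is positive in both.

import numpy as np

# toy category counts for one provider and one tender
p = np.array([0, 2, 0, 0])  # provider active in category 2
t = np.array([0, 1, 1, 0])  # tender asks for categories 2 and 3

print(p.dot(t) > 0)  # True: they share category 2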
  • thanks! this worked perfectly, the original dfs are 600K x 75 so this is a very good option. – castor Nov 10 '16 at 20:03
  • I'm actually getting an exit code 137 after running it with the original dataframes. I think the main issue is the no_match DF, which has 600K rows. The provider matrix has 43K and the tender matrix 140K... any ideas @piRSquared? – castor Nov 10 '16 at 21:59
  • http://stackoverflow.com/a/1041309/2336654 if that link is correct, I'm guessing you are running on a server where an admin killed your script. – piRSquared Nov 10 '16 at 22:01

Straightforward solution that uses only "standard" pandas techniques.

import pandas as pd

# dummy key for a full cross join of providers and tenders
prov_data['tkey'] = 1
tender_data['tkey'] = 1
df1 = pd.merge(prov_data, tender_data, how='outer', on='tkey')
# attach the forbidden tender (if any) for each provider
df1 = pd.merge(df1, no_match, how='outer', left_on='aprov_id', right_on='prov_id')
# mark forbidden pairs ...
df1['dropData'] = df1.apply(lambda x: x['tender_id'] == x['atender_id'], axis=1)
# ... and rows that fail the category/region conditions
df1['dropData'] = df1.apply(lambda x: x['dropData'] or not (
                                     ((x['prov_cat_1'] > 0 and x['ten_cat_1'] > 0) or
                                      (x['prov_cat_2'] > 0 and x['ten_cat_2'] > 0) or
                                      (x['prov_cat_3'] > 0 and x['ten_cat_3'] > 0) or
                                      (x['prov_cat_4'] > 0 and x['ten_cat_4'] > 0)) and
                                     ((x['prov_region_1'] > 0 and x['ten_region_1'] > 0) or
                                      (x['prov_region_2'] > 0 and x['ten_region_2'] > 0) or
                                      (x['prov_region_3'] > 0 and x['ten_region_3'] > 0))), axis=1)
df1 = df1[~df1.dropData]
df1 = df1[['aprov_id', 'atender_id', 'prov_cat_1', 'prov_cat_2', 'prov_cat_3',
           'prov_cat_4', 'prov_region_1', 'prov_region_2', 'prov_region_3',
           'ten_cat_1', 'ten_cat_2', 'ten_cat_3', 'ten_cat_4', 'ten_region_1',
           'ten_region_2', 'ten_region_3']].reset_index(drop=True)

# align column order before comparing against the expected df
print(df1.equals(df[df1.columns]))

First we do a full cross product of both dataframes and merge that with the no_match dataframe, then add a boolean column to mark all rows to be dropped.
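As an aside (not part of the original answer): on pandas 1.2 or later the dummy-key trick can be replaced by the built-in cross merge, which produces the same cross product without the throwaway 'tkey' column:

# equivalent cross join on pandas >= 1.2
df1 = pd.merge(prov_data, tender_data, how='cross')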

The boolean column is assigned by two lambda functions covering all the necessary conditions; then we just take all rows where that column is False.
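
Since a row-wise apply can be slow on large frames, the same conditions can also be written as vectorized boolean masks. A minimal sketch under the same column names (an alternative added here for illustration, not part of the original answer):

# vectorized equivalent of the dropData logic, applied to the
# merged cross product before any filtering
cat_match = pd.Series(False, index=df1.index)
for i in range(1, 5):
    cat_match |= (df1[f'prov_cat_{i}'] > 0) & (df1[f'ten_cat_{i}'] > 0)

reg_match = pd.Series(False, index=df1.index)
for i in range(1, 4):
    reg_match |= (df1[f'prov_region_{i}'] > 0) & (df1[f'ten_region_{i}'] > 0)

forbidden = df1['tender_id'] == df1['atender_id']
df1 = df1[cat_match & reg_match & ~forbidden]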

This solution isn't very resource-friendly due to the merge operations (the cross join materialises every provider/tender pair), so if your data is very large it may be impractical.

  • thanks! yes, my original DF has 600K rows and 76 columns, so the first option is more viable. Like you said, this is a good straightforward option. – castor Nov 10 '16 at 20:02