
I have a dataframe of 6M+ observations, where 20 of the columns are weights that will be applied to a single score column, i.e. Wgt1 * Wgt2 * Wgt3 ... * Score. In addition, not every weight applies to every observation, so I have created 20 columns that represent a weight mask, i.e. (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) ... Score. When the mask is 0, the weight is not applicable; when the mask is 1, it is applicable.

For each row in the dataframe, I want to:

1. Check 2 qualifying conditions that indicate the row should be processed
2. Find the product of the weights, subject to the presence of the corresponding mask (ttl_wgt)
3. Multiply this product by the score (prob) to create a final weighted score

To do this, I have created a user-defined function:

import functools
import operator
import time

import numpy as np

def mymult(a):
    ttl_wgt = float('NaN')  # initialize to NaN
    if not np.isnan(a['ID']):  # condition 1: only process if an ID is present
        if a['prob'] > -1.0:  # condition 2: only process if the unweighted score is NOT -1.0
            b = np.where(a[msks] == 1)[0]  # indices of the masks that are 1
            ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
    return ttl_wgt

I ran out of memory during development, so I decided to chunk it up, 500000 rows at a time, and apply the function to each chunk:

msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']
print('Determining final weights...')
chunksize = 500000  # we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = []  # initialize list to hold the weight products
for i in range(0, len(df), chunksize):
    ttl_wgts.extend(df[i:(i + chunksize)].apply(mymult, axis=1))
print("--- %s seconds ---" % (time.time() - start_time))  # expect between 30 and 40 minutes
print('Done!')

Then I assign the ttl_wgts list as a new column in the dataframe, and take the final product of weight * initial score.

#Initialize the fields
#Might not be necessary or even useful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')

df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']

I checked out a prior post on multiplying elements in a list. It was great food for thought, but I wasn't able to turn it into anything more efficient for my 6M+ observations. Are there other approaches I should be considering?

Adding an example df, as suggested

A sample of the dataframe might look something like this, with only 3 masks/weights:

df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152], 
                   'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
                   'Msk1': [0,1,1,1,0,0,1,0,0,0],
                   'Msk2': [0,0,1,0,0,0,0,1,0,0],
                   'Msk3': [1,0,0,0,1,1,0,0,1,1],
                   'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
                   'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
                   'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]
                   })

In the first observation, the prob field is -1, so the row would not be processed. In the second observation, Msk1 is turned on while Msk2 and Msk3 are turned off, so the final weight is the value of Wgt1 = 0.919922. In the third row, Msk1 and Msk2 are on while Msk3 is off, so the final weight is Wgt1*Wgt2 = 1.089844*1.120117 = 1.220753.
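The hand-computed weights above can be checked with a small sketch of the intended per-row logic, reduced to rows 2 and 3 of the sample frame and only the relevant columns (this is just a verification, not the production code):

```python
import numpy as np
import pandas as pd

# rows 2 and 3 of the sample frame, reduced to the relevant columns
df = pd.DataFrame({
    'Msk1': [1, 1], 'Msk2': [0, 1], 'Msk3': [0, 0],
    'Wgt1': [0.919921875, 1.08984375],
    'Wgt2': [1.129882813, 1.120117188],
    'Wgt3': [1.209960938, 1.23046875],
})
msks = ['Msk1', 'Msk2', 'Msk3']
wgt_nms = ['Wgt1', 'Wgt2', 'Wgt3']

# product of only those weights whose mask is 1, row by row
ttl = [np.prod(row[wgt_nms].values[row[msks].values == 1])
       for _, row in df.iterrows()]
print(ttl)  # first ~0.919922 (Wgt1 alone), second ~1.220753 (Wgt1*Wgt2)
```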

Amw 5G
  • This is a pretty confusing question and would be much better if you provided a very small subset of your data with a desired outcome. See how to make a [good reproducible pandas example](http://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Ted Petrou Dec 27 '16 at 13:20
  • The question was already getting kinda long, so I went back and forth on including an example. Thanks for the reminder, I've added one that hopefully clarifies the situation. – Amw 5G Dec 27 '16 at 13:54

1 Answer


IIUC:

Fill the masked-out weights in with 1. Since 1 is the multiplicative identity, you can then multiply everything together with no impact from the masked ones. That's the trick; apply it as needed.
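The idea on a single toy row (made-up numbers, not from the question's data):

```python
import numpy as np

wgts = np.array([0.92, 1.13, 1.21])
mask = np.array([1, 0, 0])

# masked-off weights are replaced by 1, so they drop out of the product
ttl = np.where(mask, wgts, 1).prod()
print(ttl)  # 0.92 -- only the unmasked weight survives
```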

create msk

msk = df.filter(like='Msk')
print(msk)

   Msk1  Msk2  Msk3
0     0     0     1
1     1     0     0
2     1     1     0
3     1     0     0
4     0     0     1
5     0     0     1
6     1     0     0
7     0     1     0
8     0     0     1
9     0     0     1

create wgt

wgt = df.filter(like='Wgt')
print(wgt)

       Wgt1      Wgt2      Wgt3
0       NaN       NaN       NaN
1  0.919922  1.129883  1.209961
2  1.089844  1.120117  1.230469
3  1.049805  0.970215  1.000000
4       NaN       NaN       NaN
5       NaN       NaN       NaN
6       NaN       NaN       NaN
7  0.910156  1.070312  1.150391
8       NaN       NaN       NaN
9  0.810059  0.859863  0.649902

create new_weight

new_wgt = np.where(msk, wgt, 1)
print(new_wgt)

[[ 1.          1.                 nan]
 [ 0.91992188  1.          1.        ]
 [ 1.08984375  1.12011719  1.        ]
 [ 1.04980469  1.          1.        ]
 [ 1.          1.                 nan]
 [ 1.          1.                 nan]
 [        nan  1.          1.        ]
 [ 1.          1.0703125   1.        ]
 [ 1.          1.                 nan]
 [ 1.          1.          0.64990234]]

final prod_wgt

prod_wgt = pd.Series(new_wgt.prod(1), wgt.index)
print(prod_wgt)

0         NaN
1    0.919922
2    1.220753
3    1.049805
4         NaN
5         NaN
6         NaN
7    1.070312
8         NaN
9    0.649902
dtype: float64
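To fold in the question's two qualifying conditions (an id is present, prob is not -1), you can mask the product afterwards with Series.where. A sketch on a 3-row slice of the sample frame (column name id as in the sample, not ID; this combines the steps above, it is not additional required machinery):

```python
import numpy as np
import pandas as pd

# 3-row slice of the sample frame
df = pd.DataFrame({
    'id':   [999999999, 136550, np.nan],
    'prob': [-1.0, 0.180274382, 0.117907763],
    'Msk1': [0, 1, 0], 'Msk2': [0, 0, 0], 'Msk3': [1, 0, 1],
    'Wgt1': [np.nan, 0.919921875, np.nan],
    'Wgt2': [np.nan, 1.129882813, np.nan],
    'Wgt3': [np.nan, 1.209960938, np.nan],
})

msk = df.filter(like='Msk')
wgt = df.filter(like='Wgt')

# masked weights become 1 and drop out of the row-wise product
prod_wgt = pd.Series(np.where(msk, wgt, 1).prod(1), df.index)

# the two qualifying conditions; rows failing either get NaN
qualify = df['id'].notnull() & (df['prob'] > -1.0)
df['ttl_wgt'] = prod_wgt.where(qualify)
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
```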
piRSquared
  • I hadn't thought about passing the mask as a condition to the weights via numpy.where. But it makes complete sense now. This takes my ~30 minute kludge down to <10 seconds. – Amw 5G Dec 27 '16 at 17:30