I have a dataframe of 6M+ observations, where 20 of the columns are weights that will be applied to a single score column. I.e., Wgt1 * Wgt2 * Wgt3...* Score. In addition, not each weight is applicable to every observation, so I have created 20 columns that represent a weight mask. I.e., (Wgt1*Msk1) * (Wgt2*Msk2) * (Wgt3*Msk3) ... Score. When the mask is 0, the weight is not applicable; when the mask is 1, it is applicable.
For each row in the dataframe, I want to: 1, Check 2 qualifying conditions that indicate the row should be processed 2, find the product of the weights, subject to the presence of the corresponding mask (ttl_wgt) 3, multiply this product by the score (prob) to create a final weighted score
To do this, I have created a user-defined function:
import functools
import operator
import time
def mymult(a):
ttl_wgt = float('NaN') #Initialize to NaN
if ~np.isnan(a['ID']): #condition 1, only process if an ID is present
if a['prob'] > -1.0: #condition 2, only process if our unweighted score is NOT -1.0
b = np.where(a[msks] ==1)[0] #index for which of our masks is 1?
ttl_wgt = functools.reduce(operator.mul, a[np.asarray(wgt_nms)[b]], 1)
return ttl_wgt
I ran out of memory during development, so I decided to chunk it up into 500000 rows at a time. I use a lambda function to apply to the chunk:
msks = ['Msk1','Msk2','Msk3','Msk4',...,'Msk20']
wgt_nms = ['Wgt1','Wgt2','Wgt3','Wgt4',...,'Wgt20']
print('Determining final weights...')
chunksize = 500000 #we'll operate on this many rows at a time
start_time = time.time()
ttl_wgts = [] #initialize list to hold weight products
for i in range(0,len(df),chunksize):
ttl_wgts.extend(df[i:(i+chunksize)].apply(lambda x: mymult(x), axis=1))
print("--- %s seconds ---" % (time.time() - start_time)) #Expect between 30 and 40 minutes
print('Done!')
Then I assignthe ttl_wgts list as a new column in the dataframe, and do the final product of weight * initial score.
#Initialize the fields
#Might not be necessary or evenuseful
df['ttl_wgt'] = float('NaN')
df['wgt_prob'] = float('NaN')
df['ttl_wgt'] = ttl_wgts
df['wgt_prob'] = df['ttl_wgt'] * df['prob']
I checked out a prior post on multiplying elements in a list. It was great food for thought, but I wasn't able to turn it into anything more efficient for my 6M+ observations. Are there other approaches I should be considering?
Adding an example df, as suggested
A sample of the dataframe might looks something like this, with only 3 masks/weights:
df = pd.DataFrame({'id': [999999999,136550,80010170,80010177,90002408,90002664,16207501,62992,np.nan,80010152],
'prob': [-1,0.180274382,0.448361456,0.000945058,0.005060279,0.009893078,0.169686288,0.109541453,0.117907763,0.266242921],
'Msk1': [0,1,1,1,0,0,1,0,0,0],
'Msk2': [0,0,1,0,0,0,0,1,0,0],
'Msk3': [1,0,0,0,1,1,0,0,1,1],
'Wgt1': [np.nan,0.919921875,1.08984375,1.049804688,np.nan,np.nan,np.nan,0.91015625,np.nan,0.810058594],
'Wgt2': [np.nan,1.129882813,1.120117188,0.970214844,np.nan,np.nan,np.nan,1.0703125,np.nan,0.859863281],
'Wgt3': [np.nan,1.209960938,1.23046875,1,np.nan,np.nan,np.nan,1.150390625,np.nan,0.649902344]
})
In the first observation, the prob field is -1, so the row would not be processed. In the second observation, Msk1 is turned on while Msk2 and Msk3 are turned off. Thus the final weight would be the value of Wgt1 = 0.919922. In the 3rd row, Mask1 and Msk2 are on, while Msk3 is off. Therefore the final weight would be Wgt1*Wgt2 = 1.089844*1.120117 = 1.220752.