1

I need to calculate the differences within groups for thousands of simulated experiments. Here is a simplified version of the data:

import pandas as pd

data = {
    'experiment':['exp_1','exp_1','exp_1','exp_2','exp_2','exp_3','exp_3','exp_3','exp_3'],
    'variation':['control','variation_1','variation_2','control','variation_1','control','variation_1','variation_2','variation_3'],
    'revenue': [100,120,155,50,95,300,500,350,610]
}
df = pd.DataFrame(data,columns=['experiment','variation','revenue'])

In [16]: df
Out[16]:
  experiment    variation  revenue
0      exp_1      control      100
1      exp_1  variation_1       90
2      exp_1  variation_2      155
3      exp_2      control       50
4      exp_2  variation_1       95
5      exp_3      control      300
6      exp_3  variation_1      500
7      exp_3  variation_2      250
8      exp_3  variation_3      610

Notice that each experiment does not have a fixed number number of variations--there can be just an A/B test, or an A/B/C/D test.

The goal is to calculate the diffs by comparing each variation to the control within each experiment. That would give the following results for the example data:

In [17]: df
Out[17]:
  experiment    variation  revenue  diffs
0      exp_1      control      100    NaN
1      exp_1  variation_1       90    -10
2      exp_1  variation_2      155     55
3      exp_2      control       50    NaN
4      exp_2  variation_1       95     45
5      exp_3      control      300    NaN
6      exp_3  variation_1      500    200
7      exp_3  variation_2      250    -50    
8      exp_3  variation_3      610    310

This answer is really close, but it is a cumulative diff at each point in time, rather than to a comparison to the first value in each group.

measureallthethings
  • 1,102
  • 10
  • 26

2 Answers2

0

IIUC, we can using transform and fillna

df['diff']=np.nan

df['diff']=df.loc[df.variation!='control','diff'].\
                fillna(df.revenue-df.groupby('experiment').revenue.transform('first'))
df
Out[498]: 
  experiment    variation  revenue   diff
0      exp_1      control      100    NaN
1      exp_1  variation_1       90  -10.0
2      exp_1  variation_2      155   55.0
3      exp_2      control       50    NaN
4      exp_2  variation_1       95   45.0
5      exp_3      control      300    NaN
6      exp_3  variation_1      500  200.0
7      exp_3  variation_2      250  -50.0
8      exp_3  variation_3      610  310.0
BENY
  • 317,841
  • 20
  • 164
  • 234
0

I was able to accomplish this by making a new dataframe of only rows with variation == control, then outer-merging it back in to the original dataframe and subtracting the resulting columns.

controls = df[df.variation == 'control'][['experiment','revenue']]
merged = pd.merge(controls,df,on='experiment',how='outer',suffixes=['_control',''])
df['diffs'] = merged['revenue'] - merged['revenue_control']
n3utrino
  • 1,160
  • 1
  • 8
  • 16