I need to calculate the differences within groups for thousands of simulated experiments. Here is a simplified version of the data:
import pandas as pd
data = {
'experiment':['exp_1','exp_1','exp_1','exp_2','exp_2','exp_3','exp_3','exp_3','exp_3'],
'variation':['control','variation_1','variation_2','control','variation_1','control','variation_1','variation_2','variation_3'],
'revenue': [100,120,155,50,95,300,500,350,610]
}
df = pd.DataFrame(data,columns=['experiment','variation','revenue'])
In [16]: df
Out[16]:
experiment variation revenue
0 exp_1 control 100
1 exp_1 variation_1 90
2 exp_1 variation_2 155
3 exp_2 control 50
4 exp_2 variation_1 95
5 exp_3 control 300
6 exp_3 variation_1 500
7 exp_3 variation_2 250
8 exp_3 variation_3 610
Notice that each experiment does not have a fixed number number of variations--there can be just an A/B test, or an A/B/C/D test.
The goal is to calculate the diffs by comparing each variation to the control within each experiment. That would give the following results for the example data:
In [17]: df
Out[17]:
experiment variation revenue diffs
0 exp_1 control 100 NaN
1 exp_1 variation_1 90 -10
2 exp_1 variation_2 155 55
3 exp_2 control 50 NaN
4 exp_2 variation_1 95 45
5 exp_3 control 300 NaN
6 exp_3 variation_1 500 200
7 exp_3 variation_2 250 -50
8 exp_3 variation_3 610 310
This answer is really close, but it is a cumulative diff at each point in time, rather than to a comparison to the first value in each group.