I need to split a dataframe df randomly into control, treatment 1 and treatment 2 groups by %, that is, 10% , 45%, 45% respectively per each day.
This is the sample dataframe,
Date Customer_ID
1. 2018-05-01 411
2. 2018-05-01 414
3. 2018-05-01 421
4. 2018-05-01 431
5. 2018-05-01 433
6. 2018-05-02 441
7. 2018-05-02 442
8. 2018-05-02 443
9. 2018-05-02 444
My final outcome should look like this:
Date Customer_ID Group
1. 2018-05-01 411 Control
2. 2018-05-01 414 Treatment 1
3. 2018-05-01 421 Treatment 1
4. 2018-05-01 431 Treatment 2
5. 2018-05-01 433 Treatment 2
6. 2018-05-02 441 Treatment 1
7. 2018-05-02 442 Treatment 2
8. 2018-05-02 443 Treatment 2
9. 2018-05-02 444 Treatment 1
For each day control, treatment 1 and treatment 2 groups have to be 10%, 45% and 45 % respectively of the total number of customers on that particular day.
Upon cross tabulating the final outcome data by date and the new group variable created, I should get the following percentages of each category per total number of customers per day:
Date Group
Control Treatment1 Treatment2
1. 2018-05-01 10 % 45% 45%
2. 2018-05-02 10 % 45% 45%
3. 2018-05-03 10 % 45% 45%
4. 2018-05-04 10 % 45% 45%
:
:
I have tried following this post on stack overflow ( Assign control vs. treatment groupings randomly based on % for more than 2 groups) but it does not give me the actual percentages I want for each day . Below is the code I used on the dataframe
df['Groups'] = df.groupby('date')['Customer_ID'].transform(lambda x: np.random.choice(['Control','treatment_1', 'treatment_2'], len(x), p=[0.1,0.45,0.45]))
I obtain the following percentages (number of counts per each category/ total number of counts per day):
Date Group
Control Treatment1 Treatment2
1. 2018-05-01 8.42 % 47.37% 44.21%
2. 2018-05-02 11.4 % 36.84% 51.75%
3. 2018-05-03 11.68 % 51.09% 37.23%
4. 2018-05-04 8.26 % 46.79% 44.95%
:
:
Is there a way to do this?