1

I need to split a dataframe df randomly into control, treatment 1 and treatment 2 groups by %, that is, 10% , 45%, 45% respectively per each day.

This is the sample dataframe,

    Date          Customer_ID

 1. 2018-05-01         411
 2. 2018-05-01         414
 3. 2018-05-01         421
 4. 2018-05-01         431
 5. 2018-05-01         433
 6. 2018-05-02         441
 7. 2018-05-02         442
 8. 2018-05-02         443
 9. 2018-05-02         444

My final outcome should look like this:

      Date       Customer_ID    Group

 1. 2018-05-01       411         Control
 2. 2018-05-01       414         Treatment 1
 3. 2018-05-01       421         Treatment 1
 4. 2018-05-01       431         Treatment 2
 5. 2018-05-01       433         Treatment 2
 6. 2018-05-02       441         Treatment 1
 7. 2018-05-02       442         Treatment 2
 8. 2018-05-02       443         Treatment 2
 9. 2018-05-02       444         Treatment 1

For each day control, treatment 1 and treatment 2 groups have to be 10%, 45% and 45 % respectively of the total number of customers on that particular day.

Upon cross tabulating the final outcome data by date and the new group variable created, I should get the following percentages of each category per total number of customers per day:

  Date                          Group


                  Control       Treatment1     Treatment2

 1. 2018-05-01      10 %         45%           45%
 2. 2018-05-02      10 %         45%           45%
 3. 2018-05-03      10 %         45%           45%
 4. 2018-05-04      10 %         45%           45%
      :
      :

I have tried following this post on stack overflow ( Assign control vs. treatment groupings randomly based on % for more than 2 groups) but it does not give me the actual percentages I want for each day . Below is the code I used on the dataframe

df['Groups'] = df.groupby('date')['Customer_ID'].transform(lambda x: np.random.choice(['Control','treatment_1', 'treatment_2'], len(x),  p=[0.1,0.45,0.45]))

I obtain the following percentages (number of counts per each category/ total number of counts per day):

  Date                          Group


                  Control       Treatment1     Treatment2

 1. 2018-05-01      8.42 %      47.37%          44.21%
 2. 2018-05-02      11.4 %      36.84%          51.75%
 3. 2018-05-03      11.68 %     51.09%          37.23%
 4. 2018-05-04       8.26 %     46.79%          44.95%
      :
      :

Is there a way to do this?

Nicola M
  • 31
  • 3
  • 1
    Possible duplicate of [How to split data into 3 sets (train, validation and test)?](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test) – Chris Feb 22 '19 at 07:08
  • Is the assigned group random (as long as the final group assignment satisfies the group sizes) or is there a logic behind that? – Lomtrur Feb 22 '19 at 07:11
  • Yes, the assignment is random, but note that, for **each day** the final group assignment has to satisfy the group sizes – Nicola M Feb 22 '19 at 07:23
  • @Chris I have looked at this https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test and it does not give the output that I require, hence not a duplicate. Thanks. – Nicola M Feb 22 '19 at 08:44
  • @NicolaM On average, how many unique customers do you expect to have? Also, is there a chance than a customer ID will appear twice for the same day? – JimmyA Feb 22 '19 at 14:17
  • @RockHardRacoon, no there is no chance that the customer ID will appear twice for the same day. On average, I expect to have 105 unique customers. Thanks. – Nicola M Feb 25 '19 at 07:15

0 Answers0