
I am working on an experiment design where I need to split a dataframe df into control and treatment groups, with the split percentage depending on a pre-existing grouping.

This is the dataframe df:

df.head()

customer_id | Group | many other columns
ABC             1
CDE             1
BHF             2
NID             1
WKL             2
SDI             2

pd.pivot_table(df,index=['Group'],values=["customer_id"],aggfunc=lambda x: len(x.unique()))

Group 1  : 55394
Group 2  : 34889
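
For reference, the same per-group count of distinct customers can probably be obtained more directly with groupby and nunique. A minimal sketch, assuming a toy frame built from the six rows shown above (the real df is much larger, and the variable name unique_customers is just illustrative):

```python
import pandas as pd

# Toy stand-in for df, built from the six sample rows in the question
df = pd.DataFrame({
    "customer_id": ["ABC", "CDE", "BHF", "NID", "WKL", "SDI"],
    "Group": [1, 1, 2, 1, 2, 2],
})

# Number of distinct customer_id values in each Group
unique_customers = df.groupby("Group")["customer_id"].nunique()
print(unique_customers)
```

If these counts equal the row counts per group, each customer appears once, which matters for any row-wise random assignment.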

Now I need to add a column labeled "Flag" to df. For Group 1, I want to randomly assign 50% "Control" and 50% "Test". For Group 2, I want to randomly assign 40% "Control" and 60% "Test".

The output I am looking for:

customer_id | Group | many other columns | Flag
ABC             1                          Test
CDE             1                          Control
BHF             2                          Test
NID             1                          Test
WKL             2                          Control
SDI             2                          Test
MaxU - stand with Ukraine
jeangelj

1 Answer


We can use the numpy.random.choice() method, which draws each row's label independently with the given probabilities:

In [160]: df['Flag'] = \
     ...: df.groupby('Group')['customer_id']\
     ...:   .transform(lambda x: np.random.choice(['Control','Test'], len(x),
     ...:                                         p=[.5,.5] if x.name==1 else [.4,.6]))
     ...:

In [161]: df
Out[161]:
  customer_id  Group     Flag
0         ABC      1  Control
1         CDE      1     Test
2         BHF      2     Test
3         NID      1  Control
4         WKL      2     Test
5         SDI      2  Control
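
Because each row is drawn independently, the realized split is only approximately 50/50 or 40/60; it gets closer as the groups grow. A self-contained sketch of the same call that checks the realized shares, assuming a synthetic frame with one row per customer (the group sizes, the seed, and the helper name assign are made up for illustration):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # arbitrary seed, only so the run is repeatable

# Synthetic data: 10,000 rows in Group 1 and 10,000 rows in Group 2
df = pd.DataFrame({
    "customer_id": [f"C{i:05d}" for i in range(20_000)],
    "Group": np.repeat([1, 2], 10_000),
})

def assign(x):
    # x.name holds the group key; probabilities are [Control, Test]
    p = [0.5, 0.5] if x.name == 1 else [0.4, 0.6]
    return rng.choice(["Control", "Test"], size=len(x), p=p)

df["Flag"] = df.groupby("Group")["customer_id"].transform(assign)

# Realized share of each label within each group
shares = df.groupby("Group")["Flag"].value_counts(normalize=True).unstack()
print(shares.round(3))
```

With tiny groups like the six-row sample, the realized shares can be far from the targets, as the output above shows.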

UPDATE: for more than two groups, keep the [Control, Test] probabilities in a dict keyed by group:

In [8]: df
Out[8]:
  customer_id  Group
0         ABC      1
1         CDE      1
2         BHF      2
3         NID      1
4         WKL      2
5         SDI      2
6         XXX      3
7         XYZ      3
8         XXX      3

In [9]: d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}

In [10]: df['Flag'] = \
    ...: df.groupby('Group')['customer_id'] \
    ...:   .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
    ...:

In [11]: df
Out[11]:
  customer_id  Group     Flag
0         ABC      1     Test
1         CDE      1     Test
2         BHF      2  Control
3         NID      1  Control
4         WKL      2  Control
5         SDI      2     Test
6         XXX      3     Test
7         XYZ      3     Test
8         XXX      3     Test
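
If the experiment needs the exact percentages in each group rather than percentages that hold on average, one possible variant is to build the right number of labels per group and shuffle them. This is a sketch of a different technique (a shuffled fixed quota, not independent draws), assuming one row per customer; the names control_share and exact_split and the toy data are hypothetical:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # arbitrary seed, only so the run is repeatable

# Toy data: three groups of 10 customers each
df = pd.DataFrame({
    "customer_id": [f"C{i:03d}" for i in range(30)],
    "Group": np.repeat([1, 2, 3], 10),
})

# Target share of "Control" per group; the remainder becomes "Test"
control_share = {1: 0.5, 2: 0.4, 3: 0.2}

def exact_split(x):
    # Fixed number of Control labels for this group, then a random shuffle
    n_control = int(round(len(x) * control_share[x.name]))
    labels = np.array(["Control"] * n_control + ["Test"] * (len(x) - n_control))
    return rng.permutation(labels)

df["Flag"] = df.groupby("Group")["customer_id"].transform(exact_split)

# Label counts per group
counts = df.groupby("Group")["Flag"].value_counts().unstack(fill_value=0)
print(counts)
```

When a group size times its share is not a whole number, the rounding in exact_split decides which side gets the extra row, so the split is exact only up to one row.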
MaxU - stand with Ukraine
  • Beautiful couldn't agree more – Bharath M Shetty Oct 03 '17 at 15:54
  • @Bharathshetty, thank you! :) Appreciate your comment! – MaxU - stand with Ukraine Oct 03 '17 at 15:55
  • @MaxU, thank you very much - I am testing it right now - if I would have 3 groups instead of 2, as I already can see with my next project; how would I adjust the if/else statement since it only allows for 2 groups? If you prefer that I ask a new question for that, please let me know – jeangelj Oct 03 '17 at 19:06
  • It worked, thank you - I posted a new question about 3 groups fyi https://stackoverflow.com/questions/46552395/python-pandas-assign-control-vs-treatment-groupings-randomly-based-on-for-mo – jeangelj Oct 03 '17 at 19:56