Assign control vs. treatment groupings randomly based on % for more than 2 groups

Question

Piggybacking off my own previous question, "python pandas: assign control vs. treatment groupings randomly based on %".

Thanks to @MaxU, I know how to assign random control/treatment flags when there are 2 groups, but what if I have 3 groups or more?

For example:

df.head()

customer_id | Group | many other columns
ABC             1
CDE             3
BHF             2
NID             1
WKL             3
SDI             2
JSK             1
OSM             3
MPA             2
MAD             1

pd.pivot_table(df,index=['Group'],values=["customer_id"],aggfunc=lambda x: len(x.unique()))

Group 1  : 270
Group 2  : 180
Group 3  : 330

I have a great answer for the case where I only have two groups:

df['Flag'] = df.groupby('Group')['customer_id']\
             .transform(lambda x: np.random.choice(['Control','Test'], len(x), 
                                                  p=[.5,.5] if x.name==1 else [.4,.6]))

But what if I want to split it this way:

Group 1: 50% Control & 50% Test
Group 2: 40% Control & 60% Test
Group 3: 20% Control & 80% Test

@MaxU's answer is great, but unfortunately the split is not exact

d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}

df['Flag'] = df.groupby('Group')['customer_id'] \
             .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))

When I test it, I don't get exact splits:

pd.pivot_table(df,index=['Group'],values=["customer_id"],columns=['Flag'], aggfunc=lambda x: len(x.unique()))

           Control  Treatment
Group 1:    138       132
Group 2:    78        102
Group 3:    79        251

Group 1 should be 135/135.
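
For reference, the exact counts implied by these group sizes and percentages can be worked out directly. This is only a small arithmetic sketch; the names `group_sizes`, `control_share` and `targets` are made up for illustration, and rounding to the nearest integer is an assumption for cases where the target is not a whole number.

```python
# Group sizes and control shares taken from the question above
group_sizes = {1: 270, 2: 180, 3: 330}
control_share = {1: 0.5, 2: 0.4, 3: 0.2}

targets = {}
for group, n in group_sizes.items():
    # Rounding is an assumption; it only matters when n * share is not whole
    n_control = round(n * control_share[group])
    targets[group] = (n_control, n - n_control)

for group, (n_control, n_test) in targets.items():
    print(f"Group {group}: {n_control} Control / {n_test} Test")
```

So an exact split would be 135/135, 72/108 and 66/264 for groups 1, 2 and 3.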

score 2 · Answer 1 · answered Oct 03 '17 at 20:01

In [13]: df
Out[13]:
  customer_id  Group
0         ABC      1
1         CDE      3
2         BHF      2
3         NID      1
4         WKL      3
5         SDI      2
6         JSK      1
7         OSM      3
8         MPA      2
9         MAD      1

In [14]: d = {1:[.5,.5], 2:[.4,.6], 3:[.2,.8]}

In [15]: df['Flag'] = \
    ...: df.groupby('Group')['customer_id'] \
    ...:   .transform(lambda x: np.random.choice(['Control','Test'], len(x), p=d[x.name]))
    ...:

In [16]: df
Out[16]:
  customer_id  Group     Flag
0         ABC      1  Control
1         CDE      3     Test
2         BHF      2     Test
3         NID      1  Control
4         WKL      3  Control
5         SDI      2     Test
6         JSK      1     Test
7         OSM      3     Test
8         MPA      2  Control
9         MAD      1     Test
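
One caveat with this approach: `np.random.choice` draws every label independently, so the values in `p` are expected shares rather than guaranteed ones. A minimal sketch of that behaviour follows, assuming a group of 270 customers with a 50/50 split; the seed and the number of repetitions are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen only so the run is repeatable

# Repeat the 50/50 draw for a group of 270 customers many times
control_counts = [
    int((rng.choice(['Control', 'Test'], size=270, p=[.5, .5]) == 'Control').sum())
    for _ in range(200)
]

# The number of 'Control' labels scatters around 135 instead of hitting it every time
print(min(control_counts), max(control_counts))
```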

MaxU - stand with Ukraine


thank you MaxU, is there any good documentation online to learn more about this approach? – jeangelj Oct 03 '17 at 20:04
@jeangelj, what approach do you mean - `.transform()`, `lambda ...`, `np.random.choice`, something else? – MaxU - stand with Ukraine Oct 03 '17 at 20:08
lambda with the categories d[x.name] – jeangelj Oct 03 '17 at 20:08
@jeangelj, [here are some examples about dictionaries](https://www.python-course.eu/dictionaries.php) – MaxU - stand with Ukraine Oct 03 '17 at 20:16
I just finished testing, but for my first group, where I have 270 in total; I get a split of 138 Control vs. 132 Treatment instead of 135/135 – jeangelj Oct 03 '17 at 20:23
@jeangelj, yeah, `np.random` will not give you 100% exact distribution... – MaxU - stand with Ukraine Oct 03 '17 at 20:24
I see, would there be another approach that would? – jeangelj Oct 03 '17 at 20:25
@MaxU thank you! it seems that this cannot be accomplished with np.random – jeangelj Oct 04 '17 at 14:20
@MaxU, I am testing it and keep on getting this error ValueError: Length mismatch: Expected axis has 1281 elements, new values have 1282 elements – jeangelj Oct 16 '17 at 16:09

Dan Frank · Accepted Answer · 2017-10-05T15:39:00.557

It sounds like you're looking for a way to split your customer IDs into exact proportions rather than relying on chance. Here's one way to do that using pd.qcut and np.random.permutation.

In [228]: df = pd.DataFrame({'customer_id': np.random.normal(size=10000), 
                             'group': np.random.choice(['a', 'b', 'c'], size=10000)})

In [229]: proportions = {'a':[.5,.5], 'b':[.4,.6], 'c':[.2,.8]}

In [230]: df.head()
Out[230]:
   customer_id group
0       0.6547     c
1       1.4190     a
2       0.4205     a
3       2.3266     a
4      -0.5691     b

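In [231]: def assigner(gp):
     ...:     # Look up this group's proportions by its label
     ...:     group = gp['group'].iloc[0]
     ...:     # np.asarray replaces .get_values(), which was removed in pandas 1.0
     ...:     cut = np.asarray(pd.qcut(
     ...:         np.arange(gp.shape[0]),
     ...:         q=np.cumsum([0] + proportions[group]),
     ...:         labels=range(len(proportions[group]))
     ...:     ))
     ...:     # Shuffle the fixed set of labels so the assignment is random
     ...:     return pd.Series(cut[np.random.permutation(gp.shape[0])], index=gp.index, name='assignment')
     ...: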

In [232]: df['assignment'] = df.groupby('group', group_keys=False).apply(assigner)

In [233]: df.head()
Out[233]:
   customer_id group  assignment
0       0.6547     c           1
1       1.4190     a           1
2       0.4205     a           0
3       2.3266     a           1
4      -0.5691     b           0

In [234]: (df.groupby(['group', 'assignment'])
             .size()
             .unstack()
             .assign(proportion=lambda x: x[0] / (x[0] + x[1])))
Out[234]:
assignment     0     1  proportion
group
a           1659  1658      0.5002
b           1335  2003      0.3999
c            669  2676      0.2000

What's going on here?

Within each group we call the function assigner.
assigner looks up the group's proportions in the predefined dictionary and calls pd.qcut to split the rows into 0 (control) and 1 (treatment).
np.random.permutation then shuffles the assignments.
The result is stored as a new column in the original dataframe.
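
The same idea (fix the number of labels per group first, then shuffle them) can also be sketched without pd.qcut. This is not the code from the answer above, just a count-based variant under stated assumptions: the toy customer IDs, the `control_share` dictionary, the seed and the rounding rule are all made up for illustration. Because it never computes bin edges, very small groups do not run into duplicate-edge problems.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)  # illustrative seed

# Toy data with the group sizes from the question (customer IDs are invented)
sizes = {1: 270, 2: 180, 3: 330}
df = pd.DataFrame({
    'customer_id': [f'C{i:04d}' for i in range(sum(sizes.values()))],
    'Group': np.repeat(list(sizes), list(sizes.values())),
})

control_share = {1: 0.5, 2: 0.4, 3: 0.2}  # hypothetical name for the control shares

df['Flag'] = ''
for group, idx in df.groupby('Group').groups.items():
    n = len(idx)
    # Fix the number of controls up front (rounded when n * share is not whole)
    n_control = round(n * control_share[group])
    labels = np.array(['Control'] * n_control + ['Test'] * (n - n_control))
    # Shuffle the labels so that who lands in control is random
    df.loc[idx, 'Flag'] = rng.permutation(labels)

counts = df.groupby(['Group', 'Flag']).size()
print(counts.unstack())
```

Only the order of the labels is random here, so the counts come out as 135/135, 72/108 and 66/264 on every run, whatever the seed.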

Thank you very much @Dan Frank, but it doesn't assign the control/treatment flag by group — jeangelj, Oct 04 '17 at 14:00
@jeangelj I think I misunderstood your question ... so you're looking for an X%, 1-X% split within group 1, Y%, 1-Y% within group 2, etc.? — Dan Frank, Oct 04 '17 at 23:54
@jeangelj I've edited the code sample above and I believe it now handles your case — Dan Frank, Oct 16 '17 at 16:29
Hi Dan, I am getting this error suddenly: ValueError: Bin edges must be unique: array([ 0, 0, 2621], dtype=int64). You can drop duplicate edges by setting the 'duplicates' kwarg — jeangelj, Feb 15 '18 at 21:13
@jeangelj check this answer, which has some details on the problem https://stackoverflow.com/questions/20158597/how-to-qcut-with-non-unique-bin-edges The problem is likely that one of your groups only has one customer in it — Dan Frank, Feb 17 '18 at 14:53
