Shuffling one Column of a DataFrame By Group Efficiently

Question

I am trying to implement a permutation test on a large Pandas dataframe. The dataframe looks like the following:

    group  some_value  label
0       1           8      1
1       1           7      0
2       1           6      2
3       1           5      2
4       2           1      0
5       2           2      0
6       2           3      1
7       2           4      2
8       3           2      1
9       3           4      1
10      3           2      1
11      3           4      2

I want to group by column group, and shuffle the label column and write back to the data frame, preferably in place. The some_value column should remain intact. The result should look something like the following:

    group  some_value  label
0       1           8      1
1       1           7      2
2       1           6      2
3       1           5      0
4       2           1      1
5       2           2      0
6       2           3      0
7       2           4      2
8       3           2      1
9       3           4      2
10      3           2      1
11      3           4      1

I used np.random.permutation but found it was very slow.

df["label"] = df.groupby("group")["label"].transform(np.random.permutation)

It seems that df.sample is much faster. How can I solve this problem using df.sample() instead of np.random.permutation, and inplace?

What's wrong with `df['label'] = df.groupby("group")["label"].transform(pd.Series.sample, frac=1)`? Doesn't appear to be faster so I must be missing something... — Bill, Jul 19 '19 at 06:57

score 0 · Accepted Answer · answered Jul 19 '19 at 00:16

We can use sample. Note that this assumes the frame is already sorted by group, i.e. df = df.sort_values('group').

df['New']=df.groupby('group').label.apply(lambda x : x.sample(len(x))).values

Or we can do it by

df['New']=df.sample(len(df)).sort_values('group').label.values

BENY

Thanks! Do both methods assume that the dataframe is sorted by group? – Ryan R. Rosario Jul 19 '19 at 00:28
@RyanR.Rosario Yep, that is assumed, but you can always add df.sort_index() afterwards to get back the original order – BENY Jul 19 '19 at 00:32
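
If sorting the frame first is not desirable, the same "shuffle everything once, then line rows up by group" idea can be written against row positions, so the original row order is kept. The following is only a sketch and not part of the answer above: the helper name shuffle_within_groups and its rng parameter are made up for illustration, and it swaps df.sample for NumPy's permutation and a stable argsort.

```python
import numpy as np
import pandas as pd


def shuffle_within_groups(df, group_col, value_col, rng=None):
    """Return value_col shuffled within each group, aligned to df's current row order.

    Hypothetical helper for illustration; rng may be None, a seed, or a Generator.
    """
    rng = np.random.default_rng(rng)
    groups = df[group_col].to_numpy()
    values = df[value_col].to_numpy()

    # Row positions ordered by group; these are the slots to fill.
    order = np.argsort(groups, kind="stable")

    # One random permutation of all rows, then a stable sort by group:
    # rows stay in random order inside each group.
    perm = rng.permutation(len(df))
    shuffled = perm[np.argsort(groups[perm], kind="stable")]

    # Both index arrays are grouped identically, so slot k and source k
    # always belong to the same group.
    out = np.empty_like(values)
    out[order] = values[shuffled]
    return out


df = pd.DataFrame({
    "group":      [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "some_value": [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
    "label":      [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2],
})

# Written to a new column here so it can be compared with the original labels.
df["New"] = shuffle_within_groups(df, "group", "label", rng=0)
```

Because the returned array matches the rows as they currently are, it could equally be assigned back to df["label"], and no sort_values / sort_index round trip is needed. Whether this is actually faster than the groupby versions would need to be timed on the real data.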

score 0 · Answer 2 · answered Jul 19 '19 at 00:29

What about providing a custom transform function?

def sample(x):
    return x.sample(n=x.shape[0])

df.groupby("group")["label"].transform(sample)

This SO explanation of printing out what is passed into the custom function via the transform function is helpful.
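
The snippet above only computes the shuffled column. A minimal sketch of writing it back is shown here, under a couple of assumptions: the column name shuffled_label is invented for the example, and the custom function returns a plain NumPy array (via to_numpy()) so that the shuffled order is used positionally and does not depend on how the index of the sampled Series is handled.

```python
import pandas as pd

df = pd.DataFrame({
    "group":      [1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3],
    "some_value": [8, 7, 6, 5, 1, 2, 3, 4, 2, 4, 2, 4],
    "label":      [1, 0, 2, 2, 0, 0, 1, 2, 1, 1, 1, 2],
})


def sample(x):
    # x is the label Series of one group; shuffle it and drop the index
    # so the values are placed by position within the group.
    return x.sample(frac=1).to_numpy()


# transform returns a Series with the same index as df, so it can be assigned directly.
# "shuffled_label" is a hypothetical column name; use "label" to overwrite in place.
df["shuffled_label"] = df.groupby("group")["label"].transform(sample)

# Sanity check: each group still holds the same labels, only reordered.
for _, sub in df.groupby("group"):
    assert sorted(sub["label"]) == sorted(sub["shuffled_label"])
```

This still calls the function once per group, so with many small groups it may not be noticeably faster than np.random.permutation, which matches the observation in the comment under the question.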
