Sampling a DataFrame in n groups

Question

I have a DataFrame like this:

    a          b    c
0   0   0.326783    1
1   1   0.356272    1
2   2   0.797407    1
3   3   0.098846    1
4   4   0.528812    1
5   5   0.913114    1
6   6   0.630039    2
7   7   0.475828    2
8   8   0.619713    2
9   9   0.756735    2
10  10  0.168544    2
11  11  0.337957    3
12  12  0.201395    3
13  13  0.272564    3
14  14  0.757490    3
15  15  0.032135    4
16  16  0.598143    4
17  17  0.150696    4
18  18  0.001403    4
19  19  0.427624    4

Then I want to split it randomly into 3 subgroups with given proportions (e.g. [0.5, 0.3, 0.2]), while respecting the proportion of labels in column c.

I tried a recursion with df.groupby('c').sample(frac=...), sampling one group, then another from what was left, and so on.

The problem is that one of the subgroups didn't get any row with label c=3.

What is the best way of doing it, respecting both given proportions of the subgroups (my [0.5, 0.3, 0.2] list mentioned above) and also proportions of label c inside each of the sampled subgroups?

This is called a ***(weighted) stratified split***, e.g. [`sklearn.model_selection.train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Are you using sklearn? pytorch? etc. Most ML packages have a stratified split. — smci, Jul 29 '21 at 18:12
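As a rough sketch of what the comment suggests, `train_test_split` can be chained twice with `stratify` to get three parts. The toy frame below only mimics the label counts from the question (the `b` values are random, not the original ones), and the names `part1`, `part2`, `part3`, `n1`, `n2` are made up for illustration. Row counts are passed as integers to avoid rounding surprises with fractions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with the same label counts as in the question (6, 5, 4, 5).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "a": range(20),
    "b": rng.random(20),
    "c": [1] * 6 + [2] * 5 + [3] * 4 + [4] * 5,
})

# Target sizes for the first two parts; the third part gets the remainder.
n1 = round(0.5 * len(df))  # 10 rows
n2 = round(0.3 * len(df))  # 6 rows

# First cut: n1 rows, stratified on column c.
part1, rest = train_test_split(df, train_size=n1, stratify=df["c"], random_state=0)

# Second cut: split what is left into n2 rows and the remainder, again stratified.
part2, part3 = train_test_split(rest, train_size=n2, stratify=rest["c"], random_state=0)

for name, part in [("part1", part1), ("part2", part2), ("part3", part3)]:
    print(name, len(part), part["c"].value_counts().sort_index().to_dict())
```

With label groups this small the per-label proportions can only be approximate, and `train_test_split` raises a `ValueError` when a label has fewer than 2 rows at the moment it is stratified on.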

score 1 · Answer 1 · edited Jul 29 '21 at 17:55

You should be able to use the weights parameter of the groupby sample method. This gives a weight to each row. Just use the group size as the weight:

df.groupby('c').sample(frac=0.2, weights=df.groupby('c')['c'].transform(len))

N.B. I couldn't run the code to test, but you get the general idea
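If the goal is to partition every row into the three subgroups (rather than draw a single sample), one pandas/NumPy-only alternative is to shuffle the rows inside each label and cut each label at the cumulative proportions. This is a sketch of a different technique, not the `weights` approach above. The helper name `stratified_partition` is hypothetical, the toy frame only reproduces the label counts from the question, and a unique index is assumed.

```python
import numpy as np
import pandas as pd

def stratified_partition(df, label, fracs, seed=0):
    """Split df into len(fracs) disjoint parts, cutting each label group
    at the cumulative proportions. Assumes df has a unique index."""
    rng = np.random.default_rng(seed)
    bounds = np.cumsum(fracs)[:-1]  # cut points, e.g. [0.5, 0.8]
    assignment = pd.Series(-1, index=df.index, dtype=int)
    for _, idx in df.groupby(label).groups.items():
        shuffled = rng.permutation(idx.to_numpy())
        # Relative position of each shuffled row inside its label, in (0, 1).
        pos = (np.arange(len(shuffled)) + 0.5) / len(shuffled)
        # The subgroup number is the count of cut points strictly below pos.
        assignment.loc[shuffled] = np.searchsorted(bounds, pos)
    return [df[assignment == k] for k in range(len(fracs))]

# Toy frame with the same label counts as in the question (6, 5, 4, 5).
df = pd.DataFrame({
    "a": range(20),
    "b": np.random.default_rng(1).random(20),
    "c": [1] * 6 + [2] * 5 + [3] * 4 + [4] * 5,
})

parts = stratified_partition(df, "c", [0.5, 0.3, 0.2])
for k, part in enumerate(parts):
    print(k, len(part), part["c"].value_counts().sort_index().to_dict())
```

Because each label is cut separately, every subgroup receives rows from every label as long as the label has enough rows, but the overall subgroup sizes only approximate the requested proportions when the label groups are small.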

edited Jul 29 '21 at 17:55

Henry Yik

answered Jul 29 '21 at 17:51

mozway
