I have a DataFrame like this:

    a          b    c
0   0   0.326783    1
1   1   0.356272    1
2   2   0.797407    1
3   3   0.098846    1
4   4   0.528812    1
5   5   0.913114    1
6   6   0.630039    2
7   7   0.475828    2
8   8   0.619713    2
9   9   0.756735    2
10  10  0.168544    2
11  11  0.337957    3
12  12  0.201395    3
13  13  0.272564    3
14  14  0.757490    3
15  15  0.032135    4
16  16  0.598143    4
17  17  0.150696    4
18  18  0.001403    4
19  19  0.427624    4

I want to split it randomly into 3 subgroups with given proportions (e.g. [0.5, 0.3, 0.2]), while preserving the proportion of labels in column c.

I tried a recursive approach with df.groupby('c').sample(frac=...), sampling one subgroup, then another, and so on.

The problem is that one of the subgroups ended up with no rows labeled c=3.

What is the best way to do this, respecting both the given subgroup proportions (the [0.5, 0.3, 0.2] list mentioned above) and the proportions of label c inside each sampled subgroup?

Rafael Higa
  • This is called a ***(weighted) stratified split***, e.g. [`sklearn.model_selection.train_test_split()`](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Are you using sklearn? pytorch? etc. Most ML packages have a stratified split. – smci Jul 29 '21 at 18:12

1 Answer

You should be able to use the weights parameter of the groupby sample method. This gives a weight to each row. Just use the group size as the weight:

df.groupby('c').sample(frac=0.2, weights=df.groupby('c')['c'].transform(len))

N.B. I couldn't run the code to test, but you get the general idea
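
For the three subgroups asked about in the question, a single `groupby(...).sample(...)` call only yields one sample, and repeated independent draws can overlap or leave a small label group empty. One possible alternative, offered as a minimal sketch rather than a tested recipe, is to shuffle each label group once and cut it at the cumulative proportions. The helper name `stratified_split`, the `seed` argument, and the reconstructed frame (with random values in column b) are assumptions made for illustration and do not come from the thread.

```python
import numpy as np
import pandas as pd

# Frame with the same shape as the one in the question; column b is
# filled with random values here (an assumption, not the original data).
df = pd.DataFrame({
    "a": range(20),
    "b": np.random.default_rng(0).random(20),
    "c": [1] * 6 + [2] * 5 + [3] * 4 + [4] * 5,
})


def stratified_split(frame, label, fractions, seed=None):
    """Hypothetical helper: split `frame` into len(fractions) disjoint parts,
    cutting every `label` group at the same cumulative proportions."""
    rng = np.random.default_rng(seed)
    cuts = np.cumsum(fractions)[:-1]  # e.g. [0.5, 0.8] for [0.5, 0.3, 0.2]
    parts = [[] for _ in fractions]
    for _, group in frame.groupby(label):
        # Shuffle the row labels of this group, then cut at the rounded bounds.
        idx = rng.permutation(group.index.to_numpy())
        bounds = np.round(cuts * len(idx)).astype(int)
        for part, chunk in zip(parts, np.split(idx, bounds)):
            part.extend(chunk)
    return [frame.loc[sorted(part)] for part in parts]


parts = stratified_split(df, "c", [0.5, 0.3, 0.2], seed=0)
for part in parts:
    # Label counts per subgroup, to check that no label is missing.
    print(len(part), part["c"].value_counts().sort_index().to_dict())
```

Because the cut points are rounded within each label group, the subgroup sizes only approximate the requested proportions on a frame this small, and a label group with fewer rows than there are subgroups would still leave some subgroup without that label.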

Henry Yik
mozway