
I have a dataframe which I want to group by the grp1 and grp2 columns and then draw a random sample from every group, where the sample size for each group comes from the how_many column.

This is my sample data:

   grp1  grp2  how_many    val
0     a     1         2  2993
1     a     1         2  8244
2     a     2         1  7148
3     a     1         2  5326
4     a     3         2  5577
5     a     3         2  5651
6     a     1         2  6297
7     a     2         1  2657
8     a     2         1  9774
9     a     1         2  4075
10    a     3         2  6780
11    b     1         1  1765
12    b     1         1  5592
13    b     1         1  9936
14    b     2         4  4324
15    b     2         4  6823
16    b     2         4  9184
17    b     2         4  7498
18    b     2         4  3810
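
For reproducibility, this sample data can be recreated with (values copied from the table above):

import pandas as pd

df = pd.DataFrame({
    'grp1': ['a'] * 11 + ['b'] * 8,
    'grp2': [1, 1, 2, 1, 3, 3, 1, 2, 2, 1, 3, 1, 1, 1, 2, 2, 2, 2, 2],
    'how_many': [2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 4, 4, 4, 4, 4],
    'val': [2993, 8244, 7148, 5326, 5577, 5651, 6297, 2657, 9774, 4075,
            6780, 1765, 5592, 9936, 4324, 6823, 9184, 7498, 3810],
})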

This is the expected result (random, of course):

  grp1  grp2  how_many    val
0    a     1         2  2993
1    a     1         2  5326
2    a     2         1  9774
3    a     3         2  6780
4    a     3         2  5651
5    b     1         1  5592
6    b     2         4  6823
7    b     2         4  9184
8    b     2         4  7498
9    b     2         4  3810

My approach was to follow these instructions; however, in my case I do not have a fixed sample size: it varies based on a column value.

I also tried to use a MultiIndex on the groupby columns, but got MemoryError: Unable to allocate 107. GiB for an array with shape (57244869081,) and data type int16. And the data above is just a small sample of my full data.

Any help would be appreciated.


1 Answer


Both solutions work if the column how_many has the same value within each group.

Create a lambda function with DataFrame.sample and call it per group:

df1 = (df.groupby(['grp1','grp2'])
         .apply(lambda x: x.sample(x['how_many'].iat[0]))
         .reset_index(drop=True))
print(df1)
  grp1  grp2  how_many   val
0    a     1         2  2993
1    a     1         2  6297
2    a     2         1  9774
3    a     3         2  5651
4    a     3         2  5577
5    b     1         1  1765
6    b     2         4  6823
7    b     2         4  4324
8    b     2         4  7498
9    b     2         4  9184
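
If reproducible draws are needed, a random_state can be passed to DataFrame.sample inside the lambda; a minimal variant of the same approach, where the seed value 42 is only an example:

df1 = (df.groupby(['grp1','grp2'])
         .apply(lambda x: x.sample(x['how_many'].iat[0], random_state=42))
         .reset_index(drop=True))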

An out-of-the-box solution for improved performance:

First shuffle all rows with frac=1 in DataFrame.sample, then create a counter per group with GroupBy.cumcount and compare it to how_many with Series.le (<=); last, if necessary, sort by the group columns:

df1 = df.sample(frac=1)
df1 = df1[df1.groupby(['grp1','grp2']).cumcount().add(1).le(df1['how_many'])]
df1 = df1.sort_values(['grp1','grp2'])
print(df1)
   grp1  grp2  how_many   val
0     a     1         2  2993
1     a     1         2  8244
2     a     2         1  7148
5     a     3         2  5651
4     a     3         2  5577
13    b     1         1  9936
16    b     2         4  9184
17    b     2         4  7498
15    b     2         4  6823
14    b     2         4  4324

Details:

First, the row order is randomly shuffled:

df1 = df.sample(frac=1)
print(df1)
   grp1  grp2  how_many   val
15    b     2         4  6823
18    b     2         4  3810
10    a     3         2  6780
11    b     1         1  1765
14    b     2         4  4324
2     a     2         1  7148
16    b     2         4  9184
7     a     2         1  2657
9     a     1         2  4075
1     a     1         2  8244
13    b     1         1  9936
17    b     2         4  7498
8     a     2         1  9774
4     a     3         2  5577
12    b     1         1  5592
0     a     1         2  2993
5     a     3         2  5651
6     a     1         2  6297
3     a     1         2  5326

Then a counter is created per group and a mask is built by comparing it to the how_many column; the mask keeps the first how_many rows of each group, but because the row order was shuffled, those are random rows:

print(df1.assign(counter=df1.groupby(['grp1','grp2']).cumcount().add(1),
                 mask=df1.groupby(['grp1','grp2']).cumcount().add(1).le(df1['how_many'])))
   grp1  grp2  how_many   val  counter   mask
15    b     2         4  6823        1   True
18    b     2         4  3810        2   True
10    a     3         2  6780        1   True
11    b     1         1  1765        1   True
14    b     2         4  4324        3   True
2     a     2         1  7148        1   True
16    b     2         4  9184        4   True
7     a     2         1  2657        2  False
9     a     1         2  4075        1   True
1     a     1         2  8244        2   True
13    b     1         1  9936        2  False
17    b     2         4  7498        5  False
8     a     2         1  9774        3  False
4     a     3         2  5577        2   True
12    b     1         1  5592        3  False
0     a     1         2  2993        3  False
5     a     3         2  5651        3  False
6     a     1         2  6297        4  False
3     a     1         2  5326        5  False

Filtering:

df1 = df1[df1.groupby(['grp1','grp2']).cumcount().add(1).le(df1['how_many'])]
print(df1)
   grp1  grp2  how_many   val
15    b     2         4  6823
18    b     2         4  3810
10    a     3         2  6780
11    b     1         1  1765
14    b     2         4  4324
2     a     2         1  7148
16    b     2         4  9184
9     a     1         2  4075
1     a     1         2  8244
4     a     3         2  5577

If necessary, sorting:

df1 = df1.sort_values(['grp1','grp2'])
print(df1)
   grp1  grp2  how_many   val
9     a     1         2  4075
1     a     1         2  8244
2     a     2         1  7148
10    a     3         2  6780
4     a     3         2  5577
11    b     1         1  1765
15    b     2         4  6823
18    b     2         4  3810
14    b     2         4  4324
16    b     2         4  9184
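
If the sampling has to be repeated many times (for example, to build confidence intervals), the vectorized approach can be wrapped in a helper; this is a minimal sketch, where the function name sample_per_group and the seed handling are my own additions:

def sample_per_group(df, seed=None):
    # shuffle all rows, then keep the first `how_many` rows of each group
    shuffled = df.sample(frac=1, random_state=seed)
    counter = shuffled.groupby(['grp1','grp2']).cumcount().add(1)
    return shuffled[counter.le(shuffled['how_many'])].sort_values(['grp1','grp2'])

# draw 1000 independent samples, one per seed
samples = [sample_per_group(df, seed=i) for i in range(1000)]
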
    Nice lambda and sample! – Celius Stingher Jun 01 '20 at 12:37
    Thank you for the answer. Can you please explain the second solution, maybe with inline comments? It seems to be a vectorized solution that should outperform the first one, and I intend to do this random sampling around 1k times to get confidence intervals for the result. – Emil Mirzayev Jun 01 '20 at 14:58