
I have a dataframe which I want to group by the grp1 and grp2 columns and then draw a random sample from every group, where the sample size for each group comes from the how_many column.

This is my sample data:

   grp1  grp2  how_many    val
0     a     1         2  2993
1     a     1         2  8244
2     a     2         1  7148
3     a     1         2  5326
4     a     3         2  5577
5     a     3         2  5651
6     a     1         2  6297
7     a     2         1  2657
8     a     2         1  9774
9     a     1         2  4075
10    a     3         2  6780
11    b     1         1  1765
12    b     1         1  5592
13    b     1         1  9936
14    b     2         4  4324
15    b     2         4  6823
16    b     2         4  9184
17    b     2         4  7498
18    b     2         4  3810
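
For reproducibility, this sample data can be recreated with (values copied from the table above):

import pandas as pd

df = pd.DataFrame({
    'grp1': ['a'] * 11 + ['b'] * 8,
    'grp2': [1, 1, 2, 1, 3, 3, 1, 2, 2, 1, 3, 1, 1, 1, 2, 2, 2, 2, 2],
    'how_many': [2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 4, 4, 4, 4, 4],
    'val': [2993, 8244, 7148, 5326, 5577, 5651, 6297, 2657, 9774, 4075,
            6780, 1765, 5592, 9936, 4324, 6823, 9184, 7498, 3810],
})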

This is the expected result (random, of course):

  grp1  grp2  how_many    val
0    a     1         2  2993
1    a     1         2  5326
2    a     2         1  9774
3    a     3         2  6780
4    a     3         2  5651
5    b     1         1  5592
6    b     2         4  6823
7    b     2         4  9184
8    b     2         4  7498
9    b     2         4  3810

My approach was to follow these instructions; however, in my case I do not have a fixed sample size: it varies based on a column value.

I also tried to use a MultiIndex on the groupby columns, but got MemoryError: Unable to allocate 107. GiB for an array with shape (57244869081,) and data type int16. And the data above is just a small sample of my full data.

Any help would be appreciated.


1 Answer


Both solutions work if the column how_many has the same value within each group.

Create a lambda function with DataFrame.sample and call it per group:

df1 = (df.groupby(['grp1','grp2'])
         .apply(lambda x: x.sample(x['how_many'].iat[0]))
         .reset_index(drop=True))
print(df1)
  grp1  grp2  how_many   val
0    a     1         2  2993
1    a     1         2  6297
2    a     2         1  9774
3    a     3         2  5651
4    a     3         2  5577
5    b     1         1  1765
6    b     2         4  6823
7    b     2         4  4324
8    b     2         4  7498
9    b     2         4  9184
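
If reproducible draws are needed, a random_state can be passed to DataFrame.sample inside the lambda; a minimal variant of the same approach, where the seed value 42 is only an example:

df1 = (df.groupby(['grp1','grp2'])
         .apply(lambda x: x.sample(x['how_many'].iat[0], random_state=42))
         .reset_index(drop=True))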

An out-of-the-box solution for improved performance:

First shuffle all rows with frac=1 in DataFrame.sample, then create a counter per group with GroupBy.cumcount and compare it to how_many with Series.le (<=); last, if necessary, sort by the group columns:

df1 = df.sample(frac=1)
df1 = df1[df1.groupby(['grp1','grp2']).cumcount().add(1).le(df1['how_many'])]
df1 = df1.sort_values(['grp1','grp2'])
print(df1)
   grp1  grp2  how_many   val
0     a     1         2  2993
1     a     1         2  8244
2     a     2         1  7148
5     a     3         2  5651
4     a     3         2  5577
13    b     1         1  9936
16    b     2         4  9184
17    b     2         4  7498
15    b     2         4  6823
14    b     2         4  4324

Details:

First, the row order is randomly shuffled:

df1 = df.sample(frac=1)
print(df1)
   grp1  grp2  how_many   val
15    b     2         4  6823
18    b     2         4  3810
10    a     3         2  6780
11    b     1         1  1765
14    b     2         4  4324
2     a     2         1  7148
16    b     2         4  9184
7     a     2         1  2657
9     a     1         2  4075
1     a     1         2  8244
13    b     1         1  9936
17    b     2         4  7498
8     a     2         1  9774
4     a     3         2  5577
12    b     1         1  5592
0     a     1         2  2993
5     a     3         2  5651
6     a     1         2  6297
3     a     1         2  5326

Then a counter is created per group and a mask is built by comparing it to the how_many column; the mask keeps the first how_many rows of each group, but because the row order was shuffled, those are random rows:

print(df1.assign(counter=df1.groupby(['grp1','grp2']).cumcount().add(1),
                 mask=df1.groupby(['grp1','grp2']).cumcount().add(1).le(df1['how_many'])))
   grp1  grp2  how_many   val  counter   mask
15    b     2         4  6823        1   True
18    b     2         4  3810        2   True
10    a     3         2  6780        1   True
11    b     1         1  1765        1   True
14    b     2         4  4324        3   True
2     a     2         1  7148        1   True
16    b     2         4  9184        4   True
7     a     2         1  2657        2  False
9     a     1         2  4075        1   True
1     a     1         2  8244        2   True
13    b     1         1  9936        2  False
17    b     2         4  7498        5  False
8     a     2         1  9774        3  False
4     a     3         2  5577        2   True
12    b     1         1  5592        3  False
0     a     1         2  2993        3  False
5     a     3         2  5651        3  False
6     a     1         2  6297        4  False
3     a     1         2  5326        5  False

Filtering:

df1 = df1[df1.groupby(['grp1','grp2']).cumcount().add(1).le(df1['how_many'])]
print(df1)
   grp1  grp2  how_many   val
15    b     2         4  6823
18    b     2         4  3810
10    a     3         2  6780
11    b     1         1  1765
14    b     2         4  4324
2     a     2         1  7148
16    b     2         4  9184
9     a     1         2  4075
1     a     1         2  8244
4     a     3         2  5577

If necessary, sorting:

df1 = df1.sort_values(['grp1','grp2'])
print(df1)
   grp1  grp2  how_many   val
9     a     1         2  4075
1     a     1         2  8244
2     a     2         1  7148
10    a     3         2  6780
4     a     3         2  5577
11    b     1         1  1765
15    b     2         4  6823
18    b     2         4  3810
14    b     2         4  4324
16    b     2         4  9184
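
If the sampling has to be repeated many times (for example, to build confidence intervals), the vectorized approach can be wrapped in a helper; this is a minimal sketch, where the function name sample_per_group and the seed handling are my own additions:

def sample_per_group(df, seed=None):
    # shuffle all rows, then keep the first `how_many` rows of each group
    shuffled = df.sample(frac=1, random_state=seed)
    counter = shuffled.groupby(['grp1','grp2']).cumcount().add(1)
    return shuffled[counter.le(shuffled['how_many'])].sort_values(['grp1','grp2'])

# draw 1000 independent samples, one per seed
samples = [sample_per_group(df, seed=i) for i in range(1000)]
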
    Nice lambda and sample! – Celius Stingher Jun 01 '20 at 12:37
    Thank you for the answer. Can you please explain the second solution, maybe with inline comments? It seems to be a vectorized solution that should outperform the first one, and I intend to do this random sampling around 1k times to get confidence intervals for the result. – Emil Mirzayev Jun 01 '20 at 14:58