I have a dataframe that I want to group by the grp1 and grp2 columns, and then draw a random sample from every group, with the sample size given by the how_many column.
This is my sample data:
    grp1  grp2  how_many   val
0      a     1         2  2993
1      a     1         2  8244
2      a     2         1  7148
3      a     1         2  5326
4      a     3         2  5577
5      a     3         2  5651
6      a     1         2  6297
7      a     2         1  2657
8      a     2         1  9774
9      a     1         2  4075
10     a     3         2  6780
11     b     1         1  1765
12     b     1         1  5592
13     b     1         1  9936
14     b     2         4  4324
15     b     2         4  6823
16     b     2         4  9184
17     b     2         4  7498
18     b     2         4  3810
This is the expected result (random of course):
   grp1  grp2  how_many   val
0     a     1         2  2993
1     a     1         2  5326
2     a     2         1  9774
3     a     3         2  6780
4     a     3         2  5651
5     b     1         1  5592
6     b     2         4  6823
7     b     2         4  9184
8     b     2         4  7498
9     b     2         4  3810
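To make the intended operation concrete, here is a minimal sketch on the sample data above. It is only one possible way to express it, not a solution verified against the full dataset: it shuffles all rows once, then keeps the first how_many rows of each (grp1, grp2) group. It assumes how_many is constant within a group; the names shuffled, rank, and sampled, and the fixed random_state, are illustrative choices.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    "grp1": ["a"] * 11 + ["b"] * 8,
    "grp2": [1, 1, 2, 1, 3, 3, 1, 2, 2, 1, 3, 1, 1, 1, 2, 2, 2, 2, 2],
    "how_many": [2, 2, 1, 2, 2, 2, 2, 1, 1, 2, 2, 1, 1, 1, 4, 4, 4, 4, 4],
    "val": [2993, 8244, 7148, 5326, 5577, 5651, 6297, 2657, 9774, 4075,
            6780, 1765, 5592, 9936, 4324, 6823, 9184, 7498, 3810],
})

# Shuffle every row once (random_state is fixed only for reproducibility).
shuffled = df.sample(frac=1, random_state=0)

# Position of each row within its (grp1, grp2) group, in shuffled order.
rank = shuffled.groupby(["grp1", "grp2"]).cumcount()

# Keep the first how_many rows of each group, then restore the original order.
sampled = shuffled[rank < shuffled["how_many"]].sort_index()

print(sampled.reset_index(drop=True))
```

If how_many is larger than the size of a group, this keeps the whole group instead of raising an error.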
My approach was to follow these instructions; however, in my case the sample size is not fixed: it varies based on a column value.
I also tried to use multi_index on the groupby columns, but got this error: MemoryError: Unable to allocate 107. GiB for an array with shape (57244869081,) and data type int16. It is just a small sample of my data.
Any help would be appreciated.