I want to split a DataFrame into 4 parts using stratified sampling, so that every category from column 'B' is present in each chunk. If a category does not have enough records for all chunks, the same record should be copied into the remaining chunks.
import numpy as np
import pandas as pd

# 17 rows; category 'four' in column 'B' occurs only once
df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17),
                   'D': np.random.randn(17)})
print(df)
      A      B         C         D
0   foo    one  0.960627  0.318723
1   bar    one  0.269439 -0.945565
2   foo    two  0.210376  0.765680
3   bar  three -0.375095 -1.617334
4   foo    two -1.910716 -0.532117
5   bar    two -0.277426  0.019717
6   foo    one -0.260074  1.384464
7   foo  three  0.072119 -1.077725
8   foo    one  0.093446 -0.683513
9   bar    one -0.154885 -1.453996
10  foo    two -1.258207  1.406615
11  bar  three -0.003332 -0.083092
12  foo    two  1.250562  0.519337
13  bar    two -0.837681 -1.465363
14  foo    one -0.403992 -0.133496
15  foo  three -0.757623 -0.459532
16  bar   four -2.071840  0.802953
The output should look like the following (every category from column 'B' should be present in each chunk; the index doesn't matter):
      A      B         C         D
0   foo    one  0.200466 -0.394136
2   foo    two  0.086008 -0.528286
3   bar  three -1.979613 -1.345405
8   foo    one -1.195563 -0.832880
15  foo  three -0.737060 -0.437047
16  bar   four -2.071840  0.802953

      A      B         C         D
1   bar    one  1.177119  0.693766
4   foo    two  0.452803 -0.595433
7   foo  three  1.285687  1.107021
12  foo    two  1.746976  1.449390
16  bar   four -2.071840  0.802953

      A      B         C         D
6   foo    one -0.095485  0.129541
5   bar    two  0.803417 -0.219461
7   foo  three  1.285687  1.107021
13  bar    two  1.166246 -1.711505
16  bar   four -2.071840  0.802953

      A      B         C         D
9   bar    one  2.001238 -0.283411
10  foo    two  0.865580  0.052533
11  bar  three -0.437604 -0.652073
14  foo    one -0.655985 -0.942792
16  bar   four -2.071840  0.802953
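
To make the requirement concrete, here is a rough sketch of the behavior I'm after. It is not a definitive solution. The helper name stratified_chunks and its seed parameter are made up for illustration, and it assumes the DataFrame index is unique. It shuffles the rows of each category, deals them out round-robin across the chunks, and then copies an existing row into any chunk that received nothing from that category.

```python
import numpy as np
import pandas as pd


def stratified_chunks(df, col, n_chunks, seed=0):
    """Hypothetical helper: split df into n_chunks parts so that every
    category in `col` shows up in each part. Assumes a unique index."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_chunks)]  # row labels collected per chunk

    for _, group in df.groupby(col, sort=False):
        # Shuffle the row labels that belong to this category.
        labels = rng.permutation(group.index.to_numpy())

        # Deal the rows out round-robin so they spread evenly over the chunks.
        for i, label in enumerate(labels):
            parts[i % n_chunks].append(label)

        # Fewer rows than chunks: reuse existing rows for the chunks left empty.
        for i in range(len(labels), n_chunks):
            parts[i].append(labels[i % len(labels)])

    return [df.loc[sorted(p)] for p in parts]
```

With the frame above, stratified_chunks(df, 'B', 4) would return a list of four DataFrames, each containing row 16 (the only 'four' record). The exact assignment of the other rows depends on the shuffle, so it will not match the sample output row for row.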