
I want to split a DataFrame into 4 parts using stratified sampling, so that every category from column 'B' is present in each chunk. If a category does not have enough records for all chunks, the same record should be copied into the remaining chunks.

import numpy as np
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo',
                         'foo', 'bar', 'foo', 'bar',
                         'foo', 'bar', 'foo', 'foo', 'bar'],
                   'B': ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three',
                         'one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three', 'four'],
                   'C': np.random.randn(17), 'D': np.random.randn(17)})

print(df)

      A      B         C         D
0   foo    one  0.960627  0.318723
1   bar    one  0.269439 -0.945565
2   foo    two  0.210376  0.765680
3   bar  three -0.375095 -1.617334
4   foo    two -1.910716 -0.532117
5   bar    two -0.277426  0.019717
6   foo    one -0.260074  1.384464
7   foo  three  0.072119 -1.077725
8   foo    one  0.093446 -0.683513
9   bar    one -0.154885 -1.453996
10  foo    two -1.258207  1.406615
11  bar  three -0.003332 -0.083092
12  foo    two  1.250562  0.519337
13  bar    two -0.837681 -1.465363
14  foo    one -0.403992 -0.133496
15  foo  three -0.757623 -0.459532
16  bar   four -2.071840  0.802953

The output should look like the following (all categories from column 'B' should be present in each chunk; the index doesn't matter):

     A      B         C         D
0   foo    one  0.200466 -0.394136
2   foo    two  0.086008 -0.528286
3   bar  three -1.979613 -1.345405
8   foo    one -1.195563 -0.832880
15  foo  three -0.737060 -0.437047
16  bar   four -2.071840  0.802953

     A      B         C         D
1   bar    one  1.177119  0.693766
4   foo    two  0.452803 -0.595433
7   foo  three  1.285687  1.107021
12  foo    two  1.746976  1.449390
16  bar   four -2.071840  0.802953

     A      B         C         D
6   foo    one -0.095485  0.129541
5   bar    two  0.803417 -0.219461
7   foo  three  1.285687  1.107021
13  bar    two  1.166246 -1.711505
16  bar   four -2.071840  0.802953

     A      B         C         D
9   bar    one  2.001238 -0.283411
10  foo    two  0.865580  0.052533
11  bar  three -0.437604 -0.652073
14  foo    one -0.655985 -0.942792
16  bar   four -2.071840  0.802953
Siddeshwar

1 Answer


This may help (from "Split large Dataframe into smaller equal dataframes"), where x_train is the DataFrame to split:

df1, df2, df3, df4 = np.array_split(x_train, 4)
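Note that np.array_split only cuts the rows into consecutive, nearly equal pieces and does not look at column 'B', so on its own it will not put every category into every chunk. A minimal sketch of the stratified behaviour the question describes might look like the following. The helper name stratified_chunks, its seed parameter, and the demo frame are hypothetical, and the sketch assumes the DataFrame has a unique index: rows of each category are shuffled and dealt out round-robin, and a category with fewer rows than chunks has its rows repeated.

```python
import numpy as np
import pandas as pd


def stratified_chunks(df, col, n_chunks, seed=0):
    """Split df into n_chunks so each category of `col` is in every chunk.

    Hypothetical helper, assuming df has a unique index.
    """
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_chunks)]
    for _, group in df.groupby(col, sort=False):
        # Shuffle the index labels of this category.
        labels = rng.permutation(group.index.to_numpy())
        # Too few rows for all chunks: repeat the labels cyclically.
        if len(labels) < n_chunks:
            labels = np.resize(labels, n_chunks)
        # Deal the labels out round-robin across the chunks.
        for i, label in enumerate(labels):
            parts[i % n_chunks].append(label)
    return [df.loc[sorted(p)] for p in parts]


# Demo frame with the same 'B' column as in the question.
demo = pd.DataFrame({
    'B': ['one', 'one', 'two', 'three',
          'two', 'two', 'one', 'three',
          'one', 'one', 'two', 'three',
          'two', 'two', 'one', 'three', 'four'],
    'C': np.arange(17),
})

chunks = stratified_chunks(demo, 'B', 4)
for chunk in chunks:
    print(chunk, end='\n\n')
```

With this dealing order the earlier chunks receive the leftover rows of each category, so chunk sizes can differ by a few rows; rotating the starting chunk per category would be one way to even that out.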

maximus