7

I want to partition a pandas DataFrame into ten disjoint, equally-sized, randomly composed subsets.

I know I can randomly sample one tenth of the original pandas DataFrame using:

partition_1 = df.sample(frac=0.1)  # df is the original DataFrame

However, how can I obtain the other nine partitions? If I call df.sample(frac=0.1) again, the new subset may overlap with the first one, so the subsets are not guaranteed to be disjoint.

Thanks for the help!

Tomas
  • This has already been answered: just combine [this](http://stackoverflow.com/a/17315875/2077270) with [this](http://stackoverflow.com/a/15772356/2077270) – dermen Jul 25 '16 at 14:59

3 Answers

4

Starting with this DataFrame:

import numpy as np
import pandas as pd
dfm = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo']*2,
                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three']*2})

     A      B
0   foo    one
1   bar    one
2   foo    two
3   bar  three
4   foo    two
5   bar    two
6   foo    one
7   foo  three
8   foo    one
9   bar    one
10  foo    two
11  bar  three
12  foo    two
13  bar    two
14  foo    one
15  foo  three

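Usage:
Change 4 to 10 to get ten partitions, and use [i] to select the i-th one. Note that every call to np.random.permutation draws a new shuffle, so to guarantee disjoint partitions, store the list returned by np.array_split once and index into that list.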

np.random.seed(32) # for reproducible results.
np.array_split(dfm.reindex(np.random.permutation(dfm.index)),4)[1]
      A    B
2   foo  two
5   bar  two
10  foo  two
12  foo  two

np.array_split(dfm.reindex(np.random.permutation(dfm.index)),4)[3]

     A      B
13  bar    two
11  bar  three
0   foo    one
7   foo  three
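
Because each call to np.random.permutation produces a fresh shuffle, slices taken from separate calls are not guaranteed to be disjoint. The following is a minimal sketch that shuffles once and keeps all partitions in a single list. The helper name random_partitions and its seed parameter are illustrative assumptions, not part of pandas or NumPy.

```python
import numpy as np
import pandas as pd


def random_partitions(df, n_parts, seed=None):
    """Split df into n_parts disjoint, randomly composed pieces (hypothetical helper)."""
    rng = np.random.RandomState(seed)            # local generator, reproducible if seed is set
    positions = rng.permutation(len(df))         # one shuffle of the row positions
    chunks = np.array_split(positions, n_parts)  # near-equal chunks of positions
    return [df.iloc[chunk] for chunk in chunks]


dfm = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'] * 2,
                    'B': ['one', 'one', 'two', 'three', 'two', 'two', 'one', 'three'] * 2})

parts = random_partitions(dfm, 4, seed=32)
print(parts[1])  # any parts[i] is disjoint from every other parts[j]
```

Splitting the array of row positions, rather than the DataFrame itself, keeps np.array_split working on a plain NumPy array.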
Merlin
2

Use np.random.permutation:

df.loc[np.random.permutation(df.index)]

This shuffles the rows of the DataFrame and keeps the column names. Afterwards you can split the shuffled DataFrame into 10 parts.
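
The splitting step could look roughly like the sketch below. It assumes the index labels are unique (with duplicate labels, .loc would return repeated rows), and the toy DataFrame and the names n_parts and bounds are made up for illustration.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": range(25)})  # toy data with a unique index

# Shuffle the rows once.
shuffled = df.loc[np.random.permutation(df.index)]

# Cut the shuffled frame into consecutive blocks of near-equal length.
n_parts = 10
bounds = np.linspace(0, len(shuffled), n_parts + 1).astype(int)
parts = [shuffled.iloc[start:stop] for start, stop in zip(bounds[:-1], bounds[1:])]
```

Since the blocks are consecutive, non-overlapping slices of one shuffled frame, they are disjoint and together contain every row exactly once.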

SerialDev
2

Say df is your dataframe, and you want N_PARTITIONS partitions of roughly equal size (they will be of exactly equal size if len(df) is divisible by N_PARTITIONS).

Use np.random.permutation to permute the array np.arange(len(df)). Then take slices of that array with step N_PARTITIONS, and extract the corresponding rows of your dataframe with .iloc[].

import numpy as np

permuted_indices = np.random.permutation(len(df))

dfs = []
for i in range(N_PARTITIONS):
    dfs.append(df.iloc[permuted_indices[i::N_PARTITIONS]])

Since you are on Python 2.7, it might be better to replace range(N_PARTITIONS) with xrange(N_PARTITIONS), which iterates lazily instead of building a list.
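
As a quick sanity check, here is a sketch of the same strided slicing on a toy DataFrame whose length is not divisible by N_PARTITIONS. The 23-row frame and the column name value are assumptions made for the example, and the code is written for Python 3.

```python
import numpy as np
import pandas as pd

N_PARTITIONS = 10
df = pd.DataFrame({"value": np.arange(23)})  # toy data, 23 rows

permuted_indices = np.random.permutation(len(df))
dfs = [df.iloc[permuted_indices[i::N_PARTITIONS]] for i in range(N_PARTITIONS)]

# Partition sizes differ by at most one row.
sizes = [len(part) for part in dfs]  # [3, 3, 3, 2, 2, 2, 2, 2, 2, 2]

# The partitions are disjoint and together cover every row exactly once.
all_rows = pd.concat(dfs)
assert not all_rows.index.duplicated().any()
assert len(all_rows) == len(df)
```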

Alicia Garcia-Raboso