5

How do I split a dataframe into multiple dataframes so that each one contains an equal number of randomly chosen rows? The split is not based on a specific column.

For instance, I have a dataframe with 100 rows and 30 columns. I want to divide this data into 5 lots. Each resulting dataframe should have 20 records with the same 30 columns, there should be no duplication across the 5 lots, and the rows should be picked at random. I don't want the random picking to be based on a single column.

One way I thought of is to use the index and numpy to divide the rows into lots, and then use that to split the dataframe. I wanted to see if someone has an easy, pandas way of doing it.

Anil K

2 Answers

8

If you do not mind the new dataframes potentially sharing some rows, you could use sample, where frac specifies the fraction of the dataframe that you want:

df1 = df.sample(frac=0.5) # df1 is now a random sample of half the dataframe

EDIT:

If you want to avoid duplicates, you can shuffle the dataframe with shuffle from sklearn and then slice it into consecutive, non-overlapping blocks:

from sklearn.utils import shuffle

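df = shuffle(df)  # returns a copy with the rows in random order
df1 = df[0:3]     # first three shuffled rows
df2 = df[3:6]     # next three shuffled rows, disjoint from df1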
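
The same shuffle-then-slice idea can be generalized to the 5 lots of 20 rows from the question without sklearn. This is only a minimal sketch: the helper name split_random, the seed argument, and the toy 100 x 30 dataframe are made up for illustration and are not from the original posts.

```python
import numpy as np
import pandas as pd


def split_random(df, n_lots, seed=None):
    """Split df into n_lots disjoint dataframes with randomly assigned rows.

    Hypothetical helper for illustration, not a pandas function.
    """
    rng = np.random.default_rng(seed)
    # Shuffle the row positions 0..len(df)-1 once.
    positions = rng.permutation(len(df))
    # Cut the shuffled positions into n_lots nearly equal chunks and
    # pick the corresponding rows; each row lands in exactly one chunk.
    return [df.iloc[chunk] for chunk in np.array_split(positions, n_lots)]


# Toy data standing in for the 100 x 30 dataframe from the question.
df = pd.DataFrame(np.arange(3000).reshape(100, 30))
lots = split_random(df, n_lots=5, seed=0)
```

Each element of lots keeps all 30 columns. If the row count does not divide evenly by n_lots, np.array_split makes the first few lots one row longer than the rest.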
1

Depending on your needs, you could use pandas.DataFrame.sample() to randomly sample your original dataframe, df.

df1 = df.sample(n=3) 
df2 = df.sample(n=3)

gives you two subsets, each with 3 rows. Both are random and the same size, but the two calls are independent, so the same row can end up in both subsets.
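
If the subsets must not share rows, one option is to drop the rows already drawn before sampling again. A small sketch, assuming the dataframe has a unique index; the toy data, the remaining variable, and the random_state values are placeholders, not part of the original answer.

```python
import pandas as pd

# Toy dataframe with a unique default index (an assumption of this sketch).
df = pd.DataFrame({"a": range(10), "b": range(10, 20)})

# Draw the first subset.
df1 = df.sample(n=3, random_state=1)

# Remove the rows already drawn, then sample from what is left,
# so df2 cannot contain any row that is in df1.
remaining = df.drop(df1.index)
df2 = remaining.sample(n=3, random_state=2)
```

Repeating the drop-and-sample step gives further disjoint subsets until the rows run out.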

SimplySnee