Split dataframe into testing_df and validation_df

Question

I have a dataframe with 23000 instances, and I want to split it so that I have one df with 3000 rows and another with 20000 rows. I tried using iloc, but when I do df.iloc[:, :20000] it produces no usable result.

you may want to check [this question + answers](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test) — MaxU - stand with Ukraine, Oct 17 '17 at 20:18

score 3 · Answer 1 · edited Oct 17 '17 at 20:15

I would recommend using scikit-learn's train_test_split for a random sample split. Using .iloc just splits by row position, which is unlikely to give a representative split between train and test.

Something like this:

import numpy as np

import pandas as pd

from sklearn.model_selection import train_test_split

df = pd.DataFrame(data = np.random.random((23000, 4)), columns = ['X1', 'X2', 'X3', 'Y'])

train, test = train_test_split(df, test_size = 3000)
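As a quick check on the snippet above, here is a minimal self-contained sketch that confirms the resulting sizes. The `random_state=42` argument and the seeded generator are my additions for reproducibility and are not part of the original answer; the column names are the same placeholder ones.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Placeholder data with the same shape as in the question (23000 rows)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((23000, 4)), columns=["X1", "X2", "X3", "Y"])

# An integer test_size is read as an absolute number of rows,
# so 3000 rows go to test and the remaining 20000 to train.
# random_state (an assumption here) makes the shuffle repeatable.
train, test = train_test_split(df, test_size=3000, random_state=42)

print(train.shape, test.shape)
```

Because the rows are shuffled before splitting, the two frames keep their original index labels, and no label appears in both.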

Brad Solomon · Accepted Answer · 2017-10-17T20:00:56.623

2

You need testing_df = df.iloc[:20000].

Think of iloc's arguments as referencing [rows, columns].

Using df.iloc[:, :20000] as you currently have returns all rows and the first 20,000 columns, which will just be a copy of df unless you currently have > 20,000 columns.
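A small sketch of the difference, using placeholder random data with four columns. The name validation_df and the choice of giving it the last 3000 rows are assumptions based on the question title, not something the answer states.

```python
import numpy as np
import pandas as pd

# Placeholder data: 23000 rows, 4 columns
df = pd.DataFrame(np.random.random((23000, 4)), columns=["X1", "X2", "X3", "Y"])

# Row slicing: the first positional argument of iloc selects rows
testing_df = df.iloc[:20000]       # first 20000 rows
validation_df = df.iloc[20000:]    # remaining 3000 rows (assumed naming)

# Column slicing: all rows, first 20000 columns. Only 4 columns exist,
# so the slice is clipped and the result has the same shape as df.
all_columns = df.iloc[:, :20000]

print(testing_df.shape, validation_df.shape, all_columns.shape)
```

Note that this is a positional split, so if the rows are sorted in some meaningful way, the two pieces will not be random samples.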
