0

I have a DataFrame with 23,000 rows, and I want to split it into one DataFrame with 3,000 rows and another with 20,000 rows. I tried using iloc, but df.iloc[:, :20000] produces no usable result.

tushariyer

2 Answers

3

I would recommend using scikit-learn's train_test_split for a random split. Using .iloc just splits along the index, which is unlikely to give a representative split between train and test.

Something like this:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame(data=np.random.random((23000, 4)), columns=['X1', 'X2', 'X3', 'Y'])

# An integer test_size is the absolute number of test rows
train, test = train_test_split(df, test_size=3000)
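
If scikit-learn is not available, a similar random split can be sketched with pandas alone. This is an alternative technique, not the approach above: it uses DataFrame.sample and DataFrame.drop, and the column names and the random_state value are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy data with the same shape as in the question (column names are illustrative).
df = pd.DataFrame(np.random.random((23000, 4)), columns=["X1", "X2", "X3", "Y"])

# Draw 3000 rows at random without replacement; random_state makes the draw repeatable.
test = df.sample(n=3000, random_state=42)

# Everything that was not sampled becomes the training set.
train = df.drop(test.index)

print(train.shape, test.shape)
```

This relies on the index being unique, which is the case for the default RangeIndex.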
Brad Solomon
George Crowther
2

You need testing_df = df.iloc[:20000], which selects the first 20,000 rows. The remaining 3,000 rows are df.iloc[20000:].

Think of iloc's arguments as referencing [rows, columns].

Using df.iloc[:, :20000] as you currently have returns all rows and the first 20,000 columns, which will just be a copy of df unless you have more than 20,000 columns.
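
A minimal sketch of the difference between the two forms, assuming a toy frame of the same shape as in the question (the column names and variable names are made up for illustration):

```python
import numpy as np
import pandas as pd

# Toy data: 23000 rows, 4 columns (names are illustrative).
df = pd.DataFrame(np.random.random((23000, 4)), columns=["X1", "X2", "X3", "Y"])

# Row slicing: the first 20000 rows, then the remaining 3000.
first_part = df.iloc[:20000]
second_part = df.iloc[20000:]

# Column slicing: every row, and at most the first 20000 columns.
# With only 4 columns this selects the whole frame.
all_rows = df.iloc[:, :20000]

print(first_part.shape, second_part.shape, all_rows.shape)
```

Note that this split follows the existing row order, so it is only appropriate when the rows are already in random order or the order does not matter.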

See also: Selection by position.

Brad Solomon