I have a dataframe with 23,000 rows, and I want to split it into one df with 3,000 rows and another with 20,000 rows. I tried using `iloc`, but `df.iloc[:, :20000]` does not produce a usable result.

tushariyer
- you may want to check [this question + answers](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test) – MaxU - stand with Ukraine Oct 17 '17 at 20:18
- @MaxU Dupe seems ripe enough to close. – cs95 Oct 17 '17 at 20:18
2 Answers
3
I would recommend using scikit-learn's `train_test_split` for a random split. Using `.iloc` only splits by row position, which is unlikely to give a representative split between train and test.
Something like this:
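    import numpy as np  # needed for np.random below
    import pandas as pd
    from sklearn.model_selection import train_test_split
    # 23,000 rows of random data standing in for the real dataframe
    df = pd.DataFrame(data=np.random.random((23000, 4)), columns=['X1', 'X2', 'X3', 'Y'])
    # an integer test_size is an absolute row count: 3,000 test rows, 20,000 train rows
    train, test = train_test_split(df, test_size=3000)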
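
If scikit-learn is not available, roughly the same random split can be sketched with pandas alone. This is only an illustration: the random data, the column names, and the `random_state` value are placeholders, and it assumes the dataframe has a unique index so that `drop` removes exactly the sampled rows.

```python
import numpy as np
import pandas as pd

# Placeholder data: 23,000 rows with a default (unique) RangeIndex
df = pd.DataFrame(np.random.random((23000, 4)), columns=["X1", "X2", "X3", "Y"])

# Draw 3,000 rows at random; random_state makes the draw repeatable
test = df.sample(n=3000, random_state=42)

# Everything that was not sampled becomes the training set
train = df.drop(test.index)

print(train.shape, test.shape)
```

Unlike a plain positional slice, both pieces here are drawn from across the whole dataframe, which matters if the rows are stored in some meaningful order.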

Brad Solomon

George Crowther
2
You need `testing_df = df.iloc[:20000]`. The remaining 3,000 rows are then `df.iloc[20000:]`.

Think of `iloc`'s arguments as referencing `[rows, columns]`. Using `df.iloc[:, :20000]`, as you currently have, returns all rows and the first 20,000 columns, which will just be a copy of `df` unless you have more than 20,000 columns.

See also: Selection by position.
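
To make the `[rows, columns]` distinction concrete, here is a small sketch on placeholder data. The dataframe contents and the variable names `first`, `rest`, and `all_cols` are made up for illustration.

```python
import numpy as np
import pandas as pd

# Placeholder data: 23,000 rows, 4 columns
df = pd.DataFrame(np.random.random((23000, 4)), columns=["X1", "X2", "X3", "Y"])

# Row slices: the first 20,000 rows, then whatever is left
first = df.iloc[:20000]
rest = df.iloc[20000:]

# Column slice: all rows and up to 20,000 columns (there are only 4)
all_cols = df.iloc[:, :20000]

print(first.shape)     # (20000, 4)
print(rest.shape)      # (3000, 4)
print(all_cols.shape)  # (23000, 4)
```

Note that this split is purely positional, so it only gives a fair train/test division if the rows are already in random order.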

Brad Solomon