
Suppose I have a data set with 1000 rows and want to split it into a train set and a test set. I want the first 800 rows to go into the train set and the remaining 200 rows into the test set. Is that possible?

[Image: portion of the sample data set]

My Python test code for the train/test split looks like this:

from sklearn.model_selection import train_test_split

xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20)
ZF007
  • If I understand correctly, you don't want the shuffling that train_test_split does by default. If so, are you using numpy, pandas, or something else? – anand_v.singh Feb 24 '19 at 17:34
  • Pandas. I just want to make sure that my first 800 rows (in order) go into the train set and the remaining 200 go into the test set. – Al Amin Biswas Feb 24 '19 at 17:48

2 Answers


There are multiple ways to do this; I will run through a few of them.

Slicing is a powerful feature in Python and takes the form data[start:stop:step]. In your case, if you just want the first 800 rows, and your data frame is named train for the input features and Y for the output features, you can use

X_train = train[0:800]
X_test = train[800:]
y_train = Y[0:800]
y_test = Y[800:]

The iloc indexer on a DataFrame selects rows by integer position, regardless of what the index labels are, so you can use

X_train = train.iloc[0:800]
X_test = train.iloc[800:]
y_train = Y.iloc[0:800]
y_test = Y.iloc[800:]
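
To illustrate that iloc goes by position rather than by label, here is a minimal sketch. The frame, its "feature" column, and the "row_N" index labels are made up for the example; only the name train comes from the answer above.

```python
import pandas as pd

# Hypothetical 1000-row frame whose index labels are strings, not 0..999.
train = pd.DataFrame(
    {"feature": range(1000)},
    index=[f"row_{i}" for i in range(1000)],
)

# iloc counts positions, so the string labels do not matter here.
X_train = train.iloc[:800]   # positions 0..799
X_test = train.iloc[800:]    # positions 800..999

print(len(X_train), len(X_test))            # 800 200
print(X_train.index[-1], X_test.index[0])   # row_799 row_800
```

The same two iloc lines should work unchanged for a frame with a default 0..999 index.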

If you just have to split the data into two parts, you can even use df.head() and df.tail():

X_train = train.head(800)
X_test = train.tail(200)
y_train = Y.head(800)
y_test = Y.tail(200)

There are other ways to do it too. I would recommend the first method, as it is common across multiple data types and also works with a numpy array. To learn more about slicing, I suggest you check out Understanding slice notation; it is explained there for a list, but it works with almost all sequence types.
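
As a rough sketch of why plain slicing carries over to other types, the snippet below applies the same [:800] / [800:] split to a Python list and to a numpy array. The variable names and the 0..999 sample data are assumptions for illustration only.

```python
import numpy as np

# Hypothetical sample data: the integers 0..999 as a list and as an array.
data_list = list(range(1000))
data_array = np.arange(1000)

# The same slice notation splits both containers at position 800.
list_train, list_test = data_list[:800], data_list[800:]
array_train, array_test = data_array[:800], data_array[800:]

print(len(list_train), len(list_test))      # 800 200
print(array_train.shape, array_test.shape)  # (800,) (200,)
```

One thing to keep in mind: slicing a list makes a new list, while slicing a numpy array returns a view of the original, so call .copy() on the slices if you plan to modify them independently.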

anand_v.singh

Set shuffle=False so that train_test_split keeps the rows in their original order.

# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split
xtrain, xtest, ytrain, ytest = train_test_split(x, y, test_size=0.20, shuffle=False)
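
If you want to check the result, here is a small self-contained sketch. The x and y arrays are made-up stand-ins for your data (just the numbers 0..999), and it assumes a scikit-learn version that provides sklearn.model_selection.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 1000 samples with one feature each; y mirrors the row number.
x = np.arange(1000).reshape(-1, 1)
y = np.arange(1000)

# shuffle=False keeps the original order: first 80% train, last 20% test.
xtrain, xtest, ytrain, ytest = train_test_split(
    x, y, test_size=0.20, shuffle=False
)

print(len(xtrain), len(xtest))   # 800 200
print(ytrain[0], ytrain[-1])     # 0 799
print(ytest[0], ytest[-1])       # 800 999
```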
Penny