How to split a big dataset into train, validation and testing sets

Question

I have a dataset with 30 classes, and each class has a different idx. I want to split this dataset into train, validation, and test sets of 70%, 20%, and 10% respectively, in Python. Can you please suggest how to write the code? I am new to coding.

Look into _scikit-learn_. Link -> http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html — BcK, Apr 20 '18 at 08:44
"I am new to coding": then SO is not the place to ask for teaching. Find some tutorials for beginners; SO is about helping people who tried but can't make it work for some reason. One can't become a pro overnight, and surely not while skipping the basics. — N.K, Apr 20 '18 at 08:45
Possible duplicate of [How to split data into 3 sets (train, validation and test)?](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test) — CentAu, Apr 15 '19 at 16:42

score 2 · Answer 1 · answered Apr 20 '18 at 08:44

You could use scikit-learn:

from sklearn.model_selection import train_test_split
# Hold out 10% as the test set; the remaining 90% can be split again into train and validation.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

Then split the training set again to create the validation set.
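As a minimal sketch of that two-step idea for the 70/20/10 split asked about in the question: the toy arrays X and y below are made up for illustration, and the names X_rest, X_val, and so on are just placeholders. Passing absolute sample counts (rather than fractions) to test_size avoids having to work out what fraction of the remainder the validation set should be, and stratify is an optional extra that keeps all 30 classes represented in each subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 300 samples, 30 classes with 10 samples each.
X = np.arange(600).reshape(300, 2)
y = np.repeat(np.arange(30), 10)

# Target sizes as absolute counts of the full dataset.
n_total = len(X)
n_test = round(0.10 * n_total)  # 10% for testing
n_val = round(0.20 * n_total)   # 20% for validation

# Step 1: hold out the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, stratify=y, random_state=42
)

# Step 2: carve the validation set out of what is left; the rest is training data.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

If you prefer fractions, the second call needs the validation share relative to the remaining data (0.20 / 0.90), not 0.20 itself.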

answered by Samuel Muiruri

score 1 · Answer 2 · answered Nov 20 '20 at 20:00

The code below produces a 60%/20%/20% split into training, dev, and test sets. It assumes that data is a pandas DataFrame.

import numpy as np

# Shuffle all rows, then cut at 60% and 80% of the length to get the three subsets.
train, dev, test = np.split(data.sample(frac=1), [int(.6*len(data)), int(.8*len(data))])

print("Train data is: ", train[:5], "\n\n", "Length of train data is: ", len(train), "\n")
print("Dev data is: ", dev[:5], "\n\n", "Length of dev data is: ", len(dev), "\n")
print("Test data is: ", test[:5], "\n\n", "Length of test data is: ", len(test))
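The same cut-point idea can be adapted to the 70/20/10 split from the question by moving the cuts to 70% and 90%. The sketch below is one possible variant, not the answer's own code: it uses plain iloc slicing instead of np.split, and the toy DataFrame with its column names is an assumption made only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy DataFrame standing in for `data`.
data = pd.DataFrame({"feature": np.arange(100), "label": np.arange(100) % 30})

# Shuffle the rows once; random_state makes the shuffle repeatable.
shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

# Cut points at 70% and 90% of the rows give a 70/20/10 split.
train_end = round(0.7 * len(shuffled))
val_end = round(0.9 * len(shuffled))

train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]

print(len(train), len(val), len(test))
```

Note that a plain shuffle like this does not guarantee that every class appears in every subset; with 30 classes and a small test set, a stratified split may be the safer choice.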
