
I have a dataset with 30 classes, and each class has a different index. I want to split this dataset into train, validation, and test sets of 70%, 20%, and 10% respectively in Python. Can you please suggest how to write the code? I am new to coding.

  • The dataset consists of RGB images. – Ramireddy Devaram Apr 20 '18 at 08:38
  • Look into _scikit-learn_: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – BcK Apr 20 '18 at 08:44
  • "I am new to coding": then SO is not the place to ask to be taught. Find some tutorials for beginners; SO is about helping people who tried but couldn't make it work for some reason. One can't become a pro from one day to the next, and surely not while skipping the basics. – N.K Apr 20 '18 at 08:45
  • Possible duplicate of [How to split data into 3 sets (train, validation and test)?](https://stackoverflow.com/questions/38250710/how-to-split-data-into-3-sets-train-validation-and-test) – CentAu Apr 15 '19 at 16:42

2 Answers


You could use scikit-learn's train_test_split:

from sklearn.model_selection import train_test_split
# X holds the samples, y the class labels; test_size=0.10 holds out 10% as the test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

Then split the training portion again, with a second call to train_test_split, to create the validation set.
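The two-step idea could be sketched as follows for the 70/20/10 split in the question. The arrays `X` and `y` here are made-up stand-in data (300 samples, 30 classes), and passing `stratify` is an addition not in the answer above; it is meant to keep every class represented in each subset. Sizes are passed as absolute counts so the second split does not depend on floating-point rounding.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 300 samples, 30 classes, 10 samples per class.
X = np.arange(300).reshape(300, 1)
y = np.repeat(np.arange(30), 10)

n = len(X)
n_test = round(0.10 * n)  # 10% of the full dataset
n_val = round(0.20 * n)   # 20% of the full dataset

# Step 1: hold out the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=n_test, stratify=y, random_state=42
)

# Step 2: carve the validation set out of what remains; the rest is train.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=n_val, stratify=y_rest, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

With real image data, `X` would typically be a list of file paths or an array of images, and `y` the class index of each one.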

Samuel Muiruri

The code below produces a 60%, 20%, 20% split for the training, dev, and test sets, assuming data is a pandas DataFrame.

import numpy as np

train, dev, test = np.split(data.sample(frac=1), [int(.6*len(data)), int(.8*len(data))])

print("Train data is: ", train[:5], "\n\n", "Length of train data is: ", len(train), "\n")
print("Dev data is: ", dev[:5], "\n\n", "Length of dev data is: ", len(dev), "\n")
print("Test data is: ", test[:5], "\n\n", "Length of test data is: ", len(test))
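The same shuffle-then-slice idea can be adapted to the 70/20/10 split the question asks for. Below is a minimal sketch that slices with `iloc` rather than passing the DataFrame to `np.split`. The `data` frame, its `path` and `label` columns, and the file names are hypothetical placeholders. Note that this split is purely random and not stratified, so small classes may end up unevenly represented.

```python
import pandas as pd

# Hypothetical stand-in for `data`: one row per image, with a class label.
data = pd.DataFrame({
    "path": [f"img_{i}.png" for i in range(300)],
    "label": [i % 30 for i in range(300)],
})

# Shuffle once, with a fixed seed so the split is reproducible.
shuffled = data.sample(frac=1, random_state=42).reset_index(drop=True)

n = len(shuffled)
train_end = round(0.7 * n)  # rows [0, train_end) go to train
val_end = round(0.9 * n)    # rows [train_end, val_end) go to validation

train = shuffled.iloc[:train_end]
val = shuffled.iloc[train_end:val_end]
test = shuffled.iloc[val_end:]  # the remaining 10%

print(len(train), len(val), len(test))
```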
mpriya