
I've looked at a few tutorials to get started with Keras for deep learning using convolutional neural networks. In the tutorials (and in Keras' official documentation), the MNIST dataset is loaded like so:

from keras.datasets import mnist

(x_train, y_train), (x_test, y_test) = mnist.load_data()

However, no explanation is offered as to why we have two tuples of data. My question is: what are x_train and y_train and how do they differ from their x_test and y_test counterparts?

Kenny Worden
  • I don't know if the content of the subsets is different, but one is for training and the other is for testing. You want to use different data for testing to make sure you aren't overfitting. EDIT: As to why they are separated in this way vs. it all coming together and you just slice it yourself, I don't know. – Elliot Roberts Sep 29 '17 at 18:50
  • Possible duplicate of [What's is the difference between train, validation and test set, in neural networks?](https://stackoverflow.com/questions/2976452/whats-is-the-difference-between-train-validation-and-test-set-in-neural-netwo) – fuglede Sep 29 '17 at 18:52

2 Answers


The training set is a subset of the data set used to train a model.

  • x_train is the training data set.
  • y_train is the set of labels to all the data in x_train.

The test set is a subset of the data set that you use to test your model after the model has gone through initial vetting by the validation set.

  • x_test is the test data set.
  • y_test is the set of labels to all the data in x_test.

The validation set is a subset of the data set (separate from the training set) that you use to adjust hyperparameters.

  • The example you listed doesn't mention the validation set.
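To make the relationship between the three subsets concrete, here is a minimal sketch in plain Python. The data is made up for illustration (MNIST itself has 60,000 training and 10,000 test images of shape 28x28); the split fractions and variable names are illustrative, not what Keras uses internally.

```python
import random

# Toy dataset: 100 (sample, label) pairs standing in for (x, y).
data = [(i, i % 10) for i in range(100)]
random.seed(0)
random.shuffle(data)

# Hold out 20% as the test set, then carve a validation set
# out of the remaining training data.
test = data[:20]
rest = data[20:]
valid = rest[:16]    # used to tune hyperparameters
train = rest[16:]    # used to fit the model's weights

# x_* holds the samples, y_* the labels, aligned by index.
x_train = [x for x, y in train]
y_train = [y for x, y in train]

print(len(train), len(valid), len(test))  # 64 16 20
```

The key point the split illustrates: the model's weights only ever see `train`, hyperparameter choices are judged on `valid`, and `test` is touched once, at the very end, to estimate performance on unseen data.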

I've made a Deep Learning with Keras playlist on YouTube. It contains the basics for getting started with Keras, and a couple of the videos demonstrate how to organize images into train/valid/test sets, as well as how to get Keras to create a validation set for you. Seeing this implementation may help you get a firmer grasp on how these different data sets are used in practice.
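For reference, Keras can create the validation set for you via `model.fit(..., validation_split=0.1)`, which holds out the last fraction of the training arrays before shuffling. The plain-Python sketch below mimics that slicing on toy data; the helper name and the data are invented for illustration.

```python
def validation_split(x, y, fraction=0.1):
    """Hold out the last `fraction` of the samples, the way
    Keras's fit(validation_split=...) slices before shuffling."""
    split = int(len(x) * (1 - fraction))
    return (x[:split], y[:split]), (x[split:], y[split:])

# Toy stand-ins for real image/label arrays.
x = list(range(50))
y = [v % 10 for v in x]

(x_tr, y_tr), (x_val, y_val) = validation_split(x, y, fraction=0.2)
print(len(x_tr), len(x_val))  # 40 10
```

Because the slice is taken from the end, the data should be shuffled beforehand if it has any ordering (e.g. sorted by class).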

blackHoleDetector

The end goal of all machine learning algorithms is to generalize to new data. If you build a model on all the data you have, you have no way to measure how it performs on data it has never seen. To solve this, we normally split the data into three parts: a training set, a development/tuning set, and a test set.

Take the simpler case of splitting into just two parts, train and test. You would first split your data into, say, 60/70/80% train and 40/30/20% test, then apply 10-fold cross-validation and grid search on the training portion, which also helps with tuning. Mind you, up to this point you have only been training and tuning on your training data; you never touch the test data during the tuning phase, not even to look at its distribution. Once you have your final model, you run it on the test data and measure its performance there. That result acts as the performance metric of your model on unknown data.
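The split-then-cross-validate workflow described above can be sketched in plain Python like this. The data, the 80/20 ratio, and the fold bookkeeping are all illustrative; a real project would typically use a library such as scikit-learn for this.

```python
import random

random.seed(42)
data = list(range(100))   # toy samples standing in for real records
random.shuffle(data)

# 80/20 train/test split; the test portion is locked away until the end.
cut = int(len(data) * 0.8)
train, test = data[:cut], data[cut:]

# 10-fold cross-validation indices over the *training* data only.
k = 10
fold_size = len(train) // k
for i in range(k):
    valid_fold = train[i * fold_size:(i + 1) * fold_size]
    train_folds = train[:i * fold_size] + train[(i + 1) * fold_size:]
    # ...fit on train_folds, score on valid_fold, average the k scores...
    assert len(valid_fold) == 8 and len(train_folds) == 72

# Only after model selection do we evaluate once on `test`.
print(len(train), len(test))  # 80 20
```

The loop shows why the test set stays untouched: every fit/score cycle during tuning happens entirely inside `train`.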

prudhvi Indana