
Here the difference between the test, train and validation sets is described. In most documentation on training neural networks, I find that these three sets are used; however, they are often predefined.

I have a relatively small data set (906 3D images in total, with a balanced class distribution). I'm using the sklearn.model_selection.train_test_split function to split the data into train and test sets, and I'm using X_test and y_test as validation data in my model.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=1)
...
history = AD_model.fit(
    X_train, 
    y_train, 
    batch_size=batch_size,
    epochs=100,
    verbose=1,
    validation_data=(X_test, y_test))

After training, I evaluate the model on the test set:

test_loss, test_acc = AD_model.evaluate(X_test, y_test, verbose=2)

I've seen other people approach it this way too, but since the model has already seen this data, I'm not sure what the consequences of this approach are. Can someone tell me what the consequences are of using the same set for validation and testing? And since I already have a small data set (with overfitting as a result), is it necessary to split the data into 3 sets?

Mira
    The consequence is that your accuracy will be higher than it would be with "unknown" data, because the model will "recognize" the data and correctly categorize it with a larger probability. – nostradamus Jan 09 '20 at 10:32

3 Answers


You can use train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))]). It produces a 60%, 20%, 20% split for the training, validation and test sets.
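For reference, a minimal, self-contained sketch of that split (assuming your samples are in a pandas DataFrame df and the row order carries no meaning; the DataFrame below is just placeholder data for illustration):

import numpy as np
import pandas as pd

# Placeholder data standing in for your own DataFrame
df = pd.DataFrame(np.random.rand(906, 4))

# Shuffle all rows, then cut at the 60% and 80% marks
train, validate, test = np.split(
    df.sample(frac=1, random_state=1),
    [int(.6 * len(df)), int(.8 * len(df))])

print(len(train), len(validate), len(test))  # roughly 60% / 20% / 20% of the rows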

Hope it's helpful. Thank you for reading!

Radha Manohar

A validation set can be used for the following:

  • Monitor the performance of your model during training on data that was not part of the training set. This helps you verify that the model is training correctly and not over-fitting.
  • Select the hyper-parameters that give you the best performance.
  • Select the best snapshot/weights or the stopping epoch based on the validation metrics (a minimal sketch follows below).
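As a concrete illustration of the last two points, here is a minimal Keras sketch. It assumes a compiled model named AD_model (as in the question) and a separate validation split X_val/y_val that is distinct from the test set; those names are placeholders for your own data:

from tensorflow.keras.callbacks import EarlyStopping, ModelCheckpoint

# Keep the weights of the epoch with the best validation loss,
# and stop training when validation loss has not improved for 10 epochs.
callbacks = [
    ModelCheckpoint("best_model.h5", monitor="val_loss", save_best_only=True),
    EarlyStopping(monitor="val_loss", patience=10, restore_best_weights=True),
]

history = AD_model.fit(
    X_train, y_train,
    validation_data=(X_val, y_val),  # held-out validation split, not the test set
    epochs=100,
    callbacks=callbacks)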

Having the same set for both validation and test will prevent you from comparing your model to any other method on the same data in an unbiased way, since the model's hyperparameters (and stopping criteria) were selected to maximize performance on this set. It will also make your results a bit optimistic, since a validation set (on which the model was selected) is probably easier than an unseen test set.

Youssef Emad

This is what I do:

  1. Split the data into 80% train and 20% test sets
  2. With the train data, do 5-fold cross-validation. Note that the train set will also be split again 80%-20% in each fold, but the CV modules do that themselves (e.g. sklearn's cross-validation), so you don't have to split it again manually
  3. After each fold, evaluate the model using the test set
  4. After 5 folds, decide the model's accuracy from the mean and std of the CV scores. You can also add classification reports, confusion matrices, loss and accuracy plots, etc. (a rough sklearn sketch of these steps follows below)
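Here is a rough sklearn sketch of steps 1-4. It uses LogisticRegression as a stand-in for your own model and random placeholder data; substitute your real X, Y and model:

import numpy as np
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.linear_model import LogisticRegression  # stand-in for your own model

# Placeholder data; replace with your own X and Y
X = np.random.rand(906, 20)
Y = np.random.randint(0, 2, size=906)

# 1. Hold out 20% as the final test set (stratified so the classes stay balanced)
X_train, X_test, y_train, y_test = train_test_split(
    X, Y, test_size=0.2, stratify=Y, random_state=1)

# 2-3. 5-fold CV on the training portion; evaluate on the untouched test set each fold
cv_scores, test_scores = [], []
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)
for train_idx, val_idx in skf.split(X_train, y_train):
    model = LogisticRegression(max_iter=1000)
    model.fit(X_train[train_idx], y_train[train_idx])
    cv_scores.append(model.score(X_train[val_idx], y_train[val_idx]))
    test_scores.append(model.score(X_test, y_test))

# 4. Report mean and std of the fold scores
print("CV accuracy:   %.3f +/- %.3f" % (np.mean(cv_scores), np.std(cv_scores)))
print("Test accuracy: %.3f +/- %.3f" % (np.mean(test_scores), np.std(test_scores)))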

The reason for using train, validation and test sets is that within one fold, the model trains itself on the train data, optimizes itself using the validation data, and at the end of training I test the model using the test data.

This is why using a completely separate test set is good for deciding whether the model's accuracy is satisfying enough: the model optimizes itself using the error on the validation data. If you evaluate it using the validation data again, it's not fair, because the model has, in a sense, already seen it.

For your situation, if you can manage to split the data in a stratified way (the same amount of each class in each set), then yes, it's still good to split into 3 sets.
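A short sketch of such a stratified three-way split, using two calls to train_test_split; the 70/15/15 fractions are only an example and X/Y are assumed to be your own data arrays:

from sklearn.model_selection import train_test_split

# First cut: 70% train, 30% temporary pool, keeping class proportions
X_train, X_tmp, y_train, y_tmp = train_test_split(
    X, Y, test_size=0.30, stratify=Y, random_state=1)

# Second cut: split the pool in half into validation and test sets
X_val, X_test, y_val, y_test = train_test_split(
    X_tmp, y_tmp, test_size=0.50, stratify=y_tmp, random_state=1)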

emremrah