
I initially thought that a good rule of thumb for splitting data into training, validation, and test sets is 60-20-20. However, the top answer here seems to suggest an 80:20 split between training and test, and then taking 20% of that 80% for your validation data (which amounts to passing a validation split of 0.2 to Keras's model.fit(), for example). But that is not a 60-20-20 split -- the test set ends up clearly larger than the validation set.

For example, if there are 100 samples in total and 80% is taken for training, that leaves 80 samples for training and the remaining 20% -- 20 samples -- for testing.

If you then take 20% of that 80% for validation, you take 20% of 80 samples, which is 16. Overall this implies a 64%-16%-20% split for training, validation, and testing respectively.

Is this still correct/fine/a good rule of thumb? Or should I instead carve 20% of the total out of the 80% -- i.e., take 25% of the training portion so that 20 samples are allotted to the validation set, giving a true 60-20-20 split?
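To make the arithmetic concrete, here is a small sketch of both schemes using scikit-learn's `train_test_split` (assuming scikit-learn is available; Keras's `validation_split` behaves like the first scheme, though it slices off the end of the array rather than shuffling):

```python
from sklearn.model_selection import train_test_split

X = list(range(100))  # 100 toy samples

# Scheme 1: 80/20 train+val vs. test, then 20% of the 80% for validation
# (roughly what validation_split=0.2 in model.fit() does):
X_trainval, X_test = train_test_split(X, test_size=0.20, random_state=0)
X_train, X_val = train_test_split(X_trainval, test_size=0.20, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 64 16 20

# Scheme 2: exact 60-20-20 -- take 25% of the 80% for validation,
# since 0.25 * 80 = 20:
X_trainval, X_test = train_test_split(X, test_size=0.20, random_state=0)
X_train, X_val = train_test_split(X_trainval, test_size=0.25, random_state=0)
print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```

So the only difference between the two is which fraction of the training portion you hand to validation: 0.20 gives 64-16-20, 0.25 gives 60-20-20.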

Whichever is more appropriate, why is that? Is there a standard, conventional choice for one or the other?

1 Answer


The end goal of everything is to increase model accuracy, and how you split depends on how many instances you have. If your way of splitting gives you better accuracy, you can use it, but it will not make a drastic difference. Mainly it all depends on the type of data you are dealing with and how large it is -- that is, how many instances there are.