
I have a somewhat large (~2000 images) set of medical images I plan to use to train a CV model (using the EfficientNet architecture) at my workplace. In preparation, I have been reading up on good practices for training on medical images. I have split the dataset by patient to prevent leakage, dividing the data into train:test:val in the ratio 60:20:20. However, I read that k-fold cross-validation is a newer practice than using a fixed validation set, though I was advised against it because k-fold is supposedly far more complicated. What would you recommend in this instance, and are there any other good practices to adopt?

vernal123
  • Can you provide some code? – Benjamin Breton May 24 '22 at 11:35
    SO is for programming questions; design questions should be asked on https://datascience.stackexchange.com/. In short, it first depends on whether you're going to do hyperparameter tuning or train several models and select the best. If yes, you will need a fresh test set. If not, you can either evaluate with a single train/test split or with CV, knowing that CV gives a more reliable estimate of performance but requires more computation. See also [this answer](https://datascience.stackexchange.com/questions/108792/why-is-the-k-fold-cross-validation-needed/108795#108795). – Erwan May 24 '22 at 11:37

1 Answer


Common Practice

A train:test split with cross-validation on the training set is part of the standard workflow in many machine learning libraries. For an example and further details, I recommend the excellent sklearn documentation on cross-validation.
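As a minimal sketch of that workflow combined with your patient-level splitting (synthetic arrays and patient IDs stand in for your real data), sklearn's `GroupShuffleSplit` can hold out a patient-disjoint test set, and `GroupKFold` can then cross-validate on the remaining training portion:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit, GroupKFold

# Synthetic stand-ins: one feature row and one label per image,
# plus the patient ID each image belongs to.
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 16))
y = rng.integers(0, 2, size=2000)
patients = rng.integers(0, 400, size=2000)  # roughly 5 images per patient

# Hold out a patient-level test set (no patient appears on both sides).
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(gss.split(X, y, groups=patients))

# Cross-validate on the training portion, still grouped by patient.
gkf = GroupKFold(n_splits=5)
for tr, val in gkf.split(X[train_idx], y[train_idx], groups=patients[train_idx]):
    # No patient leaks between the fold's train and validation parts.
    assert set(patients[train_idx][tr]).isdisjoint(patients[train_idx][val])
    # ... train on tr, evaluate on val here ...
```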

Implementation

The implementation may be a little trickier but should not be prohibitive: there are many code examples available, assuming you are using TensorFlow or PyTorch (see e.g. this SO question).
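A framework-agnostic sketch of the loop might look like this; `fit_and_score` is a hypothetical placeholder for your TF or PyTorch training-and-evaluation code, and the lists of image paths, labels and patient IDs are assumed to be parallel:

```python
from sklearn.model_selection import GroupKFold

def run_cross_validation(image_paths, labels, patient_ids, fit_and_score, n_splits=5):
    """Average a score over patient-grouped folds.

    fit_and_score(train_set, val_set) -> float is supplied by the caller
    and wraps the actual model training in whatever framework you use.
    """
    scores = []
    gkf = GroupKFold(n_splits=n_splits)
    for train_idx, val_idx in gkf.split(image_paths, labels, groups=patient_ids):
        train_set = [(image_paths[i], labels[i]) for i in train_idx]
        val_set = [(image_paths[i], labels[i]) for i in val_idx]
        scores.append(fit_and_score(train_set, val_set))
    return sum(scores) / len(scores)
```

The indices come from sklearn, so only the inner training step needs framework-specific code.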

Should you be using k-fold cross validation?

Compared to a single validation set, k-fold cross-validation avoids over-fitting hyperparameters to a fixed validation set and makes better use of the available data by utilizing the entire training set across the folds, albeit at greater computational cost. Whether or not this makes a big difference depends on your task. 2000 images does not sound like a lot in computer vision terms, so making good use of the data may be relevant to you, especially if you plan on tuning hyperparameters.
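If you do tune hyperparameters, sklearn's `GridSearchCV` accepts a grouped splitter directly, so the tuning itself respects patient boundaries. A toy example (a simple `LogisticRegression` on synthetic data stands in for EfficientNet; the grouped-CV pattern is the same):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 8))
y = rng.integers(0, 2, size=200)
patients = rng.integers(0, 40, size=200)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid={"C": [0.1, 1.0, 10.0]},
    cv=GroupKFold(n_splits=5),   # folds never split a patient
    scoring="accuracy",
)
search.fit(X, y, groups=patients)  # groups are forwarded to the splitter
print(search.best_params_)
```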

Scriddie