Validation set - can it be the average of the training set?

Question

I am training an ANN on some biological experimental data. Briefly, my input dataset (features) consists of gene levels (RNA expression levels) of different samples (cell lines). The dataset contains replicates of the same biological sample, meaning that I have measured the RNA expression levels of the same cell line (or of cell lines that are meant to be the same) two or more times. I have included all measurements (different cell lines, repeated measurements of the same cell line, etc.) as separate samples in the training set in order to increase the flexibility of the ANN, instead of averaging the repeated measurements of each cell line and using only the averages.

I was wondering whether I can use the averages of the repeated measurements of the same cell lines as my validation set. What do you think? It's a regression ANN and the labels are protein structures.

desertnaut · Accepted Answer · 2020-04-22T19:26:46.690

You cannot do that.

The key idea behind validation (and test) sets is that they must consist exclusively of unseen data; and here this is not the case, since the data used for your averages have already been seen during training.
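To make the "unseen data" requirement concrete, here is a minimal sketch of a split that holds out whole cell lines, so that no replicate of a validation cell line is ever seen in training. Everything in it is an assumption for illustration: the data are random numbers, the names (`cell_line`, `X`, `y`) are hypothetical, and the target is a single number per sample rather than your actual labels.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 6 cell lines, 3 replicate measurements each, 5 genes.
n_lines, n_reps, n_genes = 6, 3, 5
cell_line = np.repeat(np.arange(n_lines), n_reps)  # cell-line id of each row
X = rng.normal(size=(n_lines * n_reps, n_genes))   # stand-in expression levels
y = rng.normal(size=n_lines * n_reps)              # stand-in regression target

# Choose whole cell lines for validation, not individual rows.
val_lines = rng.choice(n_lines, size=2, replace=False)
val_mask = np.isin(cell_line, val_lines)

X_train, y_train = X[~val_mask], y[~val_mask]
X_val, y_val = X[val_mask], y[val_mask]

# Check that no cell line appears on both sides of the split.
overlap = set(cell_line[~val_mask].tolist()) & set(cell_line[val_mask].tolist())
print("validation cell lines:", sorted(val_lines.tolist()))
print("cell lines shared between train and validation:", overlap)
```

If you use scikit-learn, `GroupShuffleSplit` and `GroupKFold` in `sklearn.model_selection` do this kind of group-wise splitting for you when you pass the cell-line ids as `groups`.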

There have been lots of horror stories in the past (including research papers!) from people naively thinking that they can include their validation/test sets in their feature selection process, as long as they don't use them for fitting their models. They were hurt badly. For some cases, see my blog post "How NOT to perform feature selection!"; for a simple reproducible example in Python of what can go wrong in such a case (tl;dr: everything), see my own answer in "Should Feature Selection be done before Train-Test Split or after?"

The second key (but often implicit) idea is that your validation/test set must be qualitatively similar to your training data, i.e. theoretically they must both come from the same data-generating probability distribution. And arguably the distribution of your individual samples is not the same as the distribution of their average values.
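A quick simulation illustrates the distribution point. It is only a sketch under a simple assumed model, not a description of your data: each measurement is a per-cell-line "true" level plus independent measurement noise, both Gaussian with unit variance, with 3 replicates per cell line. Under that assumption an individual measurement has variance 1 + 1 = 2, while the average of 3 replicates has variance 1 + 1/3, so the averages are visibly less spread out than the samples the network was trained on.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy model: measurement = true level of the cell line + noise.
n_lines, n_reps = 10_000, 3
true_level = rng.normal(loc=5.0, scale=1.0, size=n_lines)
noise = rng.normal(loc=0.0, scale=1.0, size=(n_lines, n_reps))
replicates = true_level[:, None] + noise  # shape (n_lines, n_reps)

# Spread of individual measurements vs. spread of per-cell-line averages.
individual_var = replicates.var()
averaged_var = replicates.mean(axis=1).var()

print(f"variance of individual measurements: {individual_var:.2f}")
print(f"variance of replicate averages:      {averaged_var:.2f}")
```

The means agree, but the variances do not, which is enough to say that the two sets do not come from the same distribution.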

The second requirement is quite interesting indeed and something that I did not think about. I will read the article and the thread with interest. — PK1617, Apr 23 '20 at 05:44
