I have a dataset of images with three splits (training, validation and test) and want to normalize it to make training easier. To do that, I want to compute the per-channel mean and standard deviation of the RGB values from the available data.
My question is: should I use all three splits when computing these statistics?
My own thinking is that only the training split should be used, since it is assumed to be the only data available for training the model. The model then receives inputs normalized with the training distribution, and any resulting errors can be caught by evaluating on the validation split. If I normalize using statistics from data outside the training split, wouldn't that give the network information beyond what it is supposed to learn from?
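To make the idea concrete, here is a minimal sketch of what I mean, written with NumPy rather than PyTorch to keep it self-contained. The names `channel_stats`, `normalize`, `train_images` and `val_images` are hypothetical, and the random arrays only stand in for real images. The statistics come from the training split alone and are then reused unchanged for the other splits.

```python
import numpy as np

def channel_stats(images):
    """Per-channel mean and std over an iterable of HxWx3 uint8 images,
    with pixel values scaled to [0, 1]."""
    n_pixels = 0
    channel_sum = np.zeros(3, dtype=np.float64)
    channel_sq_sum = np.zeros(3, dtype=np.float64)
    for img in images:
        pixels = np.asarray(img, dtype=np.float64).reshape(-1, 3) / 255.0
        n_pixels += pixels.shape[0]
        channel_sum += pixels.sum(axis=0)
        channel_sq_sum += (pixels ** 2).sum(axis=0)
    mean = channel_sum / n_pixels
    # Population variance via E[x^2] - E[x]^2, clipped to guard against rounding.
    var = np.maximum(channel_sq_sum / n_pixels - mean ** 2, 0.0)
    return mean, np.sqrt(var)

def normalize(img, mean, std):
    """Scale to [0, 1], then standardize each channel."""
    return (np.asarray(img, dtype=np.float64) / 255.0 - mean) / std

# Hypothetical stand-in data: random images instead of a real dataset.
rng = np.random.default_rng(0)
train_images = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(4)]
val_images = [rng.integers(0, 256, size=(8, 8, 3), dtype=np.uint8) for _ in range(2)]

# Statistics are computed from the training split only...
train_mean, train_std = channel_stats(train_images)

# ...and the same values are applied to the validation (and test) split.
val_normalized = [normalize(img, train_mean, train_std) for img in val_images]
```

With this setup the normalized training data has zero mean and unit standard deviation per channel by construction, while the validation data is only approximately standardized, which is the behavior I am asking about.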
Suggestions for other approaches would also help. For example, is it better to just use the standard ImageNet RGB statistics?
transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225))
(Source: PyTorch Torchvision Transforms)