I am teaching myself Machine Learning and Deep Learning by building projects. To do so, I enter Kaggle competitions such as the Titanic competition.
When I load the data, there are two datasets: train and test. For now I perform my analysis on the train set only, and every time I create a new feature or make a change, I loop over the two datasets and apply the same operation to each.
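To make the setup concrete, here is a minimal sketch of that loop, using toy in-memory frames in place of the real `train.csv`/`test.csv` files and the `SibSp`/`Parch` columns from the Titanic data (the derived feature names are my own choice):

```python
import pandas as pd

# Toy stand-ins for the Kaggle train/test frames (the real ones come from pd.read_csv).
train = pd.DataFrame({"SibSp": [1, 0], "Parch": [0, 0]})
test = pd.DataFrame({"SibSp": [2, 0], "Parch": [1, 0]})

# One loop applies the identical transformation to both datasets,
# so the train and test features never drift apart.
for df in (train, test):
    df["FamilySize"] = df["SibSp"] + df["Parch"] + 1
    df["IsAlone"] = (df["FamilySize"] == 1).astype(int)
```

This keeps the two datasets structurally in sync, which matters because the fitted model expects the test set to have exactly the same columns as the train set.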
Now I am about to impute missing values and perform some preprocessing, so I will use aggregations on the data, encode the categorical features, and so on. But I wonder whether I should use both the training and test sets to compute the mean or encode the features, or just the training set.
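For illustration, the "training set only" option would look like this for mean imputation (toy `Age` values, assuming one strategy among several possible ones):

```python
import pandas as pd

# Toy frames; None marks a missing value.
train = pd.DataFrame({"Age": [20.0, None, 40.0]})
test = pd.DataFrame({"Age": [None, 50.0]})

# The statistic is computed on the training set only, then applied to
# both frames: the test set never influences the imputation value.
age_mean = train["Age"].mean()
train["Age"] = train["Age"].fillna(age_mean)
test["Age"] = test["Age"].fillna(age_mean)
```

The alternative would be computing the mean on `pd.concat([train, test])`, which is exactly the choice I am unsure about.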
As far as I understand, the test set is supposed to measure how well a model performs on data it has never seen, so until now I assumed I should only use the training set to make these decisions.
But sometimes that can go "wrong": for example, how do I handle the case where the test set contains categories the training set does not?
Questions
When filling missing values and performing preprocessing in Deep Learning or Machine Learning projects, is it better to use both the training set and the test set, or just the training set?
Even if using both were better in Kaggle competitions, what about a production project? Shouldn't we consider that new data may contain categories unseen during training?