
I am teaching myself Machine Learning and Deep Learning by building projects. To do so, I enter Kaggle competitions such as the Titanic one.

When we load the data, there are two datasets: the train set and the test set. For now I am performing my analysis on the train set only, and every time I create a new feature or make some change, I loop over the two datasets and apply the same operation to both.

Now I am about to impute missing values and perform some preprocessing operations, so I'll use some aggregations on the data, encode the categorical features, and so on. But I wonder whether I should use both the training and test sets to compute the means or encode the features, or just the training set.

As far as I understand, the test set is supposed to measure how well a model performs on data it has never seen, so until now I assumed I should only use the training set to make these decisions.

But sometimes this seems problematic: for example, how do I handle the case where the test set contains categories the training set doesn't have?

Questions

  1. When filling missing values and performing preprocessing operations in Deep Learning or Machine Learning projects, is it better to use both the training set and the test set, or just the training set?

  2. Even if using both were better for Kaggle competitions, what about a production project? Shouldn't we consider the possibility that new data contains unseen categories?

SmileyProd
  • Better to keep the test set away, so as to match real-world scenarios. What imputation technique are you using? – venkata krishnan Jul 26 '19 at 08:30
  • One example: to impute the missing age values, I will compute the mean or median over the relevant group for each passenger whose age is missing. The group depends on which ticket class they bought, the title in their name, etc. So for the test set, should I compute this mean using only the training data? – SmileyProd Jul 26 '19 at 08:36
  • @agupta is right - this is the general golden rule of modeling; both answers in [Should Feature Selection be done before Train-Test Split or after?](https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after) (disclaimer: mine), although in a different context than yours (feature selection), may be helpful. – desertnaut Jul 26 '19 at 08:45
  • You **never** touch the test data when you're training a model. – agupta Jul 26 '19 at 08:47
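As a concrete illustration of the imputation discussed in the comments above, here is a minimal sketch (the values are made up; the `Age` and `Pclass` column names match the Titanic dataset) in which the group statistics are computed from the training set only and then applied to both sets:

```python
import pandas as pd

# Toy stand-ins for the Kaggle Titanic train/test frames (invented values).
train = pd.DataFrame({"Pclass": [1, 1, 2, 2, 3, 3],
                      "Age":    [38.0, None, 30.0, 26.0, 22.0, None]})
test = pd.DataFrame({"Pclass": [1, 2, 3],
                     "Age":    [None, None, 28.0]})

# Compute the per-class median from the TRAINING set only...
age_by_class = train.groupby("Pclass")["Age"].median()

# ...then use those same statistics to fill missing ages in both frames.
for df in (train, test):
    df["Age"] = df["Age"].fillna(df["Pclass"].map(age_by_class))
```

Because the test set's missing ages are filled with training-set medians, no information leaks from the test set into the preprocessing.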

2 Answers


1) You never touch the test data when you're training a model. Test sets are only for checking the accuracy of your predictions.

2) Generally, we hope that the training data covers all possible outcomes (hence the need for a larger data source; Kaggle provides a substantially large dataset, so you don't have to worry about it there). As far as production is concerned, when unseen circumstances arise you tend towards improving your model so that it can tackle these newer cases, which might involve re-training it.
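In scikit-learn terms, the "never touch the test data" rule becomes the fit/transform pattern: every statistic is learned from the training data with `fit` and merely applied to the test data with `transform`. A minimal sketch with invented values:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[10.0]])

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # mean/std learned from train only
X_test_scaled = scaler.transform(X_test)        # the same train statistics are reused
```

Calling `fit` (or `fit_transform`) on the test set, or on the concatenation of both sets, would leak test-set statistics into the pipeline.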

agupta
  • Fully agree (+1); for a more detailed exposition, plus an actual coding example of how things can go wrong when one mixes their training & test sets during data preparation, my 2 answers in [Should Feature Selection be done before Train-Test Split or after?](https://stackoverflow.com/questions/56308116/should-feature-selection-be-done-before-train-test-split-or-after) might be helpful. – desertnaut Jul 26 '19 at 08:49
  • Thank you for the more detailed answer, and @desertnaut for your link. When I said that I perform some changes on the test set, I wasn't clear enough: I meant that if I notice I can extract a feature from the training set, I do the same on the test set. For example, if I see on the training set that the passengers' ages can be grouped into 4 different ranges, I label those ranges, and I do this for both the training and test sets. Am I not supposed to do that? Because I don't see how my models could be tested if the features are not the same for the training and test sets. – SmileyProd Jul 26 '19 at 08:55
  • @SmileyProd this is something different, and of course you should do that - in fact, as you have already suspected, you cannot do otherwise. agupta's and my arguments here say that you should not actively *use* the test set to guide your modeling pipeline. What you describe comes only at the end, after having finalized your model and before feeding your test set to it, and it is of course not only valid but mandatory. – desertnaut Jul 26 '19 at 08:58
  • Perfect, thank you very much, this is what I first thought. But looking at other kernels, I have seen a lot of people who concatenate the dataframes during the whole modeling pipeline and split them again at the end, so I thought that maybe I was missing something. – SmileyProd Jul 26 '19 at 09:02

The training set should be used to build your machine learning models. For the training set, we provide the outcome (also known as the “ground truth”) for each passenger. Your model will be based on “features” like passengers’ gender and class. You can also use feature engineering to create new features.

The test set should be used to see how well your model performs on unseen data. For the test set, we do not provide the ground truth for each passenger. It is your job to predict these outcomes. For each passenger in the test set, use the model you trained to predict whether or not they survived the sinking of the Titanic.

So the crux of the matter is: if you used the test set during data preparation, you would effectively know the data you are supposed to predict ahead of time. You don't do this, because then you would only be comparing results, not predicting them.

agupta
  • So the answer would be to perform the data preprocessing on the test set using only the training set? That would mean that if I fit a OneHotEncoder on the training data for one feature, and that feature has a new category in the test set, the encoder would put only zeros for the rows concerned? For me that's okay, because this is what a test set is supposed to do for a real-world application, but I wonder if I would just lose rank in competitions by losing information. – SmileyProd Jul 26 '19 at 08:41
  • You don't need to one-hot encode variables in the test set, because they don't give you any valuable information. For example, in the Titanic challenge you need to predict whether someone will survive or not, so from the test set you only need that information to compare results. – mld_drp Jul 26 '19 at 08:47
  • But if I fit a OneHotEncoder on the training set, I have to apply the same encoding to the test set so that my model can make predictions, right? Otherwise my two datasets wouldn't have the same number of features, so I don't see how my model would be able to perform any computation on the test set. – SmileyProd Jul 26 '19 at 08:57
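Regarding the unseen-category concern in the comments above: scikit-learn's `OneHotEncoder` handles exactly this case via `handle_unknown="ignore"`, which encodes a category never seen during `fit` as an all-zero row. A toy sketch (the port letters mimic the Titanic `Embarked` column; the values are invented):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

X_train = np.array([["S"], ["C"], ["Q"]])  # categories seen in training
X_test = np.array([["S"], ["X"]])          # "X" never appeared in training

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(X_train)  # the category vocabulary comes from the training set only

encoded = enc.transform(X_test).toarray()
# "S" maps to its usual column; the unseen "X" becomes an all-zero row.
```

Both sets end up with the same number of columns, so the model can still score the test set; the unseen category simply carries no information.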