
I may be missing something, but after following for quite a long time now the suggestion (of some senior data scientists) to fit LabelEncoder() only to the training data and not also to the test data, I have started to wonder why this is really necessary.

Specifically, in SkLearn, if I want to fit LabelEncoder() only to the training data, there are two different scenarios:

  1. The test set has some labels that do not appear in the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (e.g. Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set with this LabelEncoder(), precisely because it encounters a new label.

  2. The test set has the same labels as the training set. For example, both the training set and the test set have the labels ['USA', 'UK', 'France']. However, then fitting LabelEncoder() only to the training data is essentially redundant, since the test set has the same known values as the training set.
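For reference, case (1) is easy to reproduce with a minimal sketch (labels chosen to match the example above):

```python
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(["USA", "UK"])  # fit on the training labels only

try:
    # "France" appears only in the test set, so transform() fails
    le.transform(["USA", "France"])
except ValueError as err:
    print(err)
```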

Hence, what is the point of fitting LabelEncoder() only to the training data and then using it to transform both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?

Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen fit LabelEncoder() only to the training data justified this by saying that the test set should be entirely new even to the simplest model, like an encoder, and that it should not be mixed into any fitting with the training data. They did not mention anything about production or out-of-vocabulary purposes.

desertnaut
Outcast

2 Answers

4

The main reason to do so is that at inference/production time (not testing) you might encounter labels that you have never seen before (and you won't be able to call fit() even if you wanted to).

In scenario 2, where you are guaranteed to always have the same labels across folds and in production, it is indeed redundant. But are you really guaranteed to see the same labels in production?

In scenario 1 you need to find a way to handle unknown labels. One popular approach is to map every unknown label to a single "unknown" token. In natural language processing this is called the "out of vocabulary" problem, and the above approach is often used.

To do so and still use LabelEncoder(), you can pre-process your data and perform the mapping yourself.
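A minimal sketch of that pre-processing, assuming a hypothetical "<UNK>" token reserved for unseen labels:

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["USA", "UK", "USA"]
test_labels = ["USA", "UK", "France"]  # "France" never seen during training

# Reserve an explicit out-of-vocabulary token alongside the training labels
le = LabelEncoder()
le.fit(train_labels + ["<UNK>"])

# Map anything unseen during training to the "<UNK>" token, then transform
known = set(train_labels)
test_mapped = [lbl if lbl in known else "<UNK>" for lbl in test_labels]
codes = le.transform(test_mapped)
print(codes)  # "France" gets the "<UNK>" code
```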

gidim
  • Thanks for the post which is pretty helpful. Please have a look at the end of my edited post. – Outcast Sep 11 '18 at 16:20
  • Indeed, model fit methods should never have access to test data. I tried to explain why this is the case even in scenarios where you're clearly not leaking information (scenario 2). Either way, the "out of vocabulary" problem is just an example if you want to explore further. – gidim Sep 11 '18 at 16:41
  • Ok, your approach certainly makes sense if you have to deal with an extensive vocabulary where you cannot anticipate every word of it. However, if you only have categorical variables such as days (Monday, Tuesday etc.), then is there any problem in fitting the `LabelEncoder` also to the test data, which are simply days again (Monday, Tuesday), as the training data are? – Outcast Sep 28 '18 at 13:34
  • Sure, there are some use cases where it might be "ok", but really it's a matter of best practice. If you want to build reliable models and not risk leaking test data into training, just don't do it. Also think about the next person reading your code: will it be clear to them that this is a "safe" case? – gidim Sep 28 '18 at 20:30
1

It's hard to guess why the senior data scientists gave you that advice without context, but I can think of at least one reason they may have had in mind.

If you are in the first scenario, where the training set does not contain the full set of labels, then it is often helpful to know this and so the error message is useful information.

Random sampling can often miss rare labels and so taking a fully random sample of all of your data is not always the best way to generate a training set. If France does not appear in your training set, then your algorithm will not be learning from it, so you may want to use a randomisation method that ensures your training set is representative of minority cases. On the other hand, using a different randomisation method may introduce new biases.

Once you have this information, the best approach will depend on your data and the problem to be solved, but there are cases where it is important to have all labels present. A good example would be identifying the presence of a very rare illness: if your training data doesn't include the label indicating that the illness is present, then you had better re-sample.
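One common randomisation method for this is stratified splitting, which preserves label proportions in each half so rare labels appear in both; a sketch with made-up data:

```python
from sklearn.model_selection import train_test_split

# Hypothetical data: "France" is a rare label with only two occurrences
labels = ["USA"] * 8 + ["UK"] * 8 + ["France"] * 2
X = list(range(len(labels)))

# stratify=labels preserves label proportions in each split, so the two
# "France" rows end up one per half instead of possibly both in the test set
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)
print(sorted(set(y_train)), sorted(set(y_test)))
```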

Andrew McDowell