I may be missing something, but after following for quite a long time the suggestion (of some senior data scientists) to call LabelEncoder().fit
only on the training data and not also on the test data, I have started to wonder why this is really necessary.
Specifically, in SkLearn, if I want to call LabelEncoder().fit
only on the training data, then there are two different scenarios:
1. The test set has some new labels relative to the training set. For example, the training set has only the labels ['USA', 'UK'] while the test set has the labels ['USA', 'UK', 'France']. Then, as has been reported elsewhere (e.g. Getting ValueError: y contains new labels when using scikit learn's LabelEncoder), you get an error if you try to transform the test set with this LabelEncoder(), precisely because it encounters a new label.
2. The test set has the same labels as the training set. For example, both the training and the test set have the labels ['USA', 'UK', 'France']. However, then fitting LabelEncoder() only on the training data is essentially redundant, since the test set has the same known values as the training set.
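
To make the two cases concrete, here is a minimal sketch of what I mean, using the toy labels from the example above:

```python
from sklearn.preprocessing import LabelEncoder

# Case (1): the test set contains a label the training set has never seen
le = LabelEncoder()
le.fit(['USA', 'UK'])  # fit only on the training labels
try:
    le.transform(['USA', 'UK', 'France'])
except ValueError as err:
    print(err)  # prints something like: y contains previously unseen labels: ['France']

# Case (2): both sets contain exactly the same set of labels
le_train = LabelEncoder().fit(['USA', 'UK', 'France'])              # fit on training labels only
le_both = LabelEncoder().fit(['USA', 'UK', 'France', 'USA', 'UK'])  # fit on train + test labels together
print(le_train.classes_)  # ['France' 'UK' 'USA']
print(le_both.classes_)   # ['France' 'UK' 'USA'] -> identical mapping either way
```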
Hence, what is the point of calling LabelEncoder().fit
only on the training data and then LabelEncoder().transform
on both the training and the test data, if in case (1) this throws an error and in case (2) it is redundant?
Let me clarify that the (pretty knowledgeable) senior data scientists whom I have seen call LabelEncoder().fit
only on the training data justified this by saying that the test set should be entirely new to even the simplest model, like an encoder, and should not be mixed into any fitting with the training data. They did not mention anything about production or out-of-vocabulary purposes.