
I have a dataset where I call `LabelEncoder.fit_transform` on the training data, but I then can't call `transform()` on the validation data: I get an error because it contains labels that were not seen during fitting (they don't exist in the training data).
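A minimal reproduction of what I mean. The labels here are made up for illustration and are not my real data:

```python
from sklearn.preprocessing import LabelEncoder

# Made-up data: "green" appears only in the validation split.
train_labels = ["red", "blue", "red"]
valid_labels = ["blue", "green"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # fit on train only

try:
    valid_encoded = encoder.transform(valid_labels)
    error_message = None
except ValueError as exc:
    # transform() raises ValueError for labels not seen during fit
    error_message = str(exc)

print(train_encoded)
print(error_message)
```

The `transform()` call on the validation labels is where the error is raised.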

The quickest way around this is to fit on the WHOLE dataset and then transform the train and test sets separately. But that raises two issues:

  1. Potential data leakage
  2. Potential errors if we later use a new dataset that also has previously unseen labels
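A rough sketch of that workaround and of the second issue, again with made-up labels (`new_labels` is a hypothetical dataset that arrives later):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up data for illustration only.
train_labels = ["red", "blue", "red"]
test_labels = ["blue", "green"]
new_labels = ["purple"]  # hypothetical dataset seen after fitting

encoder = LabelEncoder()
encoder.fit(train_labels + test_labels)  # fit on the whole dataset

train_encoded = encoder.transform(train_labels)
test_encoded = encoder.transform(test_labels)  # no error this time

try:
    encoder.transform(new_labels)
    still_fails = False
except ValueError:
    # a label outside the fitted classes still raises
    still_fails = True

print(train_encoded, test_encoded, still_fails)
```

So fitting on everything removes the error for the current train/test split, but the same failure comes back as soon as a label outside the fitted classes appears.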

How do you overcome this issue in real problems? Is it really wrong to fit on the whole dataset, or is that acceptable?

  • Possible duplicate https://stackoverflow.com/questions/21057621/sklearn-labelencoder-with-never-seen-before-values. – Nikhil Kumar Mar 18 '21 at 04:59
  • I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the intro and NOTE in the `machine-learning` [tag info](https://stackoverflow.com/tags/machine-learning/info). – desertnaut Mar 18 '21 at 10:02

0 Answers