I have a dataset where I call `LabelEncoder.fit_transform()` on the train data, but then calling `transform()` on the validation/test data raises an error, because that data contains some labels that were previously unseen (they don't exist in the train data).
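A minimal reproduction of what I am seeing, assuming scikit-learn's `LabelEncoder` and using made-up labels (my real data is different):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up data for illustration: "green" appears only in the test split.
train_labels = ["red", "blue", "red", "blue"]
test_labels = ["blue", "green"]

encoder = LabelEncoder()
train_encoded = encoder.fit_transform(train_labels)  # fits and encodes without problems

error_message = None
try:
    # Fails because "green" was never seen during fit.
    test_encoded = encoder.transform(test_labels)
except ValueError as exc:
    error_message = str(exc)
    print(error_message)
```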
The quickest way around this is to fit on the WHOLE dataset and then transform the train and test sets separately. But that raises two issues:
- Potential data leakage
- Potential errors if we later use a new dataset that also contains previously unseen labels
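To make the workaround and its second issue concrete, here is a rough sketch, again assuming scikit-learn's `LabelEncoder` and made-up labels (`new_labels` is a hypothetical later dataset):

```python
from sklearn.preprocessing import LabelEncoder

# Made-up data for illustration only.
train_labels = ["red", "blue", "red"]
test_labels = ["blue", "green"]

# Workaround: fit on train + test together, then transform each split.
encoder = LabelEncoder()
encoder.fit(train_labels + test_labels)
train_encoded = encoder.transform(train_labels)
test_encoded = encoder.transform(test_labels)

# A later dataset with a label seen in neither split still fails.
new_labels = ["yellow"]
try:
    encoder.transform(new_labels)
    failed_on_new_data = False
except ValueError:
    failed_on_new_data = True
```

So the workaround makes the train/test error go away, but the encoder has seen the test labels, and any new label still breaks `transform()`.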
How do you handle this in real problems? Is it really wrong to fit on the whole dataset, or is that acceptable?