In my data set, I have a categorical feature called product.
Let's say that in the training set its values are {"apple", "banana", "durian", "orange", ...}, while in the test set the values can be {"banana", "orange", "pineapple"}. Some values (e.g., "pineapple") do not appear in the training set.
I know that if we have all possible values in advance, we can create a LabelEncoder and fit it on every value the feature can take. But in this case I cannot ensure that the training set covers all the values in the test set (i.e., new products may appear).
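For reference, this is roughly what I mean by fitting one encoder on the full vocabulary up front (a minimal sketch with scikit-learn; the fruit lists are just placeholders):

```python
from sklearn.preprocessing import LabelEncoder

# If every possible product were known in advance, a single encoder
# fitted on the full vocabulary would give one consistent mapping.
all_products = ["apple", "banana", "durian", "orange", "pineapple"]
le = LabelEncoder().fit(all_products)

# LabelEncoder assigns codes by sorted order, starting at 0,
# so the same value gets the same code in both splits.
print(le.transform(["apple", "banana", "orange"]))  # [0 1 3]
print(le.transform(["banana", "pineapple"]))        # [1 4]
```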
This is a big concern for me, because with label encoding the training set might be mapped as {"apple": 1, "banana": 2, "durian": 3, "orange": 4, ... (thousands more)}, but when the same mapping procedure is applied to the test set we get {"banana": 1, "orange": 2, "pineapple": 3}.
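To make the concern concrete, here is a small sketch of what happens when a fresh encoder is fitted on each split (note that scikit-learn's LabelEncoder actually sorts the classes and starts its codes at 0, but the inconsistency is the same):

```python
from sklearn.preprocessing import LabelEncoder

train = ["apple", "banana", "durian", "orange"]
test = ["banana", "orange", "pineapple"]

# Fitting a fresh encoder on each split produces two different mappings.
le_train = LabelEncoder().fit(train)
le_test = LabelEncoder().fit(test)

# classes_ is sorted, and a value's code is its index in classes_.
train_map = {str(c): i for i, c in enumerate(le_train.classes_)}
test_map = {str(c): i for i, c in enumerate(le_test.classes_)}
print(train_map)  # {'apple': 0, 'banana': 1, 'durian': 2, 'orange': 3}
print(test_map)   # {'banana': 0, 'orange': 1, 'pineapple': 2}

# Reusing the training encoder instead fails outright on unseen values.
try:
    le_train.transform(test)
except ValueError as err:
    print("unseen label:", err)
```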
My questions are:
- Does it have a negative impact on the classification model? For example, if apple becomes an important value in the product feature, as far as I know the model will treat 1 (the numeric value of apple) with more weight. Isn't that misleading when 1 is banana in the test set?
- Is there any way to deal with this kind of label-encoding problem, where the training and test sets contain different values?
I found some relevant links, like this one, but none of them match my problem exactly.
Update: Please note that product can have thousands of distinct values, which is why I use a LabelEncoder here rather than One-Hot Encoding.