
In my data set, I have a categorical feature called product.

Let's say in the training set its values are in {"apple", "banana", "durian", "orange", ...}. On the other hand, in the test set the values can be {"banana", "orange", "pineapple"}. Some values (e.g., "pineapple") do not appear in the training set at all.

I know that if we had all possible values in advance, we could create a LabelEncoder and fit it with every value the feature can take. But in this case, I cannot guarantee that the training set covers all the values that will appear in the test set (i.e., new products may appear later).

This is a big concern for me because I'm afraid that with Label Encoding, the training set might be mapped as {"apple": 1, "banana": 2, "durian": 3, "orange": 4, ... (thousands more)}, but when the encoder is fitted on the test set we get {"banana": 1, "orange": 2, "pineapple": 3}.
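
For example, a minimal sketch with scikit-learn's LabelEncoder (it numbers from 0 rather than 1, but the mismatch is the same):

from sklearn.preprocessing import LabelEncoder

train = ["apple", "banana", "durian", "orange"]
test = ["banana", "orange", "pineapple"]

enc = LabelEncoder().fit(train)
print(dict(zip(enc.classes_.tolist(), enc.transform(enc.classes_).tolist())))
# {'apple': 0, 'banana': 1, 'durian': 2, 'orange': 3}

# enc.transform(["pineapple"]) raises ValueError (unseen label), and refitting
# on the test set yields a different, incompatible mapping:
enc = LabelEncoder().fit(test)
print(dict(zip(enc.classes_.tolist(), enc.transform(enc.classes_).tolist())))
# {'banana': 0, 'orange': 1, 'pineapple': 2}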

My questions are:

    1. Does it have a negative impact on the classification model? For example, if apple becomes an important value of the product feature, as far as I know the model will pay more attention to 1 (the numeric value of apple). Isn't that misleading when 1 means banana in the test set?
    2. Is there any way to deal with this kind of label-encoding problem, where the training and test sets contain different values?

I found some relevant links like this one, but it's not exactly my problem.

Update: Please note that product can have thousands of values, which is why I use a Label Encoder here rather than One-Hot Encoding.

Chau Pham

2 Answers


You have to use one-hot encoding when feeding categorical variables into ML models. Otherwise the model will treat them as ordered, apple < banana < durian < orange, which is not actually the case.

For unknown values that come up in the test dataset, all the columns for that variable will be zero, which lets the model understand that the value was not seen during training.

X= [["apple"], ["banana"], ["durian"], ["orange"]]
from sklearn.preprocessing import OneHotEncoder
enc = OneHotEncoder(handle_unknown='ignore')
enc.fit(X)

enc.categories_

categories:

[array(['apple', 'banana', 'durian', 'orange'], dtype=object)]

On the test data:

enc.transform([["banana"], ["orange"], ["pineapple"]]).toarray()

output:

array([[0., 1., 0., 0.],
       [0., 0., 0., 1.],
       [0., 0., 0., 0.]])
Venkatachalam
  • Thanks, I got your idea. In my case, the `product` feature can have **thousands of values**, so I think One-Hot Encoding will create too many columns, which in turn is difficult for the model to deal with. In some classification models like a **Decision Tree**, when we transform categories to numbers as the Label Encoder does, the model can still make a split, which means it's still useful. Am I right? – Chau Pham Dec 12 '18 at 03:53
  • It could handle that. A better approach could be to pick the useful categories after creating the one-hot vectors using sklearn.feature_selection, then use those in your final model (see the sketch below). – Venkatachalam Dec 12 '18 at 04:01
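
A minimal sketch of that suggestion (the toy data and k here are made up; in practice k would be tuned):

from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import OneHotEncoder

# Toy data; in practice `product` would have thousands of distinct values
X_train = [["apple"], ["banana"], ["durian"], ["orange"]]
y_train = [0, 1, 0, 1]

X_onehot = OneHotEncoder(handle_unknown='ignore').fit_transform(X_train)

# Keep only the k most informative one-hot columns (k=2 is arbitrary here)
selector = SelectKBest(chi2, k=2)
X_reduced = selector.fit_transform(X_onehot, y_train)
print(X_reduced.shape)  # (4, 2)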

If I were in your position, I would build a dictionary from the training data and use the same dictionary for the test data. There may be cases where the test data has a value/word that the training data never encountered; I would map those to a special index, an unknown token ("UNK"). My dictionary would therefore be: {"UNK": 0, "apple": 1, "banana": 2, "durian": 3, "orange": 4}

Then for the test data {"banana", "orange", "pineapple"}, I would get {2, 4, 0}.
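
A minimal sketch of this approach (plain Python, no sklearn needed):

# Build the vocabulary from the training data, reserving 0 for unseen values
train_values = ["apple", "banana", "durian", "orange"]
vocab = {"UNK": 0}
for v in train_values:
    vocab.setdefault(v, len(vocab))
# vocab == {'UNK': 0, 'apple': 1, 'banana': 2, 'durian': 3, 'orange': 4}

# Unseen test values fall back to the UNK index
test_values = ["banana", "orange", "pineapple"]
print([vocab.get(v, vocab["UNK"]) for v in test_values])  # [2, 4, 0]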

I hope that will be useful.

  • Thank you for the useful advice, I will try it. It sounds reasonable :). How about the first question: do you think using different label encoders will confuse the classification model? – Chau Pham Dec 12 '18 at 03:41
  • I think yes, because the encoding done in training and the encoding done in testing use different dictionaries, so the model will predict wrong values/labels. Another thing you can do is sort your training dictionary by word frequency and then limit it to a reasonable size, say by cutting the words that appear fewer than 3 times (see the sketch below). This will improve your classification accuracy, because infrequent words cause data sparsity. – C M Khaled Saifullah Dec 12 '18 at 03:47
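
A minimal sketch of that frequency cutoff (the data and the threshold of 3 are made up for illustration):

from collections import Counter

# Hypothetical training column; "durian" appears only once
train_values = ["apple"] * 5 + ["banana"] * 4 + ["orange"] * 3 + ["durian"]

# Keep only values that appear at least 3 times; everything else maps to UNK
counts = Counter(train_values)
vocab = {"UNK": 0}
for v, c in counts.most_common():
    if c >= 3:
        vocab[v] = len(vocab)
print(vocab)  # {'UNK': 0, 'apple': 1, 'banana': 2, 'orange': 3}

# Rare training values and unseen test values both become UNK
print([vocab.get(v, vocab["UNK"]) for v in ["durian", "banana"]])  # [0, 2]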