I have a classification problem with 20 classes. I have designed a neural network and am using categorical_crossentropy as the loss.

When using categorical cross entropy, the output labels must be one-hot encoded.

So when I one-hot encoded the output labels, each label was encoded as a row of a matrix, while with the label encoder I got the same encoding as a flat array. Below is the snippet of one hot encoding:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

oht = OneHotEncoder()  # expects 2-D input, returns a matrix
y_train_oht = oht.fit_transform(np.array(y_train).reshape(-1, 1))

And below is the snippet of label encoding:

from sklearn.preprocessing import LabelEncoder
from tensorflow.keras.utils import to_categorical

le = LabelEncoder()  # expects 1-D input, returns an array of integer codes
y_train_le = le.fit_transform(y_train)
y_train_le_cat = to_categorical(y_train_le)  # back to a one-hot matrix

[screenshot: sample output of one hot encoding]

[screenshot: sample output of label encoding]
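
Here's a toy example (made-up labels, not my actual data) showing the shape difference:

import numpy as np
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from tensorflow.keras.utils import to_categorical

y_toy = ['cat', 'dog', 'fish', 'dog']

# OneHotEncoder: 2-D input in, matrix out
oht = OneHotEncoder()
print(oht.fit_transform(np.array(y_toy).reshape(-1, 1)).toarray().shape)  # (4, 3) -- a matrix

# LabelEncoder: 1-D input in, 1-D array of integer codes out
le = LabelEncoder()
y_le = le.fit_transform(y_toy)
print(y_le.shape)                  # (4,) -- an array
print(to_categorical(y_le).shape)  # (4, 3) -- one-hot again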

I find that one-hot encoding gives a matrix while label encoding gives an array. Can someone please explain: if one-hot encoding does the same job, why do we have a label encoder at all? What kind of optimization does the label encoder bring?

And if using the label encoder happens to be more optimal, why don't we use the label encoder to encode categorical input data instead of one-hot encoding?

1 Answer

Label encoding imposes an artificial order: if you label-encode your pet target as 'Dog': 0, 'Cat': 1, 'Turtle': 2, 'Golden Fish': 3, then you get the awkward situation where 'Dog' < 'Cat' and 'Turtle' is the average of 'Cat' and 'Golden Fish'.
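
To see this concretely (a quick sketch; note that scikit-learn's LabelEncoder actually assigns codes alphabetically, so the exact mapping differs from the made-up one above):

from sklearn.preprocessing import LabelEncoder

pets = ['Dog', 'Cat', 'Turtle', 'Golden Fish']
le = LabelEncoder()
codes = le.fit_transform(pets)
print(list(zip(pets, codes.tolist())))
# [('Dog', 1), ('Cat', 0), ('Turtle', 3), ('Golden Fish', 2)]
# The integers now carry an arbitrary order the original categories never had.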

In the case of predictor features (not the target), this is a problem, since your Random Forest may end up learning something like "if it is less than 'Turtle', then ...".

Also, you may have categories in the test set (or, even worse, in new data during deployment) that were not present in training, and the transformer doesn't know what to do with them, so it throws an error. Whether this matters depends on the particular problem and the particular feature you are encoding, and obviously it does not apply to the target variable.
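
For instance, a fitted LabelEncoder raises on an unseen category (a minimal sketch; a default OneHotEncoder behaves the same way on features):

from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
le.fit(['Dog', 'Cat', 'Turtle'])
le.transform(['Golden Fish'])  # ValueError: y contains previously unseen labels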

With one-hot encoding (using handle_unknown='ignore' in scikit-learn; the default is to raise an error), if a category absent from the training data shows up at prediction time, it just gets encoded as 0 in each of the encoded features (the new columns representing each category), so you don't get an error. Your model still has the other features to make a reasonable guess.
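
A minimal sketch:

import numpy as np
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(handle_unknown='ignore')
ohe.fit(np.array(['Dog', 'Cat', 'Turtle']).reshape(-1, 1))
print(ohe.transform(np.array([['Golden Fish']])).toarray())
# [[0. 0. 0.]] -- the unseen category becomes all zeros instead of raising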

As a general rule, you want to use label encoding for target variables and OHE for predictor features, as in the sketch below. Note that in general you don't care about artificial order in the target, since the prediction is usually categorical as well (a forest will choose a number, not a range of numbers; a network will have one activation unit per category, ...).
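
A quick sketch of that rule with scikit-learn (toy data; the column and class names are made up):

import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import LabelEncoder, OneHotEncoder

# Toy data: one categorical predictor and a categorical target
X = pd.DataFrame({'pet': ['Dog', 'Cat', 'Turtle', 'Golden Fish', 'Cat', 'Dog']})
y = ['indoor', 'indoor', 'outdoor', 'indoor', 'indoor', 'outdoor']

# OHE for the predictor feature...
pre = ColumnTransformer([('ohe', OneHotEncoder(handle_unknown='ignore'), ['pet'])])
clf = make_pipeline(pre, RandomForestClassifier(random_state=0))

# ...and label encoding for the target
le = LabelEncoder()
clf.fit(X, le.fit_transform(y))

# Predictions come back as integer codes; invert them to recover the class names
print(le.inverse_transform(clf.predict(X)))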

I don't think optimization should be part of the discussion here, since the two are used in different scenarios demanding different outputs: it's surely more efficient to use the OHE transformer directly than to hack the same result out of label encoding plus some pandas trickery.

Here there are useful comments about the different scenarios (type of model, type of data) and some issues related to efficiency.

Here there's an example of why label encoding is bad practice for input features.

And let's not forget that the goal of the model is to make predictions, so in the end what matters is not just the output of <transformer>.fit_transform, but also the fitted transformer itself, which is going to be applied to new observations. OHE deals with new cases differently than a label encoder (e.g. when the value of a feature in an observation was not present in the training set). That is, in my opinion, reason enough to have different methods, even when they act similarly enough that, for some inputs, you can force them to give similar outputs.
