
I was trying to create a pipeline with a LabelEncoder to transform categorical values.

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('lencoder', LabelEncoder())
])

num_variable = SimpleImputer(strategy='mean')

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', num_variable, num_columns)
])

model = RandomForestRegressor(n_estimators=100, random_state=0)

final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', model)
])

scores = -1 * cross_val_score(final_pipe, X_train, y, cv=5,
                              scoring='neg_mean_absolute_error')

But this is throwing a TypeError:

TypeError: fit_transform() takes 2 positional arguments but 3 were given

On further reading, I found out that transformers like LabelEncoder are not supposed to be used on features and should only be used on the prediction target.

From the documentation:

class sklearn.preprocessing.LabelEncoder

Encode target labels with value between 0 and n_classes-1.

This transformer should be used to encode target values, i.e. y, and not the input X.

My question is, why can we not use LabelEncoder on feature variables, and are there any other transformers with a condition like this?

VisnuGanth
    An ordinal encoding is not a good choice for a feature, as you are giving it an artificial implied ordering. What is the cardinality of your categorical variable? If it's not too high, one-hot encoding is the most common choice, although it's not great for tree-based models, especially when cardinality is high. Here's an entire package of alternatives: http://contrib.scikit-learn.org/category_encoders/ – Dan Jul 14 '20 at 09:35

3 Answers


LabelEncoder can be used to normalize labels or to transform non-numerical labels into numerical ones. For categorical input features you should use OneHotEncoder instead.

The difference:

from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit_transform([1, 2, 2, 6])
array([0, 1, 1, 2])

from sklearn.preprocessing import OneHotEncoder

enc = OneHotEncoder(handle_unknown='ignore')
enc.fit_transform([[1], [2], [2], [6]]).toarray()
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])
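
Applied to the question's pipeline, the categorical branch would then look something like this (a sketch, reusing the cat_columns and num_columns lists from the question; handle_unknown='ignore' keeps cross-validation from failing on categories unseen in a given fold):

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Impute first, then one-hot encode the categorical columns.
cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', SimpleImputer(strategy='mean'), num_columns)
])
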
Danylo Baibak

LabelEncoder, by design, has to be used on the target variable and not on feature variables. This implies that the signatures of the .fit(), .transform() and .fit_transform() methods of the LabelEncoder class differ from those of transformers meant to be applied to features.

For LabelEncoder-like transformers (i.e. transformers to be applied to the target):

    fit(self, y) | transform(self, y) | fit_transform(self, y)

For transformers to be applied to features:

    fit(self, X, y=None) | transform(self, X) | fit_transform(self, X, y=None)
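
This mismatch is exactly what surfaces in the question's traceback. A minimal sketch that reproduces it (the exact error wording may vary across Python/scikit-learn versions):

from sklearn.preprocessing import LabelEncoder

X = [['a'], ['b'], ['a']]
y = [0, 1, 0]

# A Pipeline/ColumnTransformer calls transformer.fit_transform(X, y),
# i.e. three positional arguments counting self, but
# LabelEncoder.fit_transform(self, y) accepts only two:
LabelEncoder().fit_transform(X, y)
# TypeError: fit_transform() takes 2 positional arguments but 3 were given
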

The same design also holds for LabelBinarizer and MultiLabelBinarizer. I would suggest reading the Transforming the prediction target (y) section of the User Guide.

That said, here are a few considerations describing what happens when you try to use LabelEncoder in a Pipeline or in a ColumnTransformer:

  • Pipelines and ColumnTransformers are about fitting and transforming the data, not the target. They effectively assume the target is already in a state the estimator can use.

  • In this GitHub issue and the ones referenced in it, you can follow the long-standing discussion about enabling pipelines to transform the target, too. This is also summarized in this sklearn FAQ.

  • The specific reason you're getting TypeError: fit_transform() takes 2 positional arguments but 3 were given is the following (seen here from the perspective of a ColumnTransformer): when calling either .fit_transform() or .fit() on the ColumnTransformer instance, the method ._fit_transform() is called in turn on X and y, and it triggers the call of ._fit_transform_one(), which is where the error arises. Indeed, it calls .fit_transform() on the transformer instance (your LabelEncoder), and this is where the different method signature comes into play:

     with _print_elapsed_time(message_clsname, message):
         if hasattr(transformer, "fit_transform"):
             res = transformer.fit_transform(X, y, **fit_params)
         else:
             res = transformer.fit(X, y, **fit_params).transform(X)
    

    Indeed, .fit_transform() is called with (self, X, y) ([...] 3 arguments were given) while only (self, y) is expected ([...] takes 2 positional arguments). Following the code within the Pipeline class, it can be seen that the same thing happens there.

  • As already mentioned, an alternative to label-encoding that can be applied to feature variables (and therefore used in pipelines and column transformers) is the OrdinalEncoder (available since version 0.20); see the sketch after this list. On this point, I would suggest reading Difference between OrdinalEncoder and LabelEncoder.
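
A sketch of the question's pipeline with OrdinalEncoder swapped in for LabelEncoder (assuming the same cat_columns, num_columns, X_train and y as in the question; the handle_unknown/unknown_value arguments require scikit-learn >= 0.24 and keep categories unseen in a fold from breaking cross-validation):

from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# OrdinalEncoder works column-wise on X, so it fits in a ColumnTransformer.
cat_variable = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value',
                               unknown_value=-1))
])

preprocess = ColumnTransformer(transformers=[
    ('categorical', cat_variable, cat_columns),
    ('numerical', SimpleImputer(strategy='mean'), num_columns)
])

final_pipe = Pipeline(steps=[
    ('preprocessor', preprocess),
    ('model', RandomForestRegressor(n_estimators=100, random_state=0))
])

scores = -1 * cross_val_score(final_pipe, X_train, y,
                              cv=5, scoring='neg_mean_absolute_error')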

amiola

You can use OrdinalEncoder for categorical variables.

    Hi! It is usually expected that you provide a more detailed explanation of how to solve the problem, and ideally also include some code, to better help the person asking the question. Could you enhance your answer a bit in this way? – palsch Jan 30 '22 at 22:43