
Similar: Pipeline doesn't work with Label Encoder

I'd like to have an object that handles label encoding (in my case with a LabelEncoder), transformation, and estimation. It is important to me that all these functions can be executed through only one object.

I've tried using a pipeline this way:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

# mock training dataset
X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

le = LabelEncoder()
ss = StandardScaler()
clf = MyClassifier()  # my own classifier (definition not shown)
pl = Pipeline([('encoder', le),
               ('scaler', ss),
               ('clf', clf)])
pl.fit(X, y)

Which gives:

File "sklearn/pipeline.py", line 581, in _fit_transform_one
    res = transformer.fit_transform(X, y, **fit_params)
TypeError: fit_transform() takes exactly 2 arguments (3 given)

Clarifications:

  • X and y are my training dataset, X being the values and y the targeted labels.

  • X is a numpy.ndarray of shape (n_samples, n_features) and of type float, with values ranging from 0 to 1.

  • y is a numpy.ndarray of shape (n_samples,) and of type string.

  • I expect LabelEncoder to encode y, not X.

  • I need y only for MyClassifier, and I need it encoded to integers for MyClassifier to work.

After some thought, and after hitting the error above, I realize it was naive to think that Pipeline could handle this. I figured out that Pipeline can handle my transformation and classifier together just fine; it is the label encoding step that fails.
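To make the failure concrete: LabelEncoder.fit_transform takes a single array (the labels), while Pipeline calls fit_transform(X, y) on every intermediate step, hence the TypeError:

le = LabelEncoder()
y_enc = le.fit_transform(y)   # works on its own: encodes the labels to integers 0..2
# le.fit_transform(X, y)      # what Pipeline does internally -> TypeError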

What is the correct way to achieve what I want? By correct I mean something that allows reusability and some consistency with sklearn. Is there a class in the sklearn library that does what I want?

I'm pretty surprised I haven't found an answer browsing the web because I feel like what I'm doing is nothing uncommon. I might be missing something here.

021

  • What is it that you want to encode, X or y? Likely y, but please confirm. Notice however that you are passing both to LabelEncoder.fit_transform(), X as the first value and y as the second. LabelEncoder.fit_transform() accepts [only one](http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.LabelEncoder.html#sklearn.preprocessing.LabelEncoder.fit_transform) input array, so it is not clear what happens to `y`, hence the error. Also, what type of data are you feeding as input? Is it a numpy array or a pandas dataframe? If you give a mock version of X and y I can write you a solution. – Daneel R. Sep 13 '18 at 10:02
  • Yes, I want to encode y with `LabelEncoder`, not X. X is a `numpy.ndarray` of shape (n_samples, n_features) and of type float; y is a `numpy.ndarray` of shape (n_samples,) and of type string. I need y for `MyClassifier`. `StandardScaler` doesn't need y but accepts and ignores it, operating only on X. `LabelEncoder` doesn't accept 2 parameters, as you stated. I will edit my question to add these clarifications. – 021 Sep 13 '18 at 12:03
  • As for a mock version of X and y, this should do the trick: `X = np.random.rand(1000, 100)` and `y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])` – 021 Sep 13 '18 at 15:10
  • Great, I'll work something out. I can tell you in advance that you will still need to pull LabelEncoder out of the pipeline; no way around it. Could you add the mock input to the question, so others can find it? – Daneel R. Sep 13 '18 at 15:33
  • Yes, I need the LabelEncoder out of the Pipeline. I added mock input in the question as well. Thanks. – 021 Sep 13 '18 at 15:52
  • LabelEncoder will be automatically called on `y` when you call `clf.fit()`, so you don't need to worry about it. `y` can have integers or strings as classes; that will be handled correctly by the estimators in scikit-learn. So there is no need to include LabelEncoder in the pipeline to work on `y`. – Vivek Kumar Sep 17 '18 at 13:14

3 Answers


As Vivek Kumar wrote in the comments:

LabelEncoder will be automatically called on y when you call clf.fit(), so you don't need to worry about it. y can have integers or strings as classes; that will be handled correctly by the estimators in scikit-learn. So there is no need to include LabelEncoder in the pipeline to work on y.

So here's a solution to my problem:

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# mock training dataset
X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

ss = StandardScaler()
clf = MyClassifier()  # my own classifier
pl = Pipeline([('scaler', ss),
               ('clf', clf)])
pl.fit(X, y)

The only difference is that pl.predict(X) will now return an array of strings containing the values "label1", "label2" or "label3" (which makes sense, since that's what we fed it).
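For instance (the exact output is illustrative, since the mock data is random):

preds = pl.predict(X[:5])
print(preds)  # e.g. array(['label3', 'label1', 'label1', 'label2', 'label3'], dtype='<U6')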

If needed, to get back the LabelEncoder that sklearn uses internally, you can rebuild it from the fitted pipeline. Note that LabelEncoder() takes no constructor arguments; the fitted classes have to be assigned afterwards:

from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()
le.classes_ = pl.classes_  # Pipeline delegates classes_ to the final estimator

Which gives a copy of the label encoder used by the Pipeline pl.
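The rebuilt encoder then supports the usual round trip (assuming pl has been fitted as above):

codes = le.transform(["label1", "label3"])   # e.g. array([0, 2])
labels = le.inverse_transform(codes)         # back to array(['label1', 'label3'], dtype='<U6')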

021

I believe this isn't possible.

Firstly, all transformers inherit from sklearn.base.TransformerMixin. The fit_transform method takes X and optionally y as arguments, but it returns only X_new. scikit-learn isn't designed with this kind of target transformation in mind.

Secondly, LabelEncoder would fail inside a pipeline because its fit and transform methods accept only one argument, y, not X and y.
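A minimal sketch of that transformer contract (the NoOpTransformer name is just for illustration):

from sklearn.base import BaseEstimator, TransformerMixin

class NoOpTransformer(BaseEstimator, TransformerMixin):
    # fit may accept y, but transform returns only the new X;
    # there is no channel through which a transformed y could flow
    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X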

In the end I wrote a function to do a lookup in an Enum mapping string labels to integer labels. At least then the transformation is in code and trackable with version control.
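A sketch of that kind of lookup (the enum members here are hypothetical):

from enum import IntEnum

class Label(IntEnum):
    # hypothetical mapping, kept in code so it is under version control
    LABEL1 = 0
    LABEL2 = 1
    LABEL3 = 2

def encode_labels(labels):
    # map strings like 'label1' to their integer codes
    return [Label[name.upper()].value for name in labels]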

Tom Phillips
  • Thanks for the answer. That's what I ended up thinking as well. However, Vivek Kumar pointed out something useful in the comments: LabelEncoder would be automatically applied if `y` is not integers. Knowing that, I'll write an answer using this little trick. – 021 Oct 23 '18 at 09:22

I have implemented the categorical encoding with pandas, and as a classifier I have used SGDClassifier, since your code above calls MyClassifier() but never defines it.

import numpy as np
import pandas as pd
# from sklearn.preprocessing import LabelEncoder # No longer used
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import SGDClassifier

X = np.random.randn(1000, 10)

y_initial = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

df = pd.DataFrame({'y':y_initial})
df['y'] = df['y'].astype('category') # categorical dtype; integer codes are available via .cat.codes

ss = StandardScaler()
clf = SGDClassifier()

y = df['y']

pl = Pipeline([('scaler', ss),
               ('clf', clf)])

pl.fit(X, y)

The output is the fit pipeline object:

Pipeline(memory=None,
     steps=[('scaler', StandardScaler(copy=True, with_mean=True, with_std=True)), ('clf', SGDClassifier(alpha=0.0001, average=False, class_weight=None, epsilon=0.1,
       eta0=0.0, fit_intercept=True, l1_ratio=0.15,
       learning_rate='optimal', loss='hinge', max_iter=None, n_iter=None,
       n_jobs=1, penalty='l2', power_t=0.5, random_state=None,
       shuffle=True, tol=None, verbose=0, warm_start=False))])
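If the integer codes themselves are ever needed (roughly what LabelEncoder would have produced), the pandas categorical exposes them directly:

print(df['y'].cat.codes.head())  # integer code (0, 1 or 2) for each row
print(df['y'].cat.categories)    # Index(['label1', 'label2', 'label3'], dtype='object')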
Daneel R.
  • Thank you, but I can see that I did not emphasize enough that I'd like to have all these functions in **one object only**, for design purposes. Here, decoding the predicted values of the Pipeline `pl` cannot be done with the information held by `pl` alone. I'd like to have one object that has the data necessary to handle encoding, transforming, prediction, and decoding. – 021 Sep 14 '18 at 08:38
  • I am not sure this is possible with only one object; you have to write a function that sequentially encodes the targets, passes targets and training input variables to the pipeline, then fits, and another that passes the test input variables to pipeline.predict() and decodes the output. The Pipeline object does not allow what you want by itself. However, I'm thinking: would you consider changing the task from classification to clustering? In that case you could LabelEncode all X's and y's after calling `numpy.hstack((X, y))`, and this could be fed to a pipeline object. – Daneel R. Sep 14 '18 at 08:45
  • Thank you for the feedback, this is useful information. However, I don't see how the process you described in the second part of your comment could work for me. – 021 Sep 14 '18 at 09:35
  • Let me try to pull something together; keep in mind, however, that since your X's are not categorical variables but continuous values, the output of clustering done on `hstack((X, y))` will not be interpretable by humans in any way. Also, StandardScaler will have to be pulled out of the pipeline. – Daneel R. Sep 14 '18 at 09:40
  • Never mind; after testing, it looks like LabelEncoder does not perform column-wise encoding on 2D arrays. You may have to switch to keras; sklearn does not provide a way to do what you want. – Daneel R. Sep 14 '18 at 09:47
  • [Related](https://stackoverflow.com/a/30267328/9649584). You cannot put LabelEncoder in a pipeline. You can, however, create a custom label encoder that accepts two arguments, only processes one, and passes both to the next step of the pipeline. – Daneel R. Sep 14 '18 at 09:57
  • The problem is, as far as I understand `Pipeline`, I could write a custom `LabelEncoder` whose `fit` takes 2 arguments, but `Pipeline` does not provide a way to transform y and pass the transformed y down the pipeline. Only X can be altered, if I'm correct. There's a 'swap' that I would have to do inside the `fit` of the `LabelEncoder` that I cannot do. – 021 Sep 14 '18 at 12:27
  • I have been trying to create a custom class along the lines of the post mentioned above, and also tried to copy and rewrite the `LabelEncoder` without success: I kept getting the error you mentioned. I am now thinking: what if we give the variables to the pipeline in reverse order, thus `(y, X)`, and add a class between the encoder and the classifier whose sole purpose is to return the two arrays in reverse order, thus swapping them? I'll try to implement it. – Daneel R. Sep 14 '18 at 12:35
  • I think we'll hit a wall there, as `transform` only returns one value, X or y, but never both. My plan was to create a class inheriting from `Pipeline` that would take an encoder (like `LabelEncoder`) to encode y, and then override the methods (`fit`, `transform`, `predict`, etc.) to encode y before calling the superclass (`Pipeline`) method. – 021 Sep 14 '18 at 12:41
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/180055/discussion-between-021-and-daniel-r). – 021 Sep 14 '18 at 12:46
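For reference, a minimal sketch of the Pipeline subclass idea from the last comments. The name LabelPipeline is made up, the sketch is untested, and it glosses over sklearn's get_params/clone conventions:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder

class LabelPipeline(Pipeline):
    # Encodes y before fitting and decodes predictions afterwards,
    # so encoding, transforming and predicting live in one object.

    def fit(self, X, y=None, **fit_params):
        self.label_encoder_ = LabelEncoder()
        y_encoded = self.label_encoder_.fit_transform(y)
        return super(LabelPipeline, self).fit(X, y_encoded, **fit_params)

    def predict(self, X):
        y_pred = super(LabelPipeline, self).predict(X)
        return self.label_encoder_.inverse_transform(y_pred)

Used in place of Pipeline in the question's code, pl.predict(X) would then return the original string labels while y is encoded internally.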