Similar: Pipeline doesn't work with LabelEncoder
I'd like to have a single object that handles label encoding (in my case with a LabelEncoder), transformation and estimation. It is important to me that all these functions can be executed through only one object.
I've tried using a pipeline this way:
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder
# mock training dataset
X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])
le = LabelEncoder()
ss = StandardScaler()
clf = MyClassifier()  # my custom classifier, defined elsewhere

pl = Pipeline([('encoder', le),
               ('scaler', ss),
               ('clf', clf)])
pl.fit(X, y)
Which gives:
File "sklearn/pipeline.py", line 581, in _fit_transform_one
res = transformer.fit_transform(X, y, **fit_params)
TypeError: fit_transform() takes exactly 2 arguments (3 given)
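As I understand it, the mismatch is that Pipeline calls fit_transform(X, y) on every intermediate step, while LabelEncoder.fit_transform accepts only a single array (the targets), hence the "3 given" in the TypeError. A minimal sketch of what LabelEncoder actually expects:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
y = np.array(["label1", "label2", "label3", "label1"])

# LabelEncoder operates on the target array alone, not on (X, y):
encoded = le.fit_transform(y)
print(encoded)      # -> [0 1 2 0]
print(le.classes_)  # -> ['label1' 'label2' 'label3']

# Pipeline, however, calls fit_transform(X, y) on each step,
# which LabelEncoder's one-argument signature cannot accept.
```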
Clarifications:

- X and y are my training dataset, X being the values and y the targeted labels.
- X is a numpy.ndarray of shape (n_sample, n_features) and of type float, values ranging from 0 to 1.
- y is a numpy.ndarray of shape (n_sample,) and of type string.
- I expect LabelEncoder to encode y, not X.
- I need y only for MyClassifier, and I need it encoded to integers for MyClassifier to work.
After some thought and facing the error above, I feel like it was naive to think that Pipeline could handle it. I figured out that Pipeline could very well handle my transformation and classifier together, but it is the label encoding part that fails.
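For reference, the closest I got is encoding y outside the pipeline and keeping only the scaler and classifier inside it (a sketch; LogisticRegression is a stand-in for my MyClassifier here). It runs, but the encoder is no longer bundled with the pipeline object, which defeats my goal:

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression  # stand-in for MyClassifier

X = np.random.rand(1000, 100)
y = np.concatenate([["label1"] * 300, ["label2"] * 300, ["label3"] * 400])

le = LabelEncoder()
y_encoded = le.fit_transform(y)  # labels encoded outside the pipeline

pl = Pipeline([('scaler', StandardScaler()),
               ('clf', LogisticRegression(max_iter=1000))])
pl.fit(X, y_encoded)  # works, but le is a separate object from pl
```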
What is the correct way to achieve what I want? By correct I mean something that allows reusability and some consistency with sklearn. Is there a class in the sklearn library that does what I want?
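The best idea I've had so far is a small wrapper estimator that encodes the labels internally (EncodingClassifier is a hypothetical name of mine, and the inner pipeline/classifier is just an example), but I don't know whether this is the idiomatic sklearn way:

```python
import numpy as np
from sklearn.base import BaseEstimator
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.linear_model import LogisticRegression

class EncodingClassifier(BaseEstimator):
    """Hypothetical wrapper: encodes string labels before delegating to an
    inner estimator, and decodes predictions back to the original labels."""
    def __init__(self, estimator):
        self.estimator = estimator

    def fit(self, X, y):
        self.le_ = LabelEncoder()
        self.estimator.fit(X, self.le_.fit_transform(y))
        return self

    def predict(self, X):
        return self.le_.inverse_transform(self.estimator.predict(X))

# Usage sketch: one object holds encoding, scaling and classification.
X = np.random.rand(200, 20)
y = np.array(["label1"] * 100 + ["label2"] * 100)
model = EncodingClassifier(Pipeline([('scaler', StandardScaler()),
                                     ('clf', LogisticRegression())]))
model.fit(X, y)
preds = model.predict(X)  # string labels again
```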
I'm pretty surprised I haven't found an answer browsing the web, because I feel like what I'm trying to do is not uncommon. I might be missing something here.