sklearn - Binarize y only as first step in pipeline

Question

High-level

Convert text data to a count matrix and call it X.
Convert integer data to binary and call it y.
Feed data to sklearn LogisticRegression

Primary Question

How can I convert y (not X, not X and y together, just y) to binary as the FIRST step within an sklearn Pipeline?

Example

import pandas as pd

df = pd.DataFrame({'Text': ['i am a text', 'i am also text', 'turn text into counts',
          'binarize me as text please'], 'Integer': [20, 0, 4, 0]},
          columns=['Text', 'Integer'])

Sample df

                         Text  Integer
0                 i am a text       20
1              i am also text        0
2       turn text into counts        4
3  binarize me as text please        0

I know I can do the following with a Pipeline:

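from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = df['Text']
y = df['Integer']

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1,2), stop_words='english')),
    ('lr', LogisticRegression()),
    ])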

Then fit using X, y:

pipeline.fit(X, y)

What I Don't Understand

Since I pass BOTH X and y to pipeline.fit(X, y), how can I specify within the pipeline to first convert y to binary (0, 1) classes?

I realize I can convert y beforehand (see below), but the heart of my question is how to do the preprocessing of y within the Pipeline using sklearn functions.

y = np.where(df['Integer'] >= 1, 1, 0)

Other Notes

I am aware of Binarizer and have tried it on y. It works if I preprocess y in the call to pipeline.fit itself, for example pipeline.fit(X, Binarizer().transform(y.values.reshape((len(y), 1)))[:, 0]). Again, though, my intent here is to learn how to preprocess y in the pipeline (if possible), not within the fit call or beforehand.

>>> from sklearn.preprocessing import Binarizer
>>> Binarizer().fit_transform(y)

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
array([[1, 0, 1, 0]])
>>> b = Binarizer()
>>> b.transform(y)

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
array([[1, 0, 1, 0]])
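
The warnings above appear to come from passing a 1-D array, which Binarizer treats as a single sample (hence the output shape of one row with four columns). A minimal sketch of the reshape the warning suggests, assuming y holds the same four integers as the sample df; the variable names `y_2d` and `y_bin` are illustrative only:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Same target values as the 'Integer' column in the sample df
y = np.array([20, 0, 4, 0])

# Binarizer expects a 2-D array: one row per sample, one column here
y_2d = y.reshape(-1, 1)

# Default threshold is 0.0, so values greater than 0 map to 1 and the rest to 0
y_bin = Binarizer().fit_transform(y_2d).ravel()  # back to 1-D for use as a target

print(y_bin)
```

This still happens outside the Pipeline, so it only addresses the warning, not the main question.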

It is not possible to preprocess `y` in a traditional scikit-learn pipeline. The problem is that `transform` returns only a transformed `X`. Maybe someone could suggest an alternative pipeline for scikit-learn, if one exists. — David Maust, Jan 12 '16 at 07:00
Here is a [previous post](http://stackoverflow.com/questions/18602489/using-a-transformer-estimator-to-transform-the-target-labels-in-sklearn-pipeli) that supports David's comment. What does `Integer` represent in your example? — Kevin, Jan 12 '16 at 13:44
@Kevin That link helps a lot. FWIW, `Integer` in this example represents the quantity of successes. I was converting to `0` or `1` because I'm not interested in the quantity but rather if it has ever had a success or not. — Jarad, Jan 12 '16 at 16:27
Possible duplicate of [Using a transformer (estimator) to transform the target labels in sklearn.pipeline](https://stackoverflow.com/questions/18602489/using-a-transformer-estimator-to-transform-the-target-labels-in-sklearn-pipeli) — Venkatachalam, Jan 20 '19 at 15:42
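
Given the comments above (Pipeline steps only transform `X`), one possible workaround is to wrap the whole pipeline in a small classifier that binarizes `y` inside its own `fit`. The sketch below is an assumption, not something from the thread or part of scikit-learn: `BinarizedTargetClassifier` is a hypothetical name, and the wrapper sits around the Pipeline rather than being a step within it.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import Binarizer


class BinarizedTargetClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper (not part of sklearn): binarize y, then fit `estimator`."""

    def __init__(self, estimator, threshold=0.0):
        self.estimator = estimator
        self.threshold = threshold

    def _binarize(self, y):
        # Binarizer needs a 2-D input, so reshape to one column and flatten back
        y_2d = np.asarray(y).reshape(-1, 1)
        return Binarizer(threshold=self.threshold).fit_transform(y_2d).ravel()

    def fit(self, X, y):
        # Fit a fresh copy of the wrapped estimator on the binarized target
        self.estimator_ = clone(self.estimator).fit(X, self._binarize(y))
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)

    def predict_proba(self, X):
        return self.estimator_.predict_proba(X)

    def score(self, X, y):
        # Score against binarized labels, since predictions are 0/1
        return self.estimator_.score(X, self._binarize(y))


# Same data as the sample df in the question
X = ['i am a text', 'i am also text', 'turn text into counts',
     'binarize me as text please']
y = [20, 0, 4, 0]

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('lr', LogisticRegression()),
])

model = BinarizedTargetClassifier(pipeline).fit(X, y)
predictions = model.predict(X)
```

With this design the raw integer `y` can be passed straight to `fit`, and the threshold becomes an ordinary constructor parameter. External scorers that compare predictions against the raw `y` would still need binarized labels.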
