High-level

  1. Convert text data to a count matrix and call it X.
  2. Convert integer data to binary and call it y.
  3. Feed the data to sklearn LogisticRegression.

Primary Question

How do I convert y (not X, not X and y together, just y) to binary as the FIRST step within an sklearn Pipeline?

Example

import pandas as pd

df = pd.DataFrame({'Text': ['i am a text', 'i am also text', 'turn text into counts',
                            'binarize me as text please'],
                   'Integer': [20, 0, 4, 0]},
                  columns=['Text', 'Integer'])

Sample df

                         Text  Integer
0                 i am a text       20
1              i am also text        0
2       turn text into counts        4
3  binarize me as text please        0

I know I can do the following with a Pipeline:

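from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = df['Text']
y = df['Integer']

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('lr', LogisticRegression()),
])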

Then fit using X, y:

pipeline.fit(X, y)

What I Don't Understand

Since I pass BOTH X and y to pipeline.fit(X, y), how can I specify within the pipeline to first convert y to binary (0, 1) classes?

I realize I can convert y beforehand (see below), but the heart of my question is how to preprocess y within the Pipeline using sklearn functions.

y = np.where(df['Integer'] >= 1, 1, 0)
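
For reference, here is the beforehand conversion as one self-contained, runnable sketch. It only assembles the snippets above; nothing in it preprocesses y inside the Pipeline, which is exactly the part I want to avoid.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

df = pd.DataFrame({'Text': ['i am a text', 'i am also text', 'turn text into counts',
                            'binarize me as text please'],
                   'Integer': [20, 0, 4, 0]},
                  columns=['Text', 'Integer'])

X = df['Text']
# Binarize y OUTSIDE the pipeline: any count >= 1 becomes class 1.
y = np.where(df['Integer'] >= 1, 1, 0)

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('lr', LogisticRegression()),
])

# Only X flows through the 'tfidf' step; y is handed to the steps unchanged.
pipeline.fit(X, y)
```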

Other Notes

I am aware of Binarizer and tried it on y. It works if I preprocess y in the call to pipeline.fit itself, for example pipeline.fit(X, Binarizer().transform(y.reshape((len(y), 1)))[:, 0]). But, again, my intent here is to learn how to preprocess y in the pipeline (if possible), not within the fit call or beforehand. Calling Binarizer on the 1D y directly gives:

>>> from sklearn.preprocessing import Binarizer
>>> Binarizer().fit_transform(y)

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
array([[1, 0, 1, 0]])
>>> b = Binarizer()
>>> b.transform(y)

Warning (from warnings module):
  File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
    DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
array([[1, 0, 1, 0]])
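
The warnings above ask for 2D input. A minimal sketch of that reshape, assuming y is a plain NumPy array holding the 'Integer' column (a pandas Series would need converting to an array first): turn y into a single column, binarize it, then flatten it back to 1D. This still happens outside the Pipeline, so it does not answer my question; it only avoids the deprecation warning.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Assumption: y is a 1D NumPy array with the same values as df['Integer'].
y = np.array([20, 0, 4, 0])

# Binarizer expects shape (n_samples, n_features), so use one column.
y_column = y.reshape(-1, 1)

# Default threshold is 0.0: values > 0 become 1, everything else 0.
y_binary = Binarizer().fit_transform(y_column).ravel()
```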
Jarad
  • It is not possible to preprocess `y` in a traditional scikit-learn pipeline. The problem is that `transform` returns only a transformed `X`. Maybe someone could suggest an alternative pipeline for scikit-learn, if one exists. – David Maust Jan 12 '16 at 07:00
  • To point you to a [previous post](http://stackoverflow.com/questions/18602489/using-a-transformer-estimator-to-transform-the-target-labels-in-sklearn-pipeli) that supports David's comment. What does `Integer` represent in your example? – Kevin Jan 12 '16 at 13:44
  • @Kevin That link helps a lot. FWIW, `Integer` in this example represents the quantity of successes. I was converting to `0` or `1` because I'm not interested in the quantity but rather if it has ever had a success or not. – Jarad Jan 12 '16 at 16:27
  • Possible duplicate of [Using a transformer (estimator) to transform the target labels in sklearn.pipeline](https://stackoverflow.com/questions/18602489/using-a-transformer-estimator-to-transform-the-target-labels-in-sklearn-pipeli) – Venkatachalam Jan 20 '19 at 15:42
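
Following up on the comments: since a Pipeline step cannot change `y`, one possible workaround is to wrap the whole pipeline in a small custom estimator that binarizes `y` inside its own `fit`. The sketch below is an assumption, not something scikit-learn provides; the class name `BinarizedTargetClassifier` and its `threshold` parameter are hypothetical.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin, clone
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline


class BinarizedTargetClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical wrapper: binarize y, then fit the wrapped estimator."""

    def __init__(self, estimator, threshold=0):
        self.estimator = estimator
        self.threshold = threshold

    def fit(self, X, y):
        # Values strictly above the threshold become class 1, the rest class 0.
        y_binary = (np.asarray(y) > self.threshold).astype(int)
        # Fit a fresh copy so the estimator passed in stays unfitted.
        self.estimator_ = clone(self.estimator).fit(X, y_binary)
        self.classes_ = self.estimator_.classes_
        return self

    def predict(self, X):
        return self.estimator_.predict(X)


texts = ['i am a text', 'i am also text', 'turn text into counts',
         'binarize me as text please']
counts = [20, 0, 4, 0]

pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('lr', LogisticRegression()),
])

# The raw integer counts go in; binarization happens inside fit.
model = BinarizedTargetClassifier(pipeline).fit(texts, counts)
predictions = model.predict(texts)
```

The binarization lives in the wrapper rather than in a Pipeline step, so it is repeated on every `fit` call without touching `y` by hand.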
