High-level
- Convert text data to a count matrix and call it X.
- Convert integer data to binary and call it y.
- Feed data to sklearn LogisticRegression.
Primary Question
How to convert y (not X, not X and y together, just y) to binary as the FIRST step within an sklearn Pipeline.
Example
import pandas as pd

df = pd.DataFrame({'Text': ['i am a text', 'i am also text', 'turn text into counts',
                            'binarize me as text please'],
                   'Integer': [20, 0, 4, 0]},
                  columns=['Text', 'Integer'])
Sample df
                         Text  Integer
0                 i am a text       20
1              i am also text        0
2       turn text into counts        4
3  binarize me as text please        0
I know I can do the following with a Pipeline:
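from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

X = df['Text']
y = df['Integer']
pipeline = Pipeline(steps=[
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('lr', LogisticRegression()),
])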
Then fit using X, y:
pipeline.fit(X, y)
What I Don't Understand
Since I pass BOTH X and y to pipeline.fit(X, y), how can I specify within the pipeline to first convert y to binary (0, 1) classes?
I realize I can convert y beforehand (see below), but the heart of my question is how to do the preprocessing of y within the Pipeline using sklearn functions.
y = np.where(df['Integer'] >= 1, 1, 0)
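To make the target of that conversion concrete, here is a minimal sketch of the same threshold applied to the sample values. It uses a plain NumPy array in place of df['Integer']; the names integers and y_binary are only for illustration.

```python
import numpy as np

# Stand-in for df['Integer'] from the example above.
integers = np.array([20, 0, 4, 0])

# Any value >= 1 becomes class 1; everything else becomes class 0.
y_binary = np.where(integers >= 1, 1, 0)
print(y_binary)
```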
Other Notes
I am aware of and tried Binarizer on y, and it would work, for example, if I preprocess y in the pipeline.fit call itself, like pipeline.fit(X, Binarizer().transform(y.reshape((len(y), 1)))[:, 0]), but, again, my intent here is to learn how to preprocess y in the pipeline (if possible) and not within the fit method or beforehand.
>>> from sklearn.preprocessing import Binarizer
>>> Binarizer().fit_transform(y)
Warning (from warnings module):
File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
Warning (from warnings module):
File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
array([[1, 0, 1, 0]])
>>> b = Binarizer()
>>> b.transform(y)
Warning (from warnings module):
File "C:\Python34\lib\site-packages\sklearn\utils\validation.py", line 386
DeprecationWarning)
DeprecationWarning: Passing 1d arrays as data is deprecated in 0.17 and willraise ValueError in 0.19. Reshape your data either using X.reshape(-1, 1) if your data has a single feature or X.reshape(1, -1) if it contains a single sample.
array([[1, 0, 1, 0]])
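The warnings above come from handing Binarizer a 1-D y, which it treats as a single sample (hence the (1, 4) result). For comparison, a minimal sketch of the reshape workaround mentioned earlier: present y as one column, binarize, then take the column back out. This still happens outside the Pipeline, so it does not answer the question; y_raw is an illustrative stand-in for the values of df['Integer'], and fit_transform is used here although Binarizer's fit does nothing.

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Stand-in for the values of df['Integer'] from the example.
y_raw = np.array([20, 0, 4, 0])

# Binarizer expects a 2-D array, so present y as one column: (n_samples, 1).
y_column = y_raw.reshape(-1, 1)

# Default threshold is 0.0: values > 0 become 1, the rest become 0.
# Slicing [:, 0] turns the single column back into a 1-D target.
y_binary = Binarizer().fit_transform(y_column)[:, 0]
print(y_binary)
```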