I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, so I would like to impute a constant value before vectorizing.
My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
# One text column that contains a missing value
df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
# Fill missing values with a constant, then vectorize the text
imp = SimpleImputer(strategy='constant')
vect = CountVectorizer()
pipe = make_pipeline(imp, vect)
# This call fails with an AttributeError
pipe.fit_transform(df[['text']]).toarray()
However, fit_transform raises an error because SimpleImputer outputs a 2D array and CountVectorizer requires 1D input. Here's the error message:
AttributeError: 'numpy.ndarray' object has no attribute 'lower'
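To make the shape mismatch concrete, here is a small check of what the imputer hands to the next step. Iterating over a 2D array yields one-element arrays rather than strings, which would explain why the vectorizer fails when it tries to call lower. The variable name imputed is only for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# SimpleImputer returns a 2D array with one column
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])
print(imputed.shape)           # 2D: (n_rows, 1)
print(type(imputed[0]))        # each "document" is an ndarray, not a str

# Flattening gives the 1D sequence of strings that CountVectorizer expects
print(imputed.ravel().shape)
```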
QUESTION: How can I modify this Pipeline so that it will work?
NOTE: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied to new data using Pipeline.
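For reference, one direction that might work is sketched below. It is not a confirmed answer. The idea is to add a FunctionTransformer step between the two estimators that flattens the imputer's 2D output with np.ravel. The step name flatten is my own choice, and the sketch assumes a single text column, since flattening several columns would mix their rows together.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

imp = SimpleImputer(strategy='constant')
# Assumption: a stateless reshaping step is acceptable here.
# np.ravel turns the (n_rows, 1) array into a 1D array of length n_rows.
flatten = FunctionTransformer(np.ravel)
vect = CountVectorizer()

pipe = make_pipeline(imp, flatten, vect)
result = pipe.fit_transform(df[['text']]).toarray()
print(result.shape)
print(sorted(vect.vocabulary_))
```

Because the flattening step lives inside the Pipeline, calling pipe.transform on new data with the same single text column should apply the same imputation and reshaping.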