How to include SimpleImputer before CountVectorizer in a scikit-learn Pipeline?

Question

I have a pandas DataFrame that includes a column of text, and I would like to vectorize the text using scikit-learn's CountVectorizer. However, the text includes missing values, and so I would like to impute a constant value before vectorizing.

My initial idea was to create a Pipeline of SimpleImputer and CountVectorizer:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

However, fit_transform raises an error because SimpleImputer outputs a 2D array and CountVectorizer requires 1D input. Here's the error message:

AttributeError: 'numpy.ndarray' object has no attribute 'lower'
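
The shape mismatch can be seen by inspecting the imputer's output directly. This is a minimal sketch that rebuilds the `df` from the question; the name `imputed` is introduced here only for illustration. With `strategy='constant'` and no `fill_value`, SimpleImputer fills string (object) data with `'missing_value'`.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# `imputed` is an illustrative name, not part of the original question
imputed = SimpleImputer(strategy='constant').fit_transform(df[['text']])

print(imputed.shape)     # (3, 1): one row per document, a single column
print(type(imputed[0]))  # iterating yields ndarray rows, not str, hence no .lower()
print(imputed.ravel())   # flattened 1D form, which is what CountVectorizer expects
```

Iterating over the 2D array hands CountVectorizer one-element arrays instead of strings, which is why the call to `.lower()` fails.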

QUESTION: How can I modify this Pipeline so that it will work?

NOTE: I'm aware that I can impute missing values in pandas. However, I would like to accomplish all preprocessing in scikit-learn so that the same preprocessing can be applied to new data using Pipeline.

Why not impute the missing values in the original dataframe: `df.fillna("")`? — DYZ, Jul 20 '20 at 21:34
@DYZ As I mentioned at the bottom of my question, I'd like to accomplish all of the preprocessing in scikit-learn so that I can use Pipeline to apply the same preprocessing to new data. — Kevin Markham, Jul 21 '20 at 13:02

score 15 · Accepted Answer · answered Jul 20 '20 at 17:00

The best solution I have found is to insert a custom transformer into the Pipeline that reshapes the output of SimpleImputer from 2D to 1D before it is passed to CountVectorizer.

Here's the complete code:

import pandas as pd
import numpy as np
df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

from sklearn.impute import SimpleImputer
imp = SimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

# CREATE TRANSFORMER
from sklearn.preprocessing import FunctionTransformer
one_dim = FunctionTransformer(np.reshape, kw_args={'newshape':-1})

# INCLUDE TRANSFORMER IN PIPELINE
from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, one_dim, vect)

pipe.fit_transform(df[['text']]).toarray()

It has been proposed on GitHub that CountVectorizer should allow 2D input as long as the second dimension is 1 (meaning: a single column of data). That modification to CountVectorizer would be a great solution to this problem!
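
As a variation on the same idea, `np.ravel` can stand in for `np.reshape` so that no keyword argument is needed. This may be worth considering because newer NumPy releases rename the `newshape` keyword of `np.reshape` to `shape`. The sketch below is an alternative under that assumption, not part of the original answer, and `new_df` is a hypothetical frame of unseen data used to show that the fitted pipeline can be reused.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})

# np.ravel flattens the (n_samples, 1) imputer output to shape (n_samples,)
one_dim = FunctionTransformer(np.ravel)
pipe = make_pipeline(SimpleImputer(strategy='constant'), one_dim, CountVectorizer())

train_counts = pipe.fit_transform(df[['text']]).toarray()

# hypothetical new data, encoded with the vocabulary learned during fit
new_df = pd.DataFrame({'text': ['def ghi', np.nan]})
new_counts = pipe.transform(new_df[['text']]).toarray()

print(sorted(pipe.named_steps['countvectorizer'].vocabulary_))
print(train_counts)
print(new_counts)
```

Because the imputed placeholder `'missing_value'` is a single token, it ends up as its own column in the vocabulary alongside `abc`, `def`, and `ghi`.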

score 7 · Answer 2 · Arash Khodadadi · 2020-07-21T04:03:00.640

One solution would be to create a subclass of SimpleImputer and override its transform() method:

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer


class ModifiedSimpleImputer(SimpleImputer):
    def transform(self, X):
        return super().transform(X).flatten()


df = pd.DataFrame({'text':['abc def', 'abc ghi', np.nan]})

imp = ModifiedSimpleImputer(strategy='constant')

from sklearn.feature_extraction.text import CountVectorizer
vect = CountVectorizer()

from sklearn.pipeline import make_pipeline
pipe = make_pipeline(imp, vect)

pipe.fit_transform(df[['text']]).toarray()

edited Jul 21 '20 at 04:03

answered Jul 21 '20 at 02:14


Or vice versa: `class ModifiedCountVectorizer(CountVectorizer): def fit_transform(self, X, y=None): return super().fit_transform(X.flatten())` – Michael Gardner Jul 21 '20 at 14:37
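
A fuller sketch of the variant suggested in this comment is given below. The comment overrides only `fit_transform`; it is assumed here that `transform` needs the same treatment so that the fitted pipeline can also process new data. `np.asarray(X).ravel()` replaces `X.flatten()` as a further assumption, so that non-array input is handled too, and `new_df` is a hypothetical frame of unseen data.

```python
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import make_pipeline


class ModifiedCountVectorizer(CountVectorizer):
    """CountVectorizer that accepts a single-column 2D input."""

    def fit_transform(self, X, y=None):
        # flatten (n_samples, 1) to (n_samples,) before learning the vocabulary
        return super().fit_transform(np.asarray(X).ravel(), y)

    def transform(self, X):
        # apply the same flattening when encoding new data
        return super().transform(np.asarray(X).ravel())


df = pd.DataFrame({'text': ['abc def', 'abc ghi', np.nan]})
pipe = make_pipeline(SimpleImputer(strategy='constant'), ModifiedCountVectorizer())
train_counts = pipe.fit_transform(df[['text']]).toarray()

# hypothetical new data run through the already fitted pipeline
new_df = pd.DataFrame({'text': ['def ghi', np.nan]})
new_counts = pipe.transform(new_df[['text']]).toarray()
print(train_counts)
print(new_counts)
```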

Venkatachalam · Answer 3 · 2020-07-26T05:34:42.133

I use this one-dimensional wrapper for sklearn transformers when I have one-dimensional data. I think it can be used to wrap SimpleImputer for the one-dimensional data in your case (a pandas Series with string values).

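import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin


# BaseEstimator and TransformerMixin supply get_params/set_params and fit_transform
class OneDWrapper(BaseEstimator, TransformerMixin):
    """One-dimensional wrapper for sklearn transformers"""

    def __init__(self, transformer):
        self.transformer = transformer

    def fit(self, X, y=None):
        # reshape the 1D input into a single column before fitting
        self.transformer.fit(np.array(X).reshape(-1, 1))
        return self

    def transform(self, X, y=None):
        # transform as a single column, then flatten the result back to 1D
        return self.transformer.transform(
            np.array(X).reshape(-1, 1)).ravel()

    def inverse_transform(self, X, y=None):
        # only usable if the wrapped transformer supports inverse_transform
        return self.transformer.inverse_transform(
            np.expand_dims(X, axis=1)).ravel()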

Now, you don't need an additional step in the pipeline.

one_d_imputer = OneDWrapper(SimpleImputer(strategy='constant'))
pipe = make_pipeline(one_d_imputer, vect)
pipe.fit_transform(df['text']).toarray() 
# note we are feeding a pd.Series here!
