1

I'm trying to use CountVectorizer() with Pipeline and ColumnTransformer. Because CountVectorizer() produces a sparse matrix, I used FunctionTransformer to return a dense array so that ColumnTransformer can hstack the results correctly when assembling the final matrix.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                    ['b', 'How you been Tom', 'hot coffee', 2],
                    ['c', 'Hi you', 'I want some coffee', 3]],
                   columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('tf', tf_transformer)])

ohe_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)

I get AttributeError: 'numpy.ndarray' object has no attribute 'lower'. I've seen this question and suspect CountVectorizer() is the culprit, but I'm not sure how to solve it (the previous question doesn't use ColumnTransformer). I stumbled upon a DenseTransformer that I wish I could use instead of FunctionTransformer, but unfortunately it is not supported at my company.

DJL
  • `lower` is a method of Python strings. pandas may also use it. A numpy array does not have it. You need to figure out why the object in question is an array. You have to examine the traceback. – hpaulj Apr 09 '22 at 06:40

3 Answers

2

IMO, the first thing to consider is that CountVectorizer() requires 1D input. Your example fails because the imputer returns a 2D NumPy array, so you'll need some custom handling to make it work.
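As a minimal sketch of the 1D requirement (the names `docs_1d` and `docs_2d` are only for illustration; the strings come from the question's first text column), the same documents vectorize fine as a 1D array but fail once they are shaped like a single imputed column:

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical names; object dtype mimics what SimpleImputer returns for text.
docs_1d = np.array(['Hi Tom', 'How you been Tom', 'Hi you'], dtype=object)
docs_2d = docs_1d.reshape(-1, 1)  # shape (3, 1), like one imputed column

# 1D input: each element is a plain string, so vectorization works.
counts = CountVectorizer().fit_transform(docs_1d)
print(counts.shape)

# 2D input: iterating yields 1-element arrays, which have no .lower() method.
error_message = None
try:
    CountVectorizer().fit_transform(docs_2d)
except AttributeError as exc:
    error_message = str(exc)
print(error_message)
```

The second call should reproduce the AttributeError from the question, since each "document" the vectorizer sees is a row array rather than a string.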

You should also consider that when a CountVectorizer() instance (which, again, requires 1D input) is used as a transformer in a ColumnTransformer(), the documentation describes how the transformer's columns should be specified:

columns: str, array-like of str, int, array-like of int, array-like of bool, slice or callable

Indexes the data on its second axis. Integers are interpreted as positional columns, while strings can reference DataFrame columns by name. A scalar string or int should be used where transformer expects X to be a 1d array-like (vector), otherwise a 2d array will be passed to the transformer. [...]
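To illustrate the quoted passage, here is a small sketch in which each text column is selected with a scalar string, so CountVectorizer() receives a 1D column. It leaves out the imputation step on purpose. Setting sparse_threshold=0 is my own addition rather than something from the question; it asks ColumnTransformer to return a dense array. The frame name `text_df` is hypothetical.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical frame holding only the two text columns from the question.
text_df = pd.DataFrame({
    'col_for_countvectorizer_1': ['Hi Tom', 'How you been Tom', 'Hi you'],
    'col_for_countvectorizer_2': ['It is hot', 'hot coffee', 'I want some coffee'],
})

# A scalar string (not a list) makes ColumnTransformer pass a 1D column.
ct = ColumnTransformer(transformers=[
    ('cv1', CountVectorizer(), 'col_for_countvectorizer_1'),
    ('cv2', CountVectorizer(), 'col_for_countvectorizer_2'),
], sparse_threshold=0)  # 0 means the stacked result is always dense

out = ct.fit_transform(text_df)
print(type(out), out.shape)
```

Passing `['col_for_countvectorizer_1']` (a list) instead would hand the vectorizer a 2D frame, which is not what it expects.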

This would be useful in interpreting the snippet I'll post as a possible solution.

import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline
from typing import Callable
from sklearn.base import BaseEstimator, TransformerMixin

# Dataset
df = pd.DataFrame([['a', 'Hi Tom', 'It is hot', 1],
                ['b', 'How you been Tom', 'hot coffee', 2],
                ['c', 'Hi you', 'I want some coffee', 3]],
               columns=['col_for_ohe', 'col_for_countvectorizer_1', 'col_for_countvectorizer_2', 'num_col'])

class DimTransformer(BaseEstimator, TransformerMixin):
    def __init__(self):
        pass
    def fit(self, *_):
        return self
    def transform(self, X, *_):
        return pd.DataFrame(X)

# Use FunctionTransformer to ensure dense matrix
def tf_text(X, vectorizer_tf: Callable):
    X_vect_ = vectorizer_tf.fit_transform(X)
    return X_vect_.toarray()

tf_transformer = FunctionTransformer(tf_text, kw_args={'vectorizer_tf': CountVectorizer()})

# Transformation Pipelines
tf_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')), 
             ('dt', DimTransformer()),
             ('ct', ColumnTransformer([
                 ('tf1', tf_transformer, 0), 
                 ('tf2', tf_transformer, 1)
             ]))    
])

ohe_transformer_pipe = Pipeline(
    steps = [('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
             ('ohe', OneHotEncoder(handle_unknown='ignore', sparse=False))])

transformer = ColumnTransformer(transformers=[
    ('cat_ohe', ohe_transformer_pipe, ['col_for_ohe']),
    ('cat_tf', tf_transformer_pipe, ['col_for_countvectorizer_1', 'col_for_countvectorizer_2'])
], remainder='passthrough')

transformed_df = transformer.fit_transform(df)


Namely, I'm adding a transformer that simply converts the array returned by the SimpleImputer instance into a DataFrame. Then, and most importantly, since it does not seem possible to apply the vectorization to the 2D input that comes out of the previous two steps ('imputer' and 'dt'), I'm adding a further ColumnTransformer that splits the vectorization into two parallel steps (one vectorization per column). Notice that at this point columns are referenced positionally, since the column names may have changed. Of course, this is a custom solution, but it may at least provide some hints.
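One caveat with the `tf_text` wrapper is that it calls `fit_transform` on every invocation, so the vectorizer is refit whenever the pipeline transforms new data. As an alternative sketch (not the approach above, and assuming one pipeline per text column is acceptable), the imputed column can be flattened to 1D and handed to a regular CountVectorizer() step. The helper name `make_text_pipe`, the frames `train` and `new`, and the use of sparse_threshold=0 to get dense output are my assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import FunctionTransformer

def make_text_pipe():
    """Impute one text column, flatten it to 1D, then vectorize it."""
    return Pipeline(steps=[
        ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
        ('flatten', FunctionTransformer(np.ravel)),  # (n, 1) -> (n,)
        ('cv', CountVectorizer()),
    ])

# Hypothetical training frame with the question's two text columns.
train = pd.DataFrame({
    'col_for_countvectorizer_1': ['Hi Tom', 'How you been Tom', 'Hi you'],
    'col_for_countvectorizer_2': ['It is hot', 'hot coffee', 'I want some coffee'],
})

text_ct = ColumnTransformer(transformers=[
    ('tf1', make_text_pipe(), ['col_for_countvectorizer_1']),
    ('tf2', make_text_pipe(), ['col_for_countvectorizer_2']),
], sparse_threshold=0)  # always return a dense array

train_out = text_ct.fit_transform(train)

# New data with a missing value: the fitted vocabularies are reused.
new = pd.DataFrame({
    'col_for_countvectorizer_1': ['Hi Bob', np.nan],
    'col_for_countvectorizer_2': ['cold tea', 'hot tea'],
})
new_out = text_ct.transform(new)
print(train_out.shape, new_out.shape)
```

Because the vectorizers are ordinary pipeline steps here, `transform` keeps the column count learned during `fit`, which is what a downstream model would need.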

Given that you don't actually have missing values, you can see that it actually works by comparing it with the output from:

dt = DimTransformer().fit_transform(df)
ct = ColumnTransformer([
    ('tf1', tf_transformer, 1), 
    ('tf2', tf_transformer, 2)
])
ct.fit_transform(dt)

print(ct.named_transformers_['tf1'].kw_args['vectorizer_tf'].vocabulary_)
print(ct.named_transformers_['tf2'].kw_args['vectorizer_tf'].vocabulary_)

and noticing that the columns from the fourth to the second-to-last of the previous output (namely those produced by 'cat_tf') coincide with the output of this snippet.


Here are a couple of posts with focus on the usage of CountVectorizer in a ColumnTransformer instance, though they did not consider imputing the dataset beforehand.

amiola
0

In CountVectorizer, pass lower_case=False.

  • I think there is no underscore, as in lowercase=False. When I try this, I get a different error, 'TypeError: expected string or bytes-like object' – DJL Apr 09 '22 at 06:42
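A small sketch of why this only moves the error, assuming the same kind of 2D object array that the imputer produces (the name `docs_2d` is hypothetical): with lowercasing disabled the `.lower()` call is skipped, but the tokenizer's regular expression then receives an array instead of a string.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical 2D input, shaped like one imputed text column.
docs_2d = np.array([['Hi Tom'], ['How you been Tom'], ['Hi you']], dtype=object)

caught = None
try:
    # lowercase (no underscore) is the actual parameter name.
    CountVectorizer(lowercase=False).fit_transform(docs_2d)
except TypeError as exc:
    caught = type(exc).__name__
print(caught)
```

So the underlying issue is the input's shape rather than the lowercasing option.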
-3

I think you should really look back over your basics again. Your question tells me you don’t understand the function well enough to implement it effectively. Ask again when you’ve done enough research on your own to not embarrass yourself.

  • 1
    Yes, I am a beginner. But I am trying my best and hopefully will get better over time, by getting help from those who are willing to help. I gathered the courage to post a question despite my ignorance, please don't discourage people who are trying to improve and are still early in their journey – DJL Apr 09 '22 at 06:33