I am learning about sklearn custom transformers and read about the two core ways to create them:
- by setting up a custom class that inherits from BaseEstimator and TransformerMixin, or
- by writing a transformation function and passing it to FunctionTransformer.
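For reference, here is a minimal identity-transform sketch of both patterns (toy code, not my actual implementation):

from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import FunctionTransformer

# pattern 1: custom class with fit/transform
class Identity(BaseEstimator, TransformerMixin):
    def fit(self, X, y=None):
        return self
    def transform(self, X):
        return X

# pattern 2: plain function wrapped in FunctionTransformer
identity_transformer = FunctionTransformer(lambda X: X)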
I wanted to compare these two approaches by implementing a "meta-vectorizer" functionality: a vectorizer that supports either CountVectorizer or TfidfVectorizer and transforms the input data according to the specified vectorizer type.
However, I can't get either of the two to work when passing them to a sklearn.pipeline.Pipeline. In both cases I get the following error message in the fit_transform() step:
ValueError: all the input array dimensions for the concatenation axis must match
exactly, but along dimension 0, the array at index 0 has size 6 and the array
at index 1 has size 1
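While debugging, a quick shape check (a sketch using the sample df defined at the bottom of this post) suggests the two ColumnTransformer branches produce a different number of rows:

# each branch fit separately on the sample frame df defined below
from sklearn.preprocessing import OneHotEncoder
from sklearn.feature_extraction.text import CountVectorizer

OneHotEncoder().fit_transform(df[['Type']]).shape    # (6, 3) - one row per sample
CountVectorizer().fit_transform(df[['Text']]).shape  # (1, 1) - collapses to a single row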
My code for option 1 (using a custom class):
class Vectorizer(BaseEstimator, TransformerMixin):
    def __init__(self, vectorizer: Callable = CountVectorizer(), ngram_range: tuple = (1, 1)) -> None:
        super().__init__()
        self.vectorizer = vectorizer
        self.ngram_range = ngram_range

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        X_vect_ = self.vectorizer.fit_transform(X.copy())
        return X_vect_.toarray()
pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', Vectorizer(), ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {
    'column_transformer__comment_text_vectorizer__vectorizer':
        [CountVectorizer(), TfidfVectorizer()]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
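The custom class itself seems fine in isolation: when I feed it the text column as a 1-D Series (a sketch of my check, using the sample df from below), I get the expected dense matrix back:

# works outside the pipeline: 1-D Series in, dense (n_samples, n_features) array out
vec = Vectorizer(vectorizer=CountVectorizer())
print(vec.fit_transform(df['Text']).shape)  # (6, vocabulary_size) on the sample data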
And my code for option 2 (creating a custom transformer from a function using FunctionTransformer):
def vectorize_text(X, vectorizer: Callable):
    X_vect_ = vectorizer.fit_transform(X)
    return X_vect_.toarray()

vectorizer_transformer = FunctionTransformer(vectorize_text, kw_args={'vectorizer': TfidfVectorizer()})
pipe = Pipeline([
    ('column_transformer', ColumnTransformer([
        ('lesson_type_category', OneHotEncoder(), ['Type']),
        ('comment_text_vectorizer', vectorizer_transformer, ['Text'])],
        remainder='drop')),
    ('model', LogisticRegression())])

param_dict = {
    'column_transformer__comment_text_vectorizer__kw_args':
        [{'vectorizer': CountVectorizer()}, {'vectorizer': TfidfVectorizer()}]
}

randsearch = GridSearchCV(pipe, param_dict, cv=2, scoring='f1').fit(X_train, y_train)
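To rule out a typo in the search-space key, I checked that kw_args resolves through the usual set_params route (a sketch; kw_args is an ordinary constructor parameter of FunctionTransformer, which is why GridSearchCV can swap it):

# the nested parameter path resolves, so the grid key itself appears valid
pipe.set_params(
    column_transformer__comment_text_vectorizer__kw_args={'vectorizer': CountVectorizer()})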
Imports and sample data:
import pandas as pd
from typing import Callable
import sklearn
from sklearn.preprocessing import OneHotEncoder, FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.model_selection import GridSearchCV
df = pd.DataFrame([
    ['A99', 'hi i love python very much', 'c', 1],
    ['B07', 'which programming language should i learn', 'b', 0],
    ['A12', 'what is the difference between python django flask', 'b', 1],
    ['A21', 'i want to be a programmer one day', 'c', 0],
    ['B11', 'should i learn java or python', 'b', 1],
    ['C01', 'how much can i earn as a programmer with python', 'a', 0]
], columns=['Src', 'Text', 'Type', 'Target'])
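X_train and y_train are not defined above; a plausible split for reproducing this (my assumption: Type and Text as features, Target as the label) is simply:

# assumption: train on the full sample frame
X_train = df[['Type', 'Text']]
y_train = df['Target']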
Notes:
- As recommended in this question, I transformed all sparse matrices to dense arrays after the vectorization, as you can see in both cases: X_vect_.toarray().