how to featureUnion numerical and text features in python sklearn properly

Question

I'm trying to use FeatureUnion for the first time in an sklearn pipeline to combine numerical features (2 columns) and text features (1 column) for multi-class classification.

from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import Pipeline
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer  # needed for the 'vec' step below

get_text_data = FunctionTransformer(lambda x: x['text'], validate=False)
get_numeric_data = FunctionTransformer(lambda x: x[['num1','num2']], validate=False)

process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', OneVsRestClassifier(LogisticRegression()))
            ]))
         ]
    )

In this code, 'text' is the text column and 'num1' and 'num2' are the two numeric columns.

The error message is

TypeError: All estimators should implement fit and transform. 'Pipeline(memory=None,
 steps=[('selector', FunctionTransformer(accept_sparse=False,
      func=<function <lambda> at 0x7fefa8efd840>, inv_kw_args=None,
      inverse_func=None, kw_args=None, pass_y='deprecated',
      validate=False)), ('clf', OneVsRestClassifier(estimator=LogisticRegression(C=1.0, class_weigh...=None, solver='liblinear', tol=0.0001,
      verbose=0, warm_start=False),
      n_jobs=1))])' (type <class 'sklearn.pipeline.Pipeline'>) doesn't

Any step I missed?

First, your `'clf', OneVsRestClassifier(LogisticRegression()` should be a third step in the pipeline, not combined in the second step with text. Second, please share some sample data and full stack trace of error. Are you calling fit(), or predict() on pipeline? — Vivek Kumar, Dec 11 '17 at 02:09

score 13 · Accepted Answer · answered Dec 11 '17 at 09:49

A FeatureUnion should be used as a step in the pipeline, not around the pipeline. The error you are getting is because you have a Classifier not as the final step - the union tries to call fit and transform on all transformers and a classifier does not have a transform method.

Simply rework to have an outer pipeline with the classifier as the final step:

process_and_join_features = Pipeline([
    ('features', FeatureUnion([
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data)
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer())
            ]))
         ])),
    ('clf', OneVsRestClassifier(LogisticRegression()))
])

Also see here for a good example on the scikit-learn website doing this sort of thing.
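
As a minimal end-to-end sketch of the layout above, assuming a small made-up DataFrame (the rows, the labels, and the helper names select_text and select_numeric are illustrative and not from the question), the following fits the outer pipeline and inspects the combined feature matrix:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data using the column names from the question.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.9, 1.0, 2.0, 2.1],
})
y = ["fruit", "fruit", "vehicle", "vehicle", "weather", "weather"]


def select_text(X):
    # 1-D Series of strings, which is what CountVectorizer expects.
    return X["text"]


def select_numeric(X):
    # 2-D frame holding the two numeric columns.
    return X[["num1", "num2"]]


get_text_data = FunctionTransformer(select_text, validate=False)
get_numeric_data = FunctionTransformer(select_numeric, validate=False)

model = Pipeline([
    # The union only contains transformers ...
    ("features", FeatureUnion([
        ("numeric_features", Pipeline([("selector", get_numeric_data)])),
        ("text_features", Pipeline([
            ("selector", get_text_data),
            ("vec", CountVectorizer()),
        ])),
    ])),
    # ... and the single classifier is the last step of the outer pipeline.
    ("clf", OneVsRestClassifier(LogisticRegression())),
])

model.fit(df, y)

# 2 numeric columns + 9 distinct tokens from the toy text column.
combined = model.named_steps["features"].transform(df)
print(combined.shape)

predictions = model.predict(df)
print(predictions)
```

Named functions are used instead of lambdas here only because they pickle more easily; the lambdas from the question should behave the same way inside the pipeline.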

thank you for your simple and clear explanation! it works now — santoku, Dec 11 '17 at 13:42
If not done so, still read Zac's [blog](http://zacstewart.com/2014/08/05/pipelines-of-featureunions-of-pipelines.html). It helped me to understand pipelines. The flow charts make it quite clear how FeatureUnion and Pipeline work. In fact I now sometimes draw similar ones, if my pipes get too complicated. — Marcus V., Dec 11 '17 at 15:54

score 6 · Answer 2 · answered Dec 11 '17 at 10:43

I believe @Ken Syme correctly identified the problem and provided a fix for what you intend to do. However, in case you actually intend to use the output of the classifier as a feature for a higher-level model, check out this blog.

Using the ModelTransformer by Zac, you can have your pipe as follows:

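from pandas import DataFrame
from sklearn.base import TransformerMixin

class ModelTransformer(TransformerMixin):
    # Wraps a model so that its predictions can be used as a feature.

    def __init__(self, model):
        self.model = model

    def fit(self, *args, **kwargs):
        self.model.fit(*args, **kwargs)
        return self

    def transform(self, X, **transform_params):
        # One column holding the wrapped model's predictions
        return DataFrame(self.model.predict(X))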


process_and_join_features = FeatureUnion(
         [
            ('numeric_features', Pipeline([
                ('selector', get_numeric_data),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ])),
             ('text_features', Pipeline([
                ('selector', get_text_data),
                ('vec', CountVectorizer()),
                ('clf', ModelTransformer(OneVsRestClassifier(LogisticRegression())))
            ]))
         ]
)

Depending on your concrete next steps you still may have to wrap the FeatureUnion in a Pipeline (e.g. using the shortcut make_pipeline).
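
A rough sketch of that wrapping, under several assumptions: the toy data and integer labels are made up, and this variant of ModelTransformer additionally inherits from BaseEstimator and stores a fitted clone as model_ (both are my additions, meant to follow scikit-learn's estimator conventions; they are not part of Zac's original class). The second-level model is a plain LogisticRegression chosen only for illustration.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin, clone
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, make_pipeline
from sklearn.preprocessing import FunctionTransformer


class ModelTransformer(BaseEstimator, TransformerMixin):
    """Expose a model's predictions as a one-column feature (assumed variant)."""

    def __init__(self, model):
        self.model = model

    def fit(self, X, y=None):
        # Fit a clone so the estimator passed to __init__ stays untouched.
        self.model_ = clone(self.model).fit(X, y)
        return self

    def transform(self, X):
        return pd.DataFrame(self.model_.predict(X))


# Hypothetical toy data; integer labels keep the stacked features numeric.
df = pd.DataFrame({
    "text": ["red apple", "green apple", "fast car",
             "slow car", "blue sky", "grey sky"],
    "num1": [1.0, 1.2, 5.0, 5.5, 9.0, 9.3],
    "num2": [0.1, 0.2, 0.9, 1.0, 2.0, 2.1],
})
y = [0, 0, 1, 1, 2, 2]


def select_text(X):
    return X["text"]


def select_numeric(X):
    return X[["num1", "num2"]]


stacked = FeatureUnion([
    ("numeric_features", make_pipeline(
        FunctionTransformer(select_numeric, validate=False),
        ModelTransformer(LogisticRegression()),
    )),
    ("text_features", make_pipeline(
        FunctionTransformer(select_text, validate=False),
        CountVectorizer(),
        ModelTransformer(LogisticRegression()),
    )),
])

# The union yields one prediction column per branch; the final
# estimator of the outer pipeline is trained on those two columns.
model = make_pipeline(stacked, LogisticRegression())
model.fit(df, y)

stacked_features = stacked.transform(df)
print(stacked_features.shape)

predictions = model.predict(df)
print(predictions)
```

Note that the first-level models here predict on the same rows they were trained on, so the second-level model sees optimistic features; for real stacking, out-of-fold predictions are usually preferable.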
