Use FeatureUnion in scikit-learn to combine two pandas columns for TF-IDF

Question

While using this as a model for spam classification, I'd like to add an additional feature of the Subject plus the body.

I have all of my features in a pandas dataframe. For example, the subject is df['Subject'], the body is df['body_text'] and the spam/ham label is df['ham/spam']

I receive the following error: TypeError: 'FeatureUnion' object is not iterable

How can I use both df['Subject'] and df['body_text'] as features all while running them through the pipeline function?

from sklearn.pipeline import FeatureUnion
features = df[['Subject', 'body_text']].values
combined_2 = FeatureUnion(list(features))

pipeline = Pipeline([
('count_vectorizer',  CountVectorizer(ngram_range=(1, 2))),
('tfidf_transformer',  TfidfTransformer()),
('classifier',  MultinomialNB())])

pipeline.fit(combined_2, df['ham/spam'])

k_fold = KFold(n=len(df), n_folds=6)
scores = []
confusion = numpy.array([[0, 0], [0, 0]])
for train_indices, test_indices in k_fold:
    train_text = combined_2.iloc[train_indices]
    train_y = df.iloc[train_indices]['ham/spam'].values

    test_text = combined_2.iloc[test_indices]
    test_y = df.iloc[test_indices]['ham/spam'].values

    pipeline.fit(train_text, train_y)
    predictions = pipeline.predict(test_text)
    prediction_prob = pipeline.predict_proba(test_text)

    confusion += confusion_matrix(test_y, predictions)
    score = f1_score(test_y, predictions, pos_label='spam')
    scores.append(score)

score 32 · Accepted Answer · edited Jun 08 '20 at 16:56

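FeatureUnion was not meant to be used that way. Its constructor takes a list of (name, transformer) pairs, typically feature extractors or vectorizers, applies each of them to the input, and concatenates their outputs. It does not take data in the constructor the way the question's code passes it.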
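To make the intended usage concrete, here is a minimal sketch of FeatureUnion applied to a plain list of strings. The toy texts and the step names `word_counts` and `char_counts` are made up for illustration; they are not from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion

# Hypothetical toy corpus, only for illustration
texts = ["win money now", "meeting at noon", "win a free prize"]

# The constructor receives transformers, not data
union = FeatureUnion([
    ("word_counts", CountVectorizer(analyzer="word")),
    ("char_counts", CountVectorizer(analyzer="char", ngram_range=(2, 3))),
])

# Data is passed at fit/transform time; the outputs are stacked column-wise
features = union.fit_transform(texts)
print(features.shape)
```

Both vectorizers see the same input here, which is why FeatureUnion alone does not solve the two-column problem.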

CountVectorizer expects a sequence of strings. The easiest way to provide that is to concatenate the two columns row by row, which passes the text from both columns to the same CountVectorizer.

combined_2 = df['Subject'] + ' ' + df['body_text']
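A small end-to-end sketch of the concatenation approach, plugged into the pipeline from the question. The dataframe contents are invented, and the `fillna('')` calls are an added assumption: without them, a missing subject or body would turn the whole concatenated value into NaN.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Hypothetical data with the column names used in the question
df = pd.DataFrame({
    "Subject": ["Win money", "Lunch today", None, "Project update"],
    "body_text": [
        "Claim your free prize now",
        "Are you free at noon?",
        "Cheap pills, click here",
        "Slides attached for review",
    ],
    "ham/spam": ["spam", "ham", "spam", "ham"],
})

# Assumption: treat missing text as an empty string before concatenating
combined = df["Subject"].fillna("") + " " + df["body_text"].fillna("")

pipeline = Pipeline([
    ("count_vectorizer", CountVectorizer(ngram_range=(1, 2))),
    ("tfidf_transformer", TfidfTransformer()),
    ("classifier", MultinomialNB()),
])

# The pipeline now receives a single Series of strings
pipeline.fit(combined, df["ham/spam"])
predictions = pipeline.predict(combined)
```

With this approach the subject and body share one vocabulary, so a word counts the same wherever it appears.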

An alternative method would be to run CountVectorizer and optionally TfidfTransformer individually on each column, and then stack the results.

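import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Fit one vectorizer per column (pass your own options in place of ...)
subject_vectorizer = CountVectorizer(...)
subject_vectors = subject_vectorizer.fit_transform(df['Subject'])

body_vectorizer = CountVectorizer(...)
body_vectors = body_vectorizer.fit_transform(df['body_text'])

# Stack the two sparse matrices side by side: one row per email
combined_2 = sp.hstack([subject_vectors, body_vectors], format='csr')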
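One thing to watch with manual stacking is the train/test split: the vectorizers should be fitted on the training rows only and then reused on the test rows, so both matrices have the same columns. A minimal sketch, assuming hypothetical `train` and `test` dataframes and default vectorizer settings:

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical train/test frames with the question's column names
train = pd.DataFrame({
    "Subject": ["Win money", "Lunch today", "Free prize"],
    "body_text": ["Claim your prize now", "Are you free at noon?", "Click here to win"],
})
test = pd.DataFrame({
    "Subject": ["Prize inside"],
    "body_text": ["Click now to claim money"],
})

subject_vectorizer = CountVectorizer()
body_vectorizer = CountVectorizer()

# Learn the vocabularies from the training rows only
train_features = sp.hstack([
    subject_vectorizer.fit_transform(train["Subject"]),
    body_vectorizer.fit_transform(train["body_text"]),
], format="csr")

# Reuse the fitted vocabularies so the test columns line up with training
test_features = sp.hstack([
    subject_vectorizer.transform(test["Subject"]),
    body_vectorizer.transform(test["body_text"]),
], format="csr")
```

Unlike plain concatenation, each column keeps its own vocabulary here, so the same word in the subject and in the body becomes two separate features.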

A third option is to implement your own transformer that would extract a dataframe column.

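from sklearn.base import BaseEstimator, TransformerMixin

class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select a single column from a dataframe."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless
        return self

    def transform(self, X, y=None):
        # Returns a 1-D Series, which is what CountVectorizer expects
        return X[self.column]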

In that case you could use FeatureUnion on two pipelines, each containing your custom transformer, then CountVectorizer.

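from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline, make_union

# One pipeline per column: select the column, then count its terms
subj_pipe = make_pipeline(
    DataFrameColumnExtracter('Subject'),
    CountVectorizer()
)

body_pipe = make_pipeline(
    DataFrameColumnExtracter('body_text'),
    CountVectorizer()
)

feature_union = make_union(subj_pipe, body_pipe)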

This feature union of pipelines takes the dataframe, and each pipeline processes its own column. The result is the column-wise concatenation of the term count matrices from the two columns.

sparse_matrix_of_counts = feature_union.fit_transform(df)

This feature union can also be added as the first step in a larger pipeline.
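As a rough sketch of what that larger pipeline could look like, the example below puts the union in front of MultinomialNB and scores it with cross-validation. Several things here are assumptions rather than part of the original answer: the toy data, the hypothetical helper `column_tfidf`, the choice of `cross_val_score` with 3 folds, and the addition of `BaseEstimator` to the extractor so that scikit-learn can clone it during cross-validation.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline, make_union


class DataFrameColumnExtracter(TransformerMixin, BaseEstimator):
    """Select one dataframe column; BaseEstimator makes it clonable."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self

    def transform(self, X, y=None):
        return X[self.column]


# Hypothetical data with the question's column names
df = pd.DataFrame({
    "Subject": ["Win money", "Lunch today", "Free prize",
                "Project update", "Cheap pills", "Meeting notes"],
    "body_text": ["Claim your prize now", "Are you free at noon?",
                  "Click here to win", "Slides attached for review",
                  "Buy now, click here", "Notes from the review meeting"],
    "ham/spam": ["spam", "ham", "spam", "ham", "spam", "ham"],
})


def column_tfidf(column):
    # Hypothetical helper: extract a column, count n-grams, apply TF-IDF
    return make_pipeline(
        DataFrameColumnExtracter(column),
        CountVectorizer(ngram_range=(1, 2)),
        TfidfTransformer(),
    )


# The union is the first step; the classifier consumes the stacked features
model = make_pipeline(
    make_union(column_tfidf("Subject"), column_tfidf("body_text")),
    MultinomialNB(),
)

# The whole dataframe is passed in; each fold refits every vectorizer
scores = cross_val_score(model, df, df["ham/spam"], cv=3, scoring="accuracy")
```

Because the vectorizers live inside the pipeline, they are refitted on each training fold, which avoids leaking test vocabulary into training.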

I feel this is a good reference for the same as well. [FeatureUnion](http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html#example-hetero-feature-union-py) — Pramit, Jun 02 '16 at 02:15
Exactly what I've been looking for. I wonder if this should have been part of sklearn out of the box. — pckben, Sep 28 '17 at 17:49
@David I have tried your third option but it returns a "ValueError: Expected 2D array, got 1D array instead" — Stamatis Tiniakos, Jun 28 '19 at 10:38
