0

I have a pandas data frame with 56 columns. Around half of the columns are float and the others are string(textual data) and finally col56 is the label column. The dataset looks something like this

Col1 Col2...Col26 Col27       Col 28   ..... Col55     Col 56
1    4      76    I like cats Cats are cool  Cat bags  1
.
.
.
1900 rows

I want to use both numeric and textual data to run classification algorithms. A quick google search told that the best way to proceed is by using Feature Union

This is the code so far

import pandas as pd
import numpy as np
from sklearn.preprocessing import FunctionTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import SVC
from sklearn.pipeline import FeatureUnion
from sklearn.feature_extraction.text import CountVectorizer

df=pd.read_csv('url')
X=df[[Col1...Col55]]
y=df[[Col56]]
from sklearn.model_selection import train_test_split
stop_list=(i, am, the...)
X_train,X_test,y_train,y_test=train_test_split(X,y,test_size=0.3,random_state=42)
pipeline = Pipeline([
    ('union',FeatureUnion([
        ('Col1', Pipeline([
            ('selector', ItemSelector(column='Col1')),
            ('caster', ArrayCaster())
            ])),
.
.
.
.
.
        ('Col27',Pipeline([
            ('selector', ItemSelector(column='Col27')),
            ('vectorizer', CountVectorizer())
            ])), 
.
.
. 
        ('Col55',Pipeline([
            ('selector', ItemSelector(column='Col55')),
            ('vectorizer', CountVectorizer())
            ]))
])),
('model',SVC())
])

Then I get an error

TypeError                                 Traceback (most recent call last)
<ipython-input-8-7a2cab7bed7d> in <module>
    167         (' Col27',Pipeline([
    168             ('selector', ItemSelector(column=' Col27')),
--> 169             ('vectorizer', CountVectorizer(stop_words=stop_list))
    170         ]))

TypeError: 'tuple' object is not callable

I don't understand since the exact same method is used here and here And there doesn't seem any error. What am I doing wrong? How can I fix this?

1 Answers1

0

I think the issue is with CountVectorizer.

    cv = CountVectorizer
    word_count_vector = cv.fit_transform(data)
    word_count_vector = cv.shape()

This produces the same error as you. You could actually do the stuff manually. Use CountVectorizer to create a sparse matrix of the data and align it with your numerical data matrix or dataframe by using spare.hstack from scipy. It horizontally stacks the two matrices with equal rows and equal/different columns.

MOR
  • 11
  • 4