I have a pandas DataFrame with 56 columns. Around half of the columns are floats and the rest are strings (textual data); Col56 is the label column. The dataset looks something like this:
Col1  Col2  ...  Col26  Col27        Col28          ...  Col55     Col56
1     4     ...  76     I like cats  Cats are cool  ...  Cat bags  1
.
.
.
1900 rows
I want to use both the numeric and the textual data to run classification algorithms. A quick Google search suggested that the best way to proceed is to use FeatureUnion.
This is the code so far:
import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer
from sklearn.svm import SVC

# ItemSelector and ArrayCaster are custom transformers; their definitions are not shown here.

df = pd.read_csv('url')
X = df.drop(columns=['Col56'])  # feature columns Col1 ... Col55
y = df['Col56']                 # label column
stop_list = ['i', 'am', 'the', ...]  # custom stop words (rest elided)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

pipeline = Pipeline([
    ('union', FeatureUnion([
        ('Col1', Pipeline([
            ('selector', ItemSelector(column='Col1')),
            ('caster', ArrayCaster())
        ])),
        # ... (further columns elided)
        ('Col27', Pipeline([
            ('selector', ItemSelector(column='Col27')),
            ('vectorizer', CountVectorizer(stop_words=stop_list))
        ])),
        # ... (further columns elided)
        ('Col55', Pipeline([
            ('selector', ItemSelector(column='Col55')),
            ('vectorizer', CountVectorizer())
        ]))
    ])),
    ('model', SVC())
])
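The pipeline above uses ItemSelector and ArrayCaster without showing them. A minimal sketch of what such helpers could look like, assuming ItemSelector picks a single DataFrame column and ArrayCaster turns a numeric column into an (n, 1) float array. Both class bodies and the toy frame are assumptions made for illustration, not the definitions used in the question:

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class ItemSelector(TransformerMixin, BaseEstimator):
    """Hypothetical helper: return one column of a DataFrame as a Series."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return X[self.column]


class ArrayCaster(TransformerMixin, BaseEstimator):
    """Hypothetical helper: reshape a numeric column to shape (n_samples, 1)."""

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X, dtype=float).reshape(-1, 1)


# Toy data standing in for the real 56-column frame (assumed values).
toy = pd.DataFrame({
    "Col1": [1.0, 2.0, 3.0],
    "Col27": ["I like cats", "Cats are cool", "Cat bags"],
})

union = FeatureUnion([
    ("Col1", Pipeline([
        ("selector", ItemSelector(column="Col1")),
        ("caster", ArrayCaster()),
    ])),
    ("Col27", Pipeline([
        ("selector", ItemSelector(column="Col27")),
        ("vectorizer", CountVectorizer()),
    ])),
])

# One numeric column stacked next to the bag-of-words columns.
features = union.fit_transform(toy)
print(features.shape)
```

Every entry in the FeatureUnion list is a `(name, transformer)` tuple, and the entries are separated by commas.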
Then I get this error:
TypeError Traceback (most recent call last)
<ipython-input-8-7a2cab7bed7d> in <module>
167 (' Col27',Pipeline([
168 ('selector', ItemSelector(column=' Col27')),
--> 169 ('vectorizer', CountVectorizer(stop_words=stop_list))
170 ]))
TypeError: 'tuple' object is not callable
I don't understand, since the exact same method is used here and here, and there doesn't seem to be any error there. What am I doing wrong? How can I fix this?
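For reference, Python raises this exact message whenever a tuple is immediately followed by another parenthesised expression, which is what happens when the comma between two `(name, step)` entries in a list is missing. Whether that is the cause here cannot be told from the elided parts of the pipeline, so the following is only a minimal sketch of the mechanism, with made-up step names and no scikit-learn involved:

```python
import warnings

# Two list entries with the separating comma missing after the first one.
# The source is kept in a string so the mistake is only evaluated at run time.
snippet = """
steps = [
    ('Col1', 'numeric branch')
    ('Col27', 'text branch'),
]
"""

message = None
with warnings.catch_warnings():
    # Newer Python versions also emit a SyntaxWarning hinting at the missing comma.
    warnings.simplefilter("ignore", SyntaxWarning)
    try:
        exec(snippet)
    except TypeError as exc:
        # Python tried to call the first tuple with the second one as arguments.
        message = str(exc)

print(message)
```

If this applies, the place to look would be the end of the entry just before the one the traceback points at.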