
I have a data frame with 4 different groups of features.

I need to create 4 different models with these four different feature groups and combine them with the ensemble voting classifier. Furthermore, I need to test the classifier using k-fold cross validation.

However, I am finding it difficult to combine different feature sets, a voting classifier, and k-fold cross validation using the functionality available in sklearn. The following is the code that I have so far.

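from sklearn import preprocessing, svm
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Labels are taken from the index of df1; every column is used as a feature.
y = df1.index
x = preprocessing.scale(df1)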

SVM = svm.SVC(kernel='rbf', C=1)
rf=RandomForestClassifier(n_estimators=200)
ann = MLPClassifier(solver='lbfgs', alpha=1e-5, hidden_layer_sizes=(25, 2), random_state=1)
neigh = KNeighborsClassifier(n_neighbors=10)

models = list()
models.append(('facial', SVM))
models.append(('posture', rf))
models.append(('computer', ann))
models.append(('physio', neigh))

ens = VotingClassifier(estimators=models)

cv = KFold(n_splits=10, random_state=None, shuffle=True)
scores = cross_val_score(ens, x, y, cv=cv, scoring='accuracy')

As you can see, this program uses the same features for all 4 models. How can I change it so that each model is trained on its own feature group?

Chamila Wijayarathna
  • Are you getting any error from your code? – Parthasarathy Subburaj May 28 '20 at 13:52
  • This works fine, but my objective is to use different groups of features for each model. Here all models use all the features available in my dataset. – Chamila Wijayarathna May 28 '20 at 13:55
  • This might be helpful https://stackoverflow.com/questions/45074579/votingclassifier-different-feature-sets – Parthasarathy Subburaj May 28 '20 at 14:10
  • I already referred to this; however, the answers posted there do not use k-fold cross validation – Chamila Wijayarathna May 28 '20 at 14:13
  • You need to add a column selection step before each estimator. See [the example here](https://scikit-learn.org/stable/auto_examples/compose/plot_column_transformer_mixed_types.html#use-columntransformer-by-selecting-column-by-names). So your final `VotingClassifier` will have a list of pipelines (one for each column selector and estimator). Try to implement this approach. If you are still unable to solve it, I will post an answer. – Vivek Kumar May 28 '20 at 16:12
  • I managed to get the cross validation part working, but I am not sure how to create the pipeline with ColumnTransformer. I tried ColumnSelector from 'mlxtend', but I am getting a TypeError saying "argument of type 'ColumnSelector' is not iterable". https://gist.github.com/cdwijayarathna/5425919a39dea2f8e9d8bf79c02d544d – Chamila Wijayarathna May 28 '20 at 16:42
  • @VivekKumar I updated the code to follow the example you provided, https://gist.github.com/cdwijayarathna/3dd073cf3ab99b9e757b82e701f67525. However, I am still getting "TypeError: argument of type 'ColumnTransformer' is not iterable". What am I missing here? – Chamila Wijayarathna May 28 '20 at 17:15
  • I did manage to get it to work, https://stackoverflow.com/questions/62079006/sklearn-pipeline-argument-of-type-columntransformer-is-not-iterable/62079963#62079963 – Chamila Wijayarathna May 29 '20 at 06:32

1 Answer


I managed to achieve this using pipelines, with one pipeline per feature group:

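from sklearn.compose import ColumnTransformer
from sklearn.ensemble import VotingClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import KFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Keep x as a DataFrame so that ColumnTransformer can select columns by name.
# Scaling now happens inside each pipeline instead of on the whole frame.
y = df1.index
x = df1

# Physiological feature group: impute, then scale, only these columns.
phy_features = ['A', 'B', 'C']
phy_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
phy_processer = ColumnTransformer(transformers=[('phy', phy_transformer, phy_features)])

# Facial feature group: same preprocessing on a different set of columns.
fa_features = ['D', 'E', 'F']
fa_transformer = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')), ('scaler', StandardScaler())])
fa_processer = ColumnTransformer(transformers=[('fa', fa_transformer, fa_features)])

# One pipeline per feature group, each with its own classifier instance.
pipe_phy = Pipeline(steps=[('preprocessor', phy_processer), ('classifier', SVC(kernel='rbf', C=1))])
pipe_fa = Pipeline(steps=[('preprocessor', fa_processer), ('classifier', SVC(kernel='rbf', C=1))])

# VotingClassifier expects a list of (name, estimator) tuples.
ens = VotingClassifier(estimators=[('phy', pipe_phy), ('fa', pipe_fa)])

cv = KFold(n_splits=10, random_state=None, shuffle=True)
for train_index, test_index in cv.split(x):
    # Positional indexing with .iloc, because x is a DataFrame.
    x_train, x_test = x.iloc[train_index], x.iloc[test_index]
    y_train, y_test = y[train_index], y[test_index]
    ens.fit(x_train, y_train)
    print(ens.score(x_test, y_test))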

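If you receive a TypeError when using ColumnTransformer, please refer to "sklearn Pipeline: argument of type 'ColumnTransformer' is not iterable" (https://stackoverflow.com/questions/62079006/sklearn-pipeline-argument-of-type-columntransformer-is-not-iterable/62079963#62079963).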
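
Since the ensemble of pipelines is itself an estimator, it should also be possible to pass it straight to cross_val_score, as in the code from the question, instead of writing the fold loop by hand. The following is only a minimal sketch under assumptions: the data is synthetic, and the column names, the three feature groups, the group_pipeline helper, and the classifier settings are hypothetical stand-ins for the real df1 and models.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.model_selection import KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for df1 (hypothetical): six features, binary labels.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(120, 6)), columns=list("ABCDEF"))
y = (X["A"] + X["D"] > 0).astype(int).to_numpy()


def group_pipeline(columns, classifier):
    """Scale only `columns` (all other columns are dropped), then classify."""
    selector = ColumnTransformer([("scale", StandardScaler(), columns)])
    return Pipeline([("preprocessor", selector), ("classifier", classifier)])


# Each model sees only its own feature group; names and groups are hypothetical.
ens = VotingClassifier(estimators=[
    ("physio", group_pipeline(["A", "B"], KNeighborsClassifier(n_neighbors=10))),
    ("facial", group_pipeline(["C", "D"], SVC(kernel="rbf", C=1))),
    ("posture", group_pipeline(["E", "F"],
                               RandomForestClassifier(n_estimators=50, random_state=0))),
])

# cross_val_score clones and refits the whole ensemble on every fold, so the
# scalers are fitted on the training part of each fold only.
cv = KFold(n_splits=10, shuffle=True, random_state=0)
scores = cross_val_score(ens, X, y, cv=cv, scoring="accuracy")
print(scores.mean())
```

Keeping the scaling inside each pipeline, rather than scaling the whole data frame up front, means the test fold does not influence the preprocessing statistics.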

Chamila Wijayarathna