
This is a fairly straightforward question, and I don't think I can add much beyond asking it directly: how can I combine Pipeline with cross_val_score for a multiclass problem?

I was working on a multiclass problem at work (which is why I won't share any data, but you can think of it as something like the iris dataset), where I needed to classify some texts according to their topic. This is what I was doing:

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.metrics import classification_report
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier()),
    ]
)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)

print(classification_report(y_test, y_pred))

However, I'm a little worried about overfitting (even though I'm evaluating on the test set), so I wanted to make the analysis more rigorous and add cross validation. The problem is that I don't know how to add cross_val_score to the pipeline, nor how to evaluate a multiclass problem with cross validation. I saw this answer, so I added this to my script:

from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5)
scores = cross_val_score(pipe, X_train, y_train, cv=cv)

The problem is that this only gives me accuracy, which is not a great metric for classification problems.

Are there any alternatives? Is it possible to run cross validation and get something other than accuracy? Or should I stick with accuracy, and is there some reason that isn't a problem?

I know the question is a bit broad, and it's not only about cross validation; I hope that's not an issue.

Thanks in advance

Yuxxxxxx

1 Answer


It is almost always advisable to use cross validation to choose your model/hyperparameters, and then to use an independent hold-out test set to evaluate the performance of the final model.

The good news is that you can do exactly what you wish to do, all within scikit-learn! Something like this:

import numpy as np

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier()),
    ]
)

# Parameters of pipelines can be set using '__' separated parameter names:
param_grid = {
    'feature_selection__k': np.linspace(4, 16, 4, dtype=int),  # number of features kept by SelectKBest (must be an int)
    'reg__n_estimators': [10, 30, 50, 100, 200],   # n_estimators in RandomForestClassifier
    'reg__min_samples_leaf': [2, 5, 10, 50],       # min_samples_leaf in RandomForestClassifier
}

# This defines the grid search with "Area Under the ROC Curve" as the scoring metric.
# For a multiclass problem, use the one-vs-rest variant 'roc_auc_ovr' (plain 'roc_auc' only works for binary targets).
# More options here: https://scikit-learn.org/stable/modules/model_evaluation.html#scoring-parameter
search = GridSearchCV(pipe, param_grid, scoring='roc_auc_ovr')

search.fit(X_train, y_train)
print("Best parameters (CV score={:.3f}):".format(search.best_score_))
print(search.best_params_)
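
Once the search has finished, you can evaluate the tuned model on the independent hold-out test set. A minimal sketch, assuming the same X_test/y_test split from the question and the default refit=True (so search has already been refitted on the full training data with the best parameters):

from sklearn.metrics import classification_report

# With refit=True (the default), GridSearchCV refits the best pipeline on all of
# X_train/y_train, so `search` can be used directly as the tuned estimator.
y_pred = search.predict(X_test)
print(classification_report(y_test, y_pred))  # per-class precision, recall and F1 on the hold-out set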

See here for even more details.

And if you want to define your own scoring metric for multi-class problems rather than using AUC or some other built-in metric, see the documentation under the scoring parameter on this page, but that's all I can recommend without knowing what metric you're trying to optimize.
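
As a rough sketch of that route: scikit-learn's make_scorer wraps an arbitrary metric function so it can be passed anywhere a scoring string is accepted (the macro-averaged F1 below is just a placeholder choice, not a recommendation for your specific problem):

from sklearn.metrics import f1_score, make_scorer

# Macro-averaged F1 weights every class equally, which is often a sensible
# default for multiclass text classification.
macro_f1 = make_scorer(f1_score, average='macro')

search = GridSearchCV(pipe, param_grid, scoring=macro_f1)

The same scorer object can also be passed to cross_val_score or cross_validate if you prefer plain cross validation over a grid search.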

TC Arlen