This is a pretty straightforward question, to which I don't think I can add much beyond asking it directly: how can I combine Pipeline with cross_val_score for a multiclass problem?
I was working on a multiclass problem at work (which is why I can't share any data, but you can think of it as something like the iris dataset), where I needed to classify texts according to their topic. This is what I was doing:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier()),
    ]
)

pipe.fit(X_train, y_train)
y_pred = pipe.predict(X_test)
print(classification_report(y_test, y_pred))
However, I'm a little worried about overfitting (even though I'm evaluating on a held-out test set), and I wanted to make the analysis more rigorous by adding cross-validation. The problem is that I don't know how to use cross_val_score with the pipeline, nor how to evaluate a multiclass problem with cross-validation. I saw this answer, and so I added this to my script:
from sklearn.model_selection import KFold, cross_val_score

cv = KFold(n_splits=5)
scores = cross_val_score(pipe, X_train, y_train, cv=cv)
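To make what I'm seeing reproducible, here is a minimal self-contained sketch; the sentences and topic labels below are made up for illustration, standing in for my real data:

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_val_score

# Toy multiclass text data in place of my real texts/topics
X_train = [
    "the cat sat on the mat", "dogs are loyal pets",      # pets
    "stock prices rose today", "the market fell sharply",  # finance
    "the team won the game", "players scored two goals",   # sports
] * 5
y_train = ["pets", "pets", "finance", "finance", "sports", "sports"] * 5

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier(random_state=0)),
    ]
)

cv = KFold(n_splits=5)
scores = cross_val_score(pipe, X_train, y_train, cv=cv)
print(scores)  # five accuracy values, one per fold
```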
The problem is that this reports only accuracy, which is often not the most informative metric for classification problems.
Are there any alternatives? Is it possible to run cross-validation and get metrics other than accuracy? Or should I stick with accuracy, and is there some reason it isn't a problem here?
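For context, from the scikit-learn docs it looks like cross_validate accepts several scorers at once, with macro-averaged variants defined for multiclass targets, but I'm not sure this is the right approach. A sketch of what I mean (again with made-up toy data in place of my real dataset):

```python
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import KFold, cross_validate

# Invented toy texts/topics for illustration only
X_train = [
    "the cat sat on the mat", "dogs are loyal pets",
    "stock prices rose today", "the market fell sharply",
    "the team won the game", "players scored two goals",
] * 5
y_train = ["pets", "pets", "finance", "finance", "sports", "sports"] * 5

pipe = Pipeline(
    steps=[
        ("vect", CountVectorizer()),
        ("feature_selection", SelectKBest(chi2, k=10)),
        ("reg", RandomForestClassifier(random_state=0)),
    ]
)

# Request several metrics in one cross-validation run; macro averaging
# treats every class equally, which seems relevant for multiclass
cv = KFold(n_splits=5)
results = cross_validate(
    pipe, X_train, y_train, cv=cv,
    scoring=["accuracy", "f1_macro", "precision_macro", "recall_macro"],
)
print(results["test_f1_macro"])  # one macro-averaged F1 value per fold
```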
I know the question got a bit broad, and it's actually not only about cross-validation; I hope that's not an issue.
Thanks in advance