I built a simple stacking classifier with mlxtend and am trying different base classifiers, and I am facing an interesting situation. From all my research, it seems that a stacking classifier is expected to perform at least as well as its best base classifier.

In my case, when I cross-validate the stacking classifier on the training set, I get a lower score than some of the base estimators. In addition, the stacking classifier's average CV score often comes out equal to the lowest of the base estimators' average CV scores.

Isn't this weird? Even more strangely, once I run a GridSearchCV on the stacking classifier, select the best parameters, retrain on the entire training set, and finally compute accuracy on the test set, I actually get a pretty good score.

I know this method is prone to leakage, and there are other techniques to cross-validate a stacking classifier, but they seem to be extremely slow, and from my research the above approach seems acceptable. (About this potential leakage, this Kaggle guide to stacking even says, "In practice, everyone ignores this theoretical hole (and frankly I think most people are unaware it even exists!)"; see the parameter tuning paragraph of http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/)
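To be concrete, the slow leakage-free alternative I am referring to is nesting the grid search inside an outer CV loop. A minimal sketch of what I mean, reusing the sclf and param_grid defined further down:

from sklearn.model_selection import GridSearchCV, cross_val_score

# Nested CV: GridSearchCV retunes the meta-classifier on every outer training
# fold, so no outer test fold ever influences the tuning.
nested_gs = GridSearchCV(sclf, param_grid, cv=3)
nested_scores = cross_val_score(nested_gs, X_train, y_train, cv=5)
print("Nested CV Accuracy: %0.2f (+/- %0.2f)"
      % (nested_scores.mean(), nested_scores.std()))

With 5 outer folds this refits the entire grid search (and every internal fold of the StackingCVClassifier) five times, which is why it is so slow.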
import numpy as np
from mlxtend.classifier import StackingCVClassifier
from sklearn import model_selection, preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report
RANDOM_SEED = 12

# df is imported in a separate code snippet
y = df['y']
X = df.drop(columns=['y'])

scaler = preprocessing.StandardScaler().fit(X)
X_transformed = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, random_state=4)
def gridSearch_clf(clf, param_grid, X_train, y_train):
    gs = GridSearchCV(clf, param_grid).fit(X_train, y_train)
    print("Best Parameters")
    print(gs.best_params_)
    return gs.best_estimator_

def gs_report(y_test, X_test, best_estimator):
    print(classification_report(y_test, best_estimator.predict(X_test)))
    print("Overall Accuracy Score: ")
    print(accuracy_score(y_test, best_estimator.predict(X_test)))
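# NOTE: print_cv is not shown in this snippet; it is assumed to follow the
# cross-validation loop from the mlxtend StackingCVClassifier docs
# (cv=5 here is a guess):
def print_cv(clfs, clf_names):
    for clf, name in zip(clfs, clf_names):
        scores = model_selection.cross_val_score(clf, X_train, y_train,
                                                 cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]"
              % (scores.mean(), scores.std(), name))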
lr = LogisticRegression()
np.random.seed(RANDOM_SEED)

# best_clf1/2/3 are the grid-searched base estimators from an earlier snippet
# (judging by the CV output below: a decision tree, a KNN, and a Bernoulli NB)
sclf = StackingCVClassifier(classifiers=[best_clf1, best_clf2, best_clf3],
                            meta_classifier=lr)
clfs = [best_clf1, best_clf2, best_clf3, sclf]
clf_names = [i.__class__.__name__ for i in clfs]
print_cv(clfs, clf_names)
Accuracy: 0.68 (+/- 0.30) [Decision Tree Classifier]
Accuracy: 0.55 (+/- 0.26) [K Neighbors Classifier]
Accuracy: 0.67 (+/- 0.32) [Bernoulli Naive Bayes]
Accuracy: 0.55 (+/- 0.26) [StackingClassifier]
## StackingClassifier Accuracy = KNN Classifier Accuracy
# C grid: [0.01, 0.1, 1, 10, 100, 1000]
param_grid = {'meta-logisticregression__C': np.logspace(-2, 3, num=6, base=10)}
best_sclf = gridSearch_clf(sclf, param_grid, X_train, y_train)
gs_report(y_test, X_test, best_sclf)
Best Parameters
{'meta-logisticregression__C': 0.1}
             precision    recall  f1-score   support

          0       0.91      0.99      0.95      9131
          1       0.68      0.22      0.33      1166

avg / total       0.88      0.90      0.88     10297
Overall Accuracy Score:
0.9000679809653297