
I would like to reproduce the sklearn SelectKBest results obtained with GridSearchCV by performing the grid-search CV myself. However, my code produces different results. Here is a reproducible example:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC
import itertools

r = 1
X, y = make_classification(n_samples = 50, n_features = 20, weights = [3/5], random_state = r)
np.random.seed(r)
X = np.random.rand(X.shape[0], X.shape[1])  # replace the features with uniform noise; y keeps its original class balance

K = [1,3,5]
C = [0.1,1]
cv = StratifiedKFold(n_splits = 10)
space = dict()
space['anova__k'] = K
space['svc__C'] = C    
clf = Pipeline([('anova', SelectKBest()), ('svc', SVC(probability = True, random_state = r))])
search = GridSearchCV(clf, space, scoring = 'roc_auc', cv = cv, refit = True, n_jobs = -1)
result = search.fit(X, y)

print('GridSearchCV results:')
print(result.cv_results_['mean_test_score'])

scores = []
for train_indx, test_indx in cv.split(X, y):
    X_train, y_train = X[train_indx,:], y[train_indx]
    X_test, y_test = X[test_indx,:], y[test_indx]
    scores_ = []
    for k, c in itertools.product(K, C):
        anova = SelectKBest(k = k)
        X_train_k = anova.fit_transform(X_train, y_train)
        clf = SVC(C = c, probability = True, random_state = r).fit(X_train_k, y_train)
        y_pred = clf.predict_proba(anova.transform(X_test))[:, 1]
        scores_.append(roc_auc_score(y_test, y_pred))
    scores.append(scores_)
    
print('Manual grid-search CV results:')    
print(np.mean(np.array(scores), axis = 0)) 

For me, this produces the following output:

GridSearchCV results:
[0.41666667 0.4        0.4        0.4        0.21666667 0.26666667]
Manual grid-search CV results:
[0.58333333 0.6        0.53333333 0.46666667 0.48333333 0.5       ]

When I use the make_classification dataset directly (i.e. without replacing X), the two outputs match. On the other hand, when X is replaced with np.random.rand values, the scores differ.

Is there some random process that I am not aware of underneath?


2 Answers


Edit: restructured my answer, since it seems you are after more of a "why?" and "how should I?" vs a "how can I?"

The Issue

The scorer that you're using in GridSearchCV isn't being passed the output of predict_proba as it is in your loop version. It's being passed the output of decision_function. For SVMs, the argmax of the probabilities may differ from the argmax of the decision values, as described here (see also the sketch after the quoted passage):

The cross-validation involved in Platt scaling is an expensive operation for large datasets. In addition, the probability estimates may be inconsistent with the scores:

  • the “argmax” of the scores may not be the argmax of the probabilities

  • in binary classification, a sample may be labeled by predict as belonging to the positive class even if the output of predict_proba is less than 0.5; and similarly, it could be labeled as negative even if the output of predict_proba is more than 0.5.

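As a quick sanity check (my own sketch, not part of your post or the linked docs), you can see this directly on an SVC: with probability=True, the ordering of predict_proba can disagree with the ordering of decision_function, and whenever the orderings disagree the two ROC-AUC values will differ.

import numpy as np
from sklearn.svm import SVC

# Illustrative data only; the names below are made up for this sketch.
rng = np.random.RandomState(0)
X_train, y_train = rng.rand(40, 5), rng.randint(0, 2, 40)
X_test = rng.rand(10, 5)

svm = SVC(probability=True, random_state=0).fit(X_train, y_train)
proba = svm.predict_proba(X_test)[:, 1]   # Platt-scaled estimates (fit via internal 5-fold CV)
scores = svm.decision_function(X_test)    # signed distances to the separating hyperplane

# If these two orderings differ, roc_auc_score computed from them will differ too.
print(np.argsort(proba))
print(np.argsort(scores))
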
How I would Fix It

Use SVC(probability = False, ...) in both the Pipeline/GridSearchCV approach and the loop, and use decision_function in the loop instead of predict_proba. According to the blurb above, this will also speed up your code.
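Something along these lines should do it (just a sketch, reusing X, y, K, C, cv, space, r and the imports from your question; the two printed arrays should now agree):

# Pipeline/GridSearchCV side: probability=False, default 'roc_auc' scorer,
# which scores on decision_function when the estimator provides one.
clf = Pipeline([('anova', SelectKBest()), ('svc', SVC(probability=False, random_state=r))])
search = GridSearchCV(clf, space, scoring='roc_auc', cv=cv, refit=True, n_jobs=-1).fit(X, y)

# Manual loop: same estimator settings, scored on decision_function output.
scores = []
for train_indx, test_indx in cv.split(X, y):
    scores_ = []
    for k, c in itertools.product(K, C):
        anova = SelectKBest(k=k)
        X_train_k = anova.fit_transform(X[train_indx], y[train_indx])
        svm = SVC(C=c, probability=False, random_state=r).fit(X_train_k, y[train_indx])
        y_score = svm.decision_function(anova.transform(X[test_indx]))
        scores_.append(roc_auc_score(y[test_indx], y_score))
    scores.append(scores_)

print(search.cv_results_['mean_test_score'])
print(np.mean(np.array(scores), axis=0))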

My Original, Literal Answer to Your Question

To make your loop match GridSearchCV, leaving the GridSearchCV approach alone:

y_pred = clf.decision_function(anova.transform(X_test)) # instead of predict_proba

To make GridSearchCV match your loop, leaving the loop code alone, pass a scorer built with needs_proba=True so that it is fed predict_proba output:

from sklearn.metrics import make_scorer
roc_auc_scorer = make_scorer(roc_auc_score, greater_is_better=True, needs_proba=True)
search = GridSearchCV(clf, space, scoring = roc_auc_scorer, cv = cv, refit = True, n_jobs = -1)
Mark H
  • I see, this is correct, thank you. However, how is this not an issue? I would assume that a ROC-AUC score (given the same data, same machine, etc.) would be the same regardless of the method used. Say I have trained two classifiers on the same data, on the same machine, using those two methods: can I compare their ROC-AUC scores against each other? Or do I have to re-do the training for one of them? And if so, which one is the "correct" one? – User User Mar 15 '21 at 12:25
  • The ROC-AUC scorer is at the mercy of the data you give it. Based on this article: https://scikit-learn.org/stable/modules/svm.html#scores-probabilities I would set probability=False and use decision_function instead. No idea if your past training is equivalent or not, because, as you demonstrated, sometimes it matches and sometimes it doesn't. I was able to demonstrate a period of matching and a period of non-matching by slowly adding noise to X instead of completely replacing it with random values. – Mark H Mar 15 '21 at 15:52

The key difference between your implementation and the way GridSearchCV operates is that

  • GridSearchCV uses the decision_function method for computing the roc_auc.

  • In your implementation, predict_proba is used.

Just change the following line:

        y_pred = clf.decision_function(anova.transform(X_test))

You will get the same results for both approaches after that.

GridSearchCV results:
[0.41666667 0.4        0.4        0.4        0.21666667 0.26666667]
Manual grid-search CV results:
[0.41666667 0.4        0.4        0.4        0.21666667 0.26666667]

More explanation about the scoring in GridSearchCV here.
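For instance, you can inspect what the string 'roc_auc' resolves to (a quick check of my own, not from the linked page; the exact repr depends on your sklearn version):

from sklearn.metrics import get_scorer

# 'roc_auc' is a threshold-based scorer: it scores on decision_function when the
# estimator provides one and only falls back to predict_proba otherwise.
print(get_scorer('roc_auc'))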

This inconsistency is documented under SVC's probability parameter:

probability bool, default=False

Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide.

This is probably the reason why there is no difference when you use the make_classification dataset directly: with features drawn from Gaussian distributions, the 5-fold-CV-based probability estimates happen to rank the samples much like the decision_function output, so both give the same ROC-AUC. With np.random.rand() features, the Platt-scaling estimates can rank the samples quite differently.
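If you want to verify this, a small sanity check (my own sketch, reusing r, cv and the imports from the question) is to score one CV split of each dataset with both outputs and compare:

# Gaussian features from make_classification vs. the uniform-noise replacement.
X_g, y_g = make_classification(n_samples=50, n_features=20, weights=[3/5], random_state=r)
np.random.seed(r)
X_u = np.random.rand(*X_g.shape)

for name, X_ in [('gaussian', X_g), ('uniform', X_u)]:
    train_idx, test_idx = next(cv.split(X_, y_g))
    svm = SVC(probability=True, random_state=r).fit(X_[train_idx], y_g[train_idx])
    auc_proba = roc_auc_score(y_g[test_idx], svm.predict_proba(X_[test_idx])[:, 1])
    auc_decision = roc_auc_score(y_g[test_idx], svm.decision_function(X_[test_idx]))
    print(name, auc_proba, auc_decision)

When the two AUC values differ for a split, the Platt-scaled probabilities have ranked the test samples differently from the decision scores on that split.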

Venkatachalam
    This is correct, thank you. So say I am comparing two classifiers trained on the same data and same machine, using the two methods above, by comparing the respective ROC-AUC scores. Is this wrong? If so, and I have to repeat the training for one of them, which one is the "correct" one? – User User Mar 15 '21 at 12:48
  • Based on the recommendation from sklearn, if your metric just requires some measure of likelihood, then go with decision_function. This would give you a computational advantage as well. – Venkatachalam Mar 15 '21 at 12:50