When using sklearn's GridSearchCV over SVC(probability=True), radically different predictions/models are returned when the training data is small and balanced versus small and unbalanced. Consider this example:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import datasets
import numpy as np  # used for the cross-validation index arrays shown later
iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
# Balanced training set: four samples each from class 0 and class 1
index = [0, 1, 2, 3, 51, 52, 53, 54]
# Unbalanced training set: one extra class-1 sample (index 55)
index_unequal = [0, 1, 2, 3, 51, 52, 53, 54, 55]
# Held-out points used only for prediction below
new_predictions = [5, 6, 7, 56, 57, 58]
pred_mat, pred_y = X[new_predictions], y[new_predictions]
c_s = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
svc_params = [{'kernel': ['rbf'], 'gamma': gamma, 'C': c_s},
              {'kernel': ['linear'], 'C': c_s}]
mat, ye = X[index], y[index]
mat_unequal, y_unequal = X[index_unequal], y[index_unequal]
balanced = GridSearchCV(SVC(probability=True), svc_params, cv=4).fit(mat, ye)
unbalanced = GridSearchCV(SVC(probability=True), svc_params, cv=4).fit(mat_unequal, y_unequal)
print(balanced.predict_proba(pred_mat))
print(unbalanced.predict_proba(pred_mat))
The model trained on the balanced data returns probabilities of 0.5 for all new data, whereas the model trained on the unbalanced data returns the kind of results one would typically expect. I understand that the training data used in this example is small, but with a difference of only one sample, I'm curious what mechanism is being changed to give such radically different models/probabilities.
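For anyone reproducing this, the hyperparameters each search selects and the hard class predictions can be inspected directly; best_params_, best_score_, and predict are standard GridSearchCV attributes/methods, and the exact values will vary with the scikit-learn version:
# Which kernel/C/gamma did each grid search settle on, and with what CV score?
print(balanced.best_params_, balanced.best_score_)
print(unbalanced.best_params_, unbalanced.best_score_)
# Hard class predictions for the same held-out points, for comparison with
# the predict_proba output above
print(balanced.predict(pred_mat))
print(unbalanced.predict(pred_mat))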
Update #1
After digging into this a bit more and considering Vivek's response below (thanks for the really great links!), understanding the difference between predict and predict_proba is half the battle. I could choose a scoring function for the grid search that optimizes the probabilities rather than the decision function (e.g. add scoring='neg_log_loss' to my GridSearchCV call; this is sketched at the end of this update). This would give more consistent results between the two models. However, I'm still curious about the outcome of the problem stated above. If you dig into the difference between the two models, the only two differences are the additional datum and the way the cross-validation generator (i.e. the StratifiedKFold splitter behind the cv argument) chooses to divvy up the data. For example, consider these stratified k-fold indices:
balanced_cv_iter = [(np.array([1, 2, 3, 5, 6, 7]), np.array([0, 4])),
(np.array([0, 2, 3, 4, 6, 7]), np.array([1, 5])),
(np.array([0, 1, 3, 4, 5, 7]), np.array([2, 6])),
(np.array([0, 1, 2, 4, 5, 6]), np.array([3, 7]))]
unbalanced_cv_iter = [(np.array([1, 2, 3, 6, 7, 8]), np.array([0, 4, 5])),
(np.array([0, 2, 3, 4, 5, 7, 8]), np.array([1, 6])),
(np.array([0, 1, 3, 4, 5, 6, 8]), np.array([2, 7])),
(np.array([0, 1, 2, 4, 5, 6, 7]), np.array([3, 8]))]
balanced_cv_iter_new = [(np.array([1, 2, 3, 5, 6]), np.array([0, 4, 7])),
(np.array([0, 2, 3, 4, 6, 7, 1]), np.array([5])),
(np.array([0, 1, 3, 4, 5, 7, 2]), np.array([6])),
(np.array([0, 1, 2, 4, 5, 6]), np.array([3, 7]))]
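For context on where such lists come from: with an integer cv and a classifier, GridSearchCV uses StratifiedKFold internally, and an explicit list of (train, test) index arrays can be passed back in through the same cv argument. A minimal sketch, reusing mat, ye, svc_params, and pred_mat from the question (balanced_splits is just a local name here, and the exact folds may differ across scikit-learn versions):
from sklearn.model_selection import StratifiedKFold

# Reproduce the splits GridSearchCV builds internally for the balanced data
balanced_splits = list(StratifiedKFold(n_splits=4).split(mat, ye))
# A list of (train, test) index arrays is itself a valid cv argument, so the
# altered splits discussed below (balanced_cv_iter_new) can be tested the same way
balanced_explicit = GridSearchCV(SVC(probability=True), svc_params,
                                 cv=balanced_splits).fit(mat, ye)
print(balanced_explicit.predict_proba(pred_mat))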
The balanced_cv_iter and unbalanced_cv_iter lists are two potential sets of (train, test) splits produced for the balanced and unbalanced training data by the original cv=4 grid searches. However, if we alter balanced_cv_iter so that some train/test folds have an odd number of elements (i.e. unbalanced train/test sets), we could get balanced_cv_iter_new. Fitting with those splits (passed in through cv as shown above) would result in predictions that are similar between the balanced and unbalanced models. I guess the lesson here is to optimize for the intended use of the model (i.e. choose a scoring function that aligns with how the model will be used)? However, if there are any additional thoughts/comments on why GridSearch chooses an SVM estimator with hyperparameters that lead to a better probabilistic model under the unbalanced framework, I would like to know.
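For reference, the neg_log_loss variant mentioned above is a one-argument change; a minimal sketch reusing the objects defined in the question (the resulting probabilities will depend on the scikit-learn version and on which hyperparameters win the search):
# Score candidate hyperparameters by negative log loss, which evaluates the
# probability estimates from predict_proba rather than the hard predictions
balanced_ll = GridSearchCV(SVC(probability=True), svc_params, cv=4,
                           scoring='neg_log_loss').fit(mat, ye)
unbalanced_ll = GridSearchCV(SVC(probability=True), svc_params, cv=4,
                             scoring='neg_log_loss').fit(mat_unequal, y_unequal)
print(balanced_ll.predict_proba(pred_mat))
print(unbalanced_ll.predict_proba(pred_mat))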