When using sklearn's GridSearchCV over SVC(probability=True), radically different predictions/models are returned when the training data is small and balanced versus small and unbalanced. Consider this example:
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC
from sklearn import datasets
import numpy as np  # used for the cross-validation index arrays shown later
iris = datasets.load_iris()
# Take the first two features. We could avoid this by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
# Balanced training set: four samples each from class 0 and class 1
index = [0, 1, 2, 3, 51, 52, 53, 54]
# Unbalanced training set: one extra class-1 sample (index 55)
index_unequal = [0, 1, 2, 3, 51, 52, 53, 54, 55]
# Held-out points used only for prediction below
new_predictions = [5, 6, 7, 56, 57, 58]
pred_mat, pred_y = X[new_predictions], y[new_predictions]
c_s = [0.01, 0.1, 1.0, 10.0, 100.0]
gamma = [1e-4, 1e-3, 1e-2, 1e-1, 1, 10]
svc_params = [{'kernel': ['rbf'], 'gamma': gamma, 'C': c_s},
              {'kernel': ['linear'], 'C': c_s}]
mat, ye = X[index], y[index]
mat_unequal, y_unequal = X[index_unequal], y[index_unequal]
balanced = GridSearchCV(SVC(probability=True), svc_params, cv=4).fit(mat, ye)
unbalanced = GridSearchCV(SVC(probability=True), svc_params, cv=4).fit(mat_unequal, y_unequal)
print(balanced.predict_proba(pred_mat))
print(unbalanced.predict_proba(pred_mat))
The model trained on the balanced data returns probabilities of 0.5 for all new data, whereas the model trained on the unbalanced data returns the kind of results one would typically expect. I understand that the training data used in this example is small, but with a difference of only one sample, I'm curious what mechanism is being changed to give such radically different models/probabilities.
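For anyone reproducing this, the hyperparameters each search selects and the hard class predictions can be inspected directly; best_params_, best_score_, and predict are standard GridSearchCV attributes/methods, and the exact values will vary with the scikit-learn version:
# Which kernel/C/gamma did each grid search settle on, and with what CV score?
print(balanced.best_params_, balanced.best_score_)
print(unbalanced.best_params_, unbalanced.best_score_)
# Hard class predictions for the same held-out points, for comparison with
# the predict_proba output above
print(balanced.predict(pred_mat))
print(unbalanced.predict(pred_mat))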
Update #1
After digging into this a bit more and considering Vivek's response below (thanks for the really great links!), understanding the difference between predict and predict_proba is half the battle. I could choose a scoring function for the grid search that optimizes the probabilities rather than the decision function (e.g. add scoring='neg_log_loss' to my GridSearchCV call; this is sketched at the end of this update). This would give more consistent results between the two models. However, I'm still curious about the outcome of the problem stated above. If you dig into the difference between the two models, the only two differences are the additional datum and the way the cross-validation generator (i.e. the StratifiedKFold splitter behind the cv argument) chooses to divvy up the data. For example, consider these stratified k-fold indices:
balanced_cv_iter = [(np.array([1, 2, 3, 5, 6, 7]), np.array([0, 4])),
(np.array([0, 2, 3, 4, 6, 7]), np.array([1, 5])),
(np.array([0, 1, 3, 4, 5, 7]), np.array([2, 6])),
(np.array([0, 1, 2, 4, 5, 6]), np.array([3, 7]))]
unbalanced_cv_iter = [(np.array([1, 2, 3, 6, 7, 8]), np.array([0, 4, 5])),
(np.array([0, 2, 3, 4, 5, 7, 8]), np.array([1, 6])),
(np.array([0, 1, 3, 4, 5, 6, 8]), np.array([2, 7])),
(np.array([0, 1, 2, 4, 5, 6, 7]), np.array([3, 8]))]
balanced_cv_iter_new = [(np.array([1, 2, 3, 5, 6]), np.array([0, 4, 7])),
(np.array([0, 2, 3, 4, 6, 7, 1]), np.array([5])),
(np.array([0, 1, 3, 4, 5, 7, 2]), np.array([6])),
(np.array([0, 1, 2, 4, 5, 6]), np.array([3, 7]))]
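For context on where such lists come from: with an integer cv and a classifier, GridSearchCV uses StratifiedKFold internally, and an explicit list of (train, test) index arrays can be passed back in through the same cv argument. A minimal sketch, reusing mat, ye, svc_params, and pred_mat from the question (balanced_splits is just a local name here, and the exact folds may differ across scikit-learn versions):
from sklearn.model_selection import StratifiedKFold

# Reproduce the splits GridSearchCV builds internally for the balanced data
balanced_splits = list(StratifiedKFold(n_splits=4).split(mat, ye))
# A list of (train, test) index arrays is itself a valid cv argument, so the
# altered splits discussed below (balanced_cv_iter_new) can be tested the same way
balanced_explicit = GridSearchCV(SVC(probability=True), svc_params,
                                 cv=balanced_splits).fit(mat, ye)
print(balanced_explicit.predict_proba(pred_mat))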
The balanced_cv_iter and unbalanced_cv_iter lists are two potential sets of (train, test) splits produced for the balanced and unbalanced training data by the original cv=4 grid searches. However, if we alter balanced_cv_iter so that some train/test folds have an odd number of elements (i.e. unbalanced train/test sets), we could get balanced_cv_iter_new. Fitting with those splits (passed in through cv as shown above) would result in predictions that are similar between the balanced and unbalanced models. I guess the lesson here is to optimize for the intended use of the model (i.e. choose a scoring function that aligns with how the model will be used)? However, if there are any additional thoughts/comments on why GridSearch chooses an SVM estimator with hyperparameters that lead to a better probabilistic model under the unbalanced framework, I would like to know.
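For reference, the neg_log_loss variant mentioned above is a one-argument change; a minimal sketch reusing the objects defined in the question (the resulting probabilities will depend on the scikit-learn version and on which hyperparameters win the search):
# Score candidate hyperparameters by negative log loss, which evaluates the
# probability estimates from predict_proba rather than the hard predictions
balanced_ll = GridSearchCV(SVC(probability=True), svc_params, cv=4,
                           scoring='neg_log_loss').fit(mat, ye)
unbalanced_ll = GridSearchCV(SVC(probability=True), svc_params, cv=4,
                             scoring='neg_log_loss').fit(mat_unequal, y_unequal)
print(balanced_ll.predict_proba(pred_mat))
print(unbalanced_ll.predict_proba(pred_mat))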