
I am using `sklearn.svm.SVC` from scikit-learn for binary classification, and I am using its `predict_proba()` method to get probability estimates. Can anyone tell me how `predict_proba()` calculates these probabilities internally?

unthought
user2115183

2 Answers


Scikit-learn uses LibSVM internally, and this in turn uses Platt scaling, as detailed in this note by the LibSVM authors, to calibrate the SVM to produce probabilities in addition to class predictions.

Platt scaling requires first training the SVM as usual, then optimizing scalar parameters A and B such that

P(y|X) = 1 / (1 + exp(A * f(X) + B))

where f(X) is the signed distance of a sample from the hyperplane (scikit-learn's decision_function method). You may recognize the logistic sigmoid in this definition, the same function that logistic regression and neural nets use for turning decision functions into probability estimates.
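To make this concrete, here is a minimal sketch on synthetic data (assuming a binary problem) that applies the Platt formula by hand, using the `probA_` and `probB_` attributes that a probability-enabled `SVC` exposes. One caveat: LibSVM's internal decision value can be flipped in sign relative to `decision_function`, depending on the class order it saw, so the manual reconstruction may match `1 - predict_proba` instead.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=200, random_state=0)
    clf = SVC(probability=True, random_state=0).fit(X, y)

    f = clf.decision_function(X)           # signed distance f(X)
    A, B = clf.probA_[0], clf.probB_[0]    # fitted Platt parameters
    p_manual = 1.0 / (1.0 + np.exp(A * f + B))
    p_sklearn = clf.predict_proba(X)[:, 1]

    # Small if the sign conventions line up; otherwise compare 1 - p_manual.
    print(np.abs(p_manual - p_sklearn).max())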

Mind you: the B parameter, the "intercept" or "bias" or whatever you like to call it, can cause predictions based on probability estimates from this model to be inconsistent with the ones you get from the SVM decision function f. E.g. suppose that f(X) = 10, then the prediction for X is positive; but if B = -9.9 and A = 1, then P(y|X) = .475. I'm pulling these numbers out of thin air, but you've noticed that this can occur in practice.
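You can check how often this happens on your own data. The sketch below (synthetic data with some label noise, so disagreements are likely but not guaranteed) compares the labels from `predict` with the argmax of `predict_proba`:

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=500, flip_y=0.2, random_state=0)
    clf = SVC(probability=True, random_state=0).fit(X, y)

    labels_predict = clf.predict(X)  # follows the hyperplane side
    labels_proba = clf.classes_[clf.predict_proba(X).argmax(axis=1)]  # Platt model
    print((labels_predict != labels_proba).sum(), "samples labelled differently")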

Effectively, Platt scaling trains a probability model on top of the SVM's outputs under a cross-entropy loss function. To prevent this model from overfitting, it uses internal five-fold cross-validation, meaning that training an SVM with probability=True can be quite a lot more expensive than a vanilla, non-probabilistic SVM.
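As a rough illustration of that extra cost (absolute times will vary with your machine and data), here is a small timing sketch on synthetic data:

    import time
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=2000, random_state=0)

    for prob in (False, True):
        start = time.perf_counter()
        # probability=True adds the internal five-fold Platt CV
        SVC(probability=prob, random_state=0).fit(X, y)
        print(f"probability={prob}: {time.perf_counter() - start:.2f}s")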

Fred Foo
  • Great answer @larsmans. I'm just wondering whether the probabilities can be interpreted as a confidence measure for the classification decisions. E.g., do very close probabilities for the positive and negative classes of a sample mean the learner is less sure about its classification? – Moses Xu Feb 28 '13 at 01:20
  • Thanks @larsmans. I've actually observed much more dramatic cases: predictions of 1, but with probability 0.45. I thought the Bayes-optimal cutoff used was precisely 0.5. Do you reckon such dramatic cases can still be explained by numerical instability in LibSVM? – Moses Xu Mar 09 '13 at 08:07
  • @MosesXu: this is something worth investigating, but I don't have the time to dig into the LibSVM code at the moment. It seems to be inconsistent behavior at first sight, but I think `predict` does not actually use the probabilities, but rather the SVM hyperplane. – Fred Foo Mar 09 '13 at 13:46
  • @MosesXu: I stared at the math a little longer and I realized that with an appropriate value of `B`, you can get predictions that are really different from the ones you get from the SVM `predict` and `decision_function` methods. I fear that when you use Platt scaling, you'll have to commit yourself to either believing `predict` or believing `predict_proba`, as the two may be inconsistent. – Fred Foo Mar 09 '13 at 13:50
  • @larsmans: it is somewhat surprising that the `predict` function always sticks to the hyperplane regardless of the probability parameter. Is this because the learned hyperplane always represents minimum structural risk, while the fitted logistic regression, though fitted using n-fold cross-validation, is still prone to overfitting? – Moses Xu Mar 12 '13 at 03:42
  • @MosesXu: I have no rationale for this behavior except that it is what LibSVM does, and scikit-learn tries to stay compatible with that. A possible reason might be, though, that `probability=True` does not affect the outcome of `decision_function`, so there's going to be an inconsistency either way. (The more I think about this, the more I become convinced that Platt scaling is just a hack and RVMs should be used instead of SVMs for probability estimates.) – Fred Foo Mar 12 '13 at 13:12
  • @AndreasMueller: [it's already there](https://github.com/scikit-learn/scikit-learn/commit/5eb035c7e7a7dc2dfe9f6372ab428f8206bf0583), in the dev version. – Fred Foo May 07 '13 at 15:20
  • @FredFoo One question: does `predict_proba` always give the same output probabilities for a given test set? I was debugging my code for almost two days and finally observed that `predict_proba` does NOT always give the same output. Why would this happen? – RockTheStar Jan 06 '17 at 02:31
  • @RockTheStar I think this might be because it's using 5-fold CV to estimate the probability. If you randomize the sample order before running the SVM, it will produce slightly different results each time unless you set the seed. – user2259664 Oct 22 '18 at 04:19
  • For future reference regarding @FredFoo's remark about unrepeatable results from `predict_proba`: @RockTheStar is correct, but you can make your results repeatable by setting `random_state=0` in the SVC constructor (see https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html). Note that this has the potential to paper over real issues with the model or training set, so proceed with caution. – sg_man Feb 05 '20 at 21:47
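A minimal sketch of the reproducibility point from the last few comments, on synthetic data: with `probability=True`, the internal Platt cross-validation shuffles the data, and `random_state` pins that shuffling down, so two fits with the same seed should produce identical probability estimates.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.svm import SVC

    X, y = make_classification(n_samples=300, random_state=0)

    # Same seed for the internal Platt CV shuffle -> identical estimates.
    p1 = SVC(probability=True, random_state=0).fit(X, y).predict_proba(X)
    p2 = SVC(probability=True, random_state=0).fit(X, y).predict_proba(X)
    print(np.array_equal(p1, p2))  # expected: True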

Actually, I found a slightly different answer: this is the code LibSVM uses to convert a decision value into a probability:

    // Numerically stable evaluation of the sigmoid 1 / (1 + exp(fApB)):
    // the branch ensures exp() is only called on non-positive arguments.
    double fApB = decision_value * A + B;
    if (fApB >= 0)
        return Math.exp(-fApB) / (1.0 + Math.exp(-fApB));
    else
        return 1.0 / (1.0 + Math.exp(fApB));

Here the A and B values can be found in the model file (`probA` and `probB`). This also offers a way to convert a probability back into a decision value, and thus into a hinge loss.

When inverting, use ln(0) = -200, i.e. clamp the logarithm, so that probabilities of exactly 0 or 1 do not produce infinities.
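For illustration, here is a small Python sketch (mine, not LibSVM code) of that inverse mapping, clamping the logarithm at -200 as suggested:

    import math

    def safe_log(x):
        # Treat ln(0) as -200, as suggested above, to avoid infinities.
        return math.log(x) if x > 0 else -200.0

    def prob_to_decision(p, A, B):
        # Invert p = 1 / (1 + exp(A*f + B)) for the decision value f:
        # ln((1 - p) / p) = A*f + B  =>  f = (ln(1-p) - ln(p) - B) / A
        return (safe_log(1.0 - p) - safe_log(p) - B) / A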

user1165814