I want to perform probabilistic binary classification (0, 1). My dataset is imbalanced, so I use SVC with a class weight assigned to each class.

After fitting SVC to the test dataset, I use predict_proba to get the class probabilities. However, SVC labels training examples as 1 whenever their predicted probability of class 1 is higher than 0.4.

I think the default threshold for predict_proba is 0.5.

When class_weight is used, does the default threshold change automatically?

Example:

[0.58497606, 0.41502394] >> The predicted label for this predict_proba result is 1.

  • Be aware anyway that in binary classification problems (not necessarily imbalanced) [this](https://stackoverflow.com/questions/68475534/svm-model-predicts-instances-with-probability-scores-greater-than-0-1default-th/70049005#70049005) may happen with `SVC()` and, in general, with non-probabilistic classifiers. – amiola Dec 21 '21 at 13:59
  • Please show, do not tell: post a minimal reproducible example. – desertnaut Dec 21 '21 at 14:03
  • As added in the answer below, the doc warns about possibly inconsistent results: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html. – Malo Dec 22 '21 at 09:18

1 Answer

The probabilities matrix:

The first column is the probability of being class 0, and the second column is the probability of being class 1. These are the probabilities before any threshold is applied.
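To check which column belongs to which class, the fitted estimator's `classes_` attribute can be inspected. The following is a minimal sketch; the synthetic dataset from `make_classification` is an assumption standing in for the real data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption: a stand-in for the real dataset).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)

clf = SVC(probability=True, class_weight="balanced", random_state=0).fit(X, y)
proba = clf.predict_proba(X)

# Column i of proba corresponds to clf.classes_[i].
print(clf.classes_)
print(proba[:3])
```

Each row of `proba` sums to 1, and the columns follow the sorted class labels.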

class_weight

It is used when the classes in your training data are imbalanced. For example, when you have 100 samples of class 0 and 10 of class 1, this imbalance can be taken into account with the class_weight='balanced' parameter.
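As a rough illustration of what 'balanced' means for the 100 vs. 10 example, scikit-learn's `compute_class_weight` helper returns weights of the form `n_samples / (n_classes * count_per_class)`. The label array below is made up to match those counts.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels matching the example: 100 samples of class 0, 10 of class 1.
y = np.array([0] * 100 + [1] * 10)

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=y)

# 110 / (2 * 100) = 0.55 for class 0, 110 / (2 * 10) = 5.5 for class 1.
print(dict(zip([0, 1], weights)))
```

The minority class gets a weight ten times larger, so its errors are penalized more heavily during fitting.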

Threshold

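The nominal threshold is 0.5 by default, but you can apply your own to the probability matrix you get from predict_proba. You have to do this yourself, as SVC has no parameter for changing the threshold. Note that predict does not threshold predict_proba at all: it uses the sign of the decision function.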
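A minimal sketch of applying a custom threshold to the class 1 column follows. The probability values and the 0.4 threshold are illustrative assumptions; in practice the threshold would be tuned on validation data.

```python
import numpy as np

# Hypothetical predict_proba output; columns are [P(class 0), P(class 1)].
proba = np.array([
    [0.58497606, 0.41502394],
    [0.70, 0.30],
    [0.20, 0.80],
])

threshold = 0.4  # assumed value, not a library default

# Label a sample as 1 when its class 1 probability reaches the threshold.
labels = (proba[:, 1] >= threshold).astype(int)
print(labels)  # [1 0 1]
```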

Inconsistencies between predict and predict_proba

The doc https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVC.html says that the results of the two methods can be inconsistent because of the way they are implemented:

"Whether to enable probability estimates. This must be enabled prior to calling fit, will slow down that method as it internally uses 5-fold cross-validation, and predict_proba may be inconsistent with predict. Read more in the User Guide."
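The sketch below compares `predict` with both the sign of `decision_function` and the argmax of `predict_proba` for a binary SVC. The synthetic data is an assumption, and the number of disagreeing rows depends on the data.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic imbalanced data (assumption: a stand-in for the real dataset).
X, y = make_classification(
    n_samples=200, n_features=5, weights=[0.9, 0.1], random_state=0
)
clf = SVC(probability=True, class_weight="balanced", random_state=0).fit(X, y)

pred = clf.predict(X)

# In the binary case a positive decision value corresponds to clf.classes_[1].
from_decision = clf.classes_[(clf.decision_function(X) > 0).astype(int)]

# Labels implied by picking the most probable class from predict_proba.
from_proba = clf.classes_[clf.predict_proba(X).argmax(axis=1)]

print("predict matches decision_function sign:", np.array_equal(pred, from_decision))
print("rows where predict differs from predict_proba:", int((pred != from_proba).sum()))
```

If the second count is non-zero, those rows are cases where the calibrated probabilities and the decision function disagree, as the doc warns.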

Malo
  • I tried class_weight as "balanced". However, this affected my metrics badly. What I wonder is why the label is 1 when the probability is lower than 0.5, even though the default threshold has not changed. – Bengu Atici Dec 21 '21 at 13:53
  • Maybe the order of the labels, [0, 1] or [1, 0], is taken into account in the final probability matrix. Moreover, the doc states that "predict_proba may be inconsistent with predict." I'll add this to the answer. – Malo Dec 22 '21 at 09:13