
I ran into the following issue when trying to extract predicted probabilities from a support vector machine (SVM). Usually the probability cutoff for a classification algorithm is 0.5, but I need to analyze how the accuracy changes with the probability cutoff for an SVM.

I used the caret package in R with leave-one-out cross-validation (LOOCV).

First I fitted a regular SVM model without extracting the class probabilities, so it only stores the predicted class labels.

Data source: https://www.kaggle.com/uciml/pima-indians-diabetes-database

require(caret)
set.seed(123)
diabetes <- read.csv("C:/Users/Downloads/228_482_bundle_archive/diabetes.csv")
diabetes$Outcome <- factor(diabetes$Outcome)
fitControl1 <- trainControl(method = "LOOCV", savePredictions = TRUE, search = "random")
modelFitlassocvintm1 <- train(Outcome ~ Pregnancies + BloodPressure + Glucose +
                                BMI + DiabetesPedigreeFunction + Age,
                              data = diabetes,
                              method = "svmRadialSigma",
                              trControl = fitControl1,
                              preProcess = c("center", "scale"),
                              tuneGrid = expand.grid(
                                .sigma = 0.004930389,
                                .C = 9.63979626))
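The held-out predictions are stored because of savePredictions = TRUE; a quick sanity check on what caret saved (column names as caret stores them) looks like this:

# Inspect the stored LOOCV predictions
head(modelFitlassocvintm1$pred[, c("pred", "obs", "rowIndex")])

# Overall LOOCV accuracy from the saved predictions
mean(modelFitlassocvintm1$pred$pred == modelFitlassocvintm1$pred$obs)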

To extract the predicted probabilities, I need to specify classProbs = TRUE inside trainControl().

set.seed(123)
fitControl2 <- trainControl(method = "LOOCV", savePredictions = TRUE, classProbs = TRUE)
diabetes$Outcome <- factor(diabetes$Outcome)
# make.names() is needed because classProbs requires factor levels that are
# valid R variable names (the levels 0/1 become X0/X1)
modelFitlassocvintm2 <- train(make.names(Outcome) ~ Pregnancies + BloodPressure + Glucose +
                                BMI + DiabetesPedigreeFunction + Age,
                              data = diabetes,
                              method = "svmRadialSigma",
                              trControl = fitControl2,
                              preProcess = c("center", "scale"),
                              tuneGrid = expand.grid(
                                .sigma = 0.004930389,
                                .C = 9.63979626))
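With classProbs = TRUE the saved predictions now also contain one probability column per class; because make.names() turned the levels 0/1 into X0/X1, the probabilities live in those columns:

head(modelFitlassocvintm2$pred[, c("pred", "obs", "X0", "X1")])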

The only difference between modelFitlassocvintm1 and modelFitlassocvintm2 is the inclusion of classProbs = TRUE inside trainControl (and the make.names() wrapper that it requires around the outcome).

If I compare the predicted classes of modelFitlassocvintm1 and modelFitlassocvintm2, they should be the same under a 0.5 probability cutoff. But that is not the case.

table(modelFitlassocvintm2$pred$X1 > 0.5, modelFitlassocvintm1$pred$pred)
       
          0   1
  FALSE 560   0
  TRUE    8 200

When I further investigated the 8 cases that differ, I got the following results.

subs1 <- cbind(modelFitlassocvintm2$pred$X1, modelFitlassocvintm2$pred$pred, modelFitlassocvintm1$pred$pred)
subset(subs1, subs1[, 2] != subs1[, 3])
          [,1] [,2] [,3]
[1,] 0.5078631    2    1
[2,] 0.5056252    2    1
[3,] 0.5113336    2    1
[4,] 0.5048708    2    1
[5,] 0.5033003    2    1
[6,] 0.5014327    2    1
[7,] 0.5111975    2    1
[8,] 0.5136453    2    1

It seems that when the predicted probability is close to 0.5, there is a discrepancy between the predicted classes of modelFitlassocvintm1 and modelFitlassocvintm2. I saw a similar discrepancy for an SVM on a different data set as well.
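For context, the cutoff analysis I am after looks roughly like this (a sketch based on the saved LOOCV probabilities in modelFitlassocvintm2; obs holds the true labels X0/X1):

# Accuracy as a function of the probability cutoff,
# computed from the saved LOOCV probabilities
cutoffs <- seq(0.1, 0.9, by = 0.05)
acc <- sapply(cutoffs, function(k) {
  pred_k <- ifelse(modelFitlassocvintm2$pred$X1 > k, "X1", "X0")
  mean(pred_k == modelFitlassocvintm2$pred$obs)
})
plot(cutoffs, acc, type = "b", xlab = "probability cutoff", ylab = "accuracy")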

What may be the reason for this? Can't we trust the predicted probabilities from an SVM? Usually, an SVM classifies a subject as -1 or 1, depending on which side of the hyperplane it lies on. So is it not a good idea to rely on the predicted probabilities from an SVM?

student_R123
  • SVMs are not probabilistic classifiers; they do not actually produce probabilities. – desertnaut Sep 05 '20 at 09:36
  • @desertnaut So what about the ROC curves generated from an SVM? Can we trust them? Some well-known machine learning books, such as ISLR, include ROC curves generated from SVMs. – student_R123 Sep 05 '20 at 15:12
  • Not quite sure about that, I'd have to check. – desertnaut Sep 05 '20 at 16:30
  • I am not sure, but I think your argument applies there as well. An ROC curve is a plot of sensitivity vs. 1 - specificity for varying cutoffs; what I am exploring is accuracy for varying cutoffs. – student_R123 Sep 05 '20 at 18:12

1 Answer


As noted in the comments by desertnaut, SVMs are not probabilistic classifiers; they do not actually produce probabilities.

One method to create probabilities is to directly train a kernel classifier with a logit link function and a regularized maximum likelihood score. However, training with a maximum likelihood score produces non-sparse kernel machines. Instead, after training an SVM, the parameters of an additional sigmoid function are trained to map the SVM outputs into probabilities; this is known as Platt scaling. Reference paper: Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods.
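Conceptually, Platt's method fits a two-parameter sigmoid P(y = 1 | f) = 1 / (1 + exp(A*f + B)) to the decision values f. A minimal illustration of the idea, using the spam data set shipped with kernlab (a simplified sketch, not kernlab's internal implementation, which fits the sigmoid on 3-fold cross-validated decision values with a regularized target):

library(kernlab)

data(spam)                                           # example data in kernlab
m <- ksvm(type ~ ., data = spam, kernel = "rbfdot", prob.model = FALSE)
f <- as.vector(predict(m, spam, type = "decision"))  # raw SVM decision values
y <- as.numeric(spam$type == "spam")                 # 0/1 labels

# Platt scaling: logistic regression of the labels on the decision values
platt <- glm(y ~ f, family = binomial)
prob  <- predict(platt, type = "response")           # sigmoid-mapped probabilities
head(prob)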

Caret's method = "svmRadialSigma" internally uses kernlab::ksvm with the argument kernel = "rbfdot". For this function to create probabilities, the argument prob.model = TRUE is needed. From the help page of this function:

prob.model if set to TRUE builds a model for calculating class probabilities or in case of regression, calculates the scaling parameter of the Laplacian distribution fitted on the residuals. Fitting is done on output data created by performing a 3-fold cross-validation on the training data. For details see references. (default: FALSE)

The referenced details:

In classification when prob.model is TRUE a 3-fold cross validation is performed on the data and a sigmoid function is fitted on the resulting decision values f.

So something quite specific happens for classification models when posterior probabilities are requested: an additional 3-fold cross-validation and sigmoid fit, which is different from just outputting decision values.

From this it follows that, depending on the sigmoid fit, some of the predicted classes (those derived from the fitted probabilities) can differ from the ones obtained when running kernlab::ksvm without a probability model (prob.model = FALSE), and this is what you are observing in the posted example.
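The effect can be reproduced outside caret by calling kernlab::ksvm directly. A sketch, again using the spam data that ships with kernlab; any disagreements between the decision-value labels and the sigmoid-based probabilities should cluster around 0.5:

library(kernlab)

data(spam)
set.seed(123)
m <- ksvm(type ~ ., data = spam, kernel = "rbfdot",
          prob.model = TRUE)                       # adds the 3-fold CV sigmoid fit

cls  <- predict(m, spam, type = "response")        # labels from decision values
prob <- predict(m, spam, type = "probabilities")   # probabilities from the sigmoid

# Rows where the sigmoid-implied label (prob > 0.5) contradicts the
# decision-value label; their probabilities sit close to 0.5
disagree <- (prob[, "spam"] > 0.5) != (cls == "spam")
summary(prob[disagree, "spam"])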

Things get even more complicated when there are more than two classes, since the pairwise sigmoid outputs then have to be combined (coupled) into a single probability per class.
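For example, with the three classes in iris, kernlab couples the pairwise probabilities into one column per class (a sketch using the built-in iris data):

library(kernlab)

data(iris)
m3 <- ksvm(Species ~ ., data = iris, kernel = "rbfdot", prob.model = TRUE)
head(predict(m3, iris, type = "probabilities"))    # one probability column per class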

Further reading:

Including class probabilities might skew a model in caret?

Isn't caret SVM classification wrong when class probabilities are included?

Why are probabilities and response in ksvm in R not consistent?

[R] Inconsistent results between caret+kernlab versions

missuse