
I'm new to SVM and e1071. I found that the results are different every time I run the exact same code.

For example:

data(iris)
library(e1071)

# First run: train on the first 149 rows, predict the held-out last row
model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
pred <- predict(model, iris[150,-5], probability = TRUE)
result1 <- as.data.frame(attr(pred, "probabilities"))

# Second run: exactly the same code
model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
pred <- predict(model, iris[150,-5], probability = TRUE)
result2 <- as.data.frame(attr(pred, "probabilities"))

then I got result1 as:

         setosa versicolor virginica
150 0.009704854  0.1903696 0.7999255

and result2 as:

        setosa versicolor virginica
150 0.01006306  0.1749947 0.8149423

and the results keep changing every round.

Here I'm using the first 149 rows as the training set and the last row as the test set. The probabilities for each class in result1 and result2 are not exactly the same. I'm guessing there is some "random" process involved in producing the prediction. How is this happening?

I'm aware that the predicted probabilities can be fixed if I call set.seed() with the same number before each run. I'm not "aiming" for a fixed prediction result; I'm just curious why this happens and what steps are taken to generate the probability predictions.

The slight difference doesn't have a big impact on the iris data, since the last sample is still predicted as "virginica" either way. But when my data (with two classes A and B) is not that "good", and an unknown sample is predicted with probability 0.489 of being class A in one run and 0.521 in another, it becomes confusing.

Thanks!

Yan

1 Answer


SVM uses a cross-validation step in developing its probability estimates. The source code for that step (in svm.cpp, bundled with the e1071 package) starts with:

// Cross-validation decision values for probability estimates
static void svm_binary_svc_probability(
    const svm_problem *prob, const svm_parameter *param,
    double Cp, double Cn, double& probA, double& probB)
{
    int i;
    int nr_fold = 5;
    int *perm = Malloc(int,prob->l);
    double *dec_values = Malloc(double,prob->l);

    // random shuffle
    GetRNGstate();
    for(i=0;i<prob->l;i++) perm[i]=i;
    for(i=0;i<prob->l;i++)
    {
        int j = i+((int) (unif_rand() * (prob->l-i))) % (prob->l-i);
        swap(perm[i],perm[j]);
    }

You can create "predictability" by setting the random seed just before the call:

> data(iris)
> library(e1071)
> set.seed(123)
> model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
> pred <- predict(model, iris[150,-5], probability = TRUE)
> result1 <- as.data.frame(attr(pred, "probabilities"))
> set.seed(123)
> model <- svm(Species ~ ., data = iris[-150,], probability = TRUE)
> pred <- predict(model, iris[150,-5], probability = TRUE)
> result2 <- as.data.frame(attr(pred, "probabilities"))
> result1
         setosa versicolor virginica
150 0.009114718  0.1734126 0.8174727
> result2
         setosa versicolor virginica
150 0.009114718  0.1734126 0.8174727

But I am reminded of the epigram from Emerson: "A foolish consistency is the hobgoblin of little minds."

IRTFM
  • Thanks for your answer! Could you please explain in more detail how svm from e1071 generates the probabilities by cross-validation, such that a random process is involved? I haven't found what you posted in the source code, and I don't completely understand the code you posted. I know I can set the seed before the call to make it consistent. I wasn't going for a consistent result; I just want to understand how that happens. Thanks a lot!! – Yan May 23 '16 at 14:06
  • After getting the source version of the package in the zipped version whose link appears above you expand and untar it. There you should see a folder named 'src'. The code I extracted is in the file named svm.cpp. It should be easy to find with a text editor. I'm not a C++ programmer, either, and I think asking for a walkthrough of code neither one of us is likely to ever modify is too big an "ask" for an SO question. – IRTFM May 23 '16 at 16:29
  • Sorry if my question was inappropriate. But I wasn't asking for a walkthrough of the code. I was just wondering if you could give me a one or two sentences explanation of "SVM uses a cross-validation step in developing the estimates of probabilities", i.e. how this cross-validation step is done to develop the probabilities here. I'm not a programmer and new to coding, it's hard for me to find the answer by myself through reading the raw code. I would appreciate if you could explain that without reading the code. But if it takes up your time, I'm sorry. Thanks! – Yan May 23 '16 at 18:46
  • Maybe reading this FAQ for libSVM will help: http://www.csie.ntu.edu.tw/~cjlin/libsvm/faq.html#/Q06:_Probability_outputs – IRTFM May 23 '16 at 21:21
  • Thanks for providing the FAQ link! It does give me more information about SVM that I wasn't aware of. It explains that it takes longer to generate a probability than a class label because a cross-validation step is included to generate the probability. I can understand that if cross-validation is involved, it will take longer. But how is it involved? I still don't know. Sorry. I tried reading the raw code svm.R from the e1071 package, but I haven't figured it out. I don't want to bother you to walk through that, but if someone knows the answer, I'd appreciate your help. Thanks. – Yan May 24 '16 at 01:43
  • @IRTFM I found a similar issue after extracting the predicted probabilities for an SVM classifier using the caret package: https://stackoverflow.com/questions/63749263/different-results-for-svm-in-r . I think the caret package uses the same e1071 package for SVM. Is it possible to help with this? Thank you – student_R123 Sep 13 '20 at 19:34