
I am using ksvm from the kernlab package in R to predict probabilities, using the type="probabilities" option of predict.ksvm. However, I find that predict(model,observation,type="r") sometimes does not yield the class with the highest probability according to predict(model,observation,type="p").

Example:

> predict(model,observation,type="r")
[1] A
Levels: A B
> predict(model,observation,type="p")
        A    B
[1,] 0.21 0.79

Is this correct behavior, or a bug? If it is correct behavior, how can I estimate the most likely class from the probabilities?
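
If simply picking the column with the largest probability is the right approach, I suppose something like the following would do it (just a sketch, not tested against the real model):

# Sketch only: derive the most likely class directly from the probability
# matrix returned by predict(..., type = "probabilities").
probs <- predict(model, observation, type = "probabilities")
best  <- colnames(probs)[max.col(probs, ties.method = "first")]
best  <- factor(best, levels = colnames(probs))  # same levels as the response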


Attempt at reproducible example:

library(kernlab)
set.seed(1000)
# Generate fake data
n <- 1000
x <- rnorm(n)
p <- 1 / (1 + exp(-10*x))
y <- factor(rbinom(n, 1, p))
dat <- data.frame(x, y)
tmp <- split(dat, dat$y)
# Create unequal sizes in the groups (helps illustrate the problem)
newdat <- rbind(tmp[[1]][1:100,], tmp[[2]][1:10,])
# Fit the model using radial kernel (default)
out <- ksvm(y ~ x, data = newdat, prob.model = T)
# Create some testing points near the boundary
testdat <- data.frame(x = seq(.09, .12, .01))
# Get predictions using both methods
responsepreds <- predict(out, newdata = testdat, type = "r")
probpreds <- predict(out, testdat, type = "p")

results <- data.frame(x = testdat, 
                      response = responsepreds, 
                      P.x.0 = probpreds[,1], 
                      P.x.1 = probpreds[,2])

Output of results:

> results
     x response     P.x.0     P.x.1
1 0.09        0 0.7199018 0.2800982
2 0.10        0 0.6988079 0.3011921
3 0.11        1 0.6824685 0.3175315
4 0.12        1 0.6717304 0.3282696
  • It would help us a lot if you gave us some sample data and a sample fitted model. It would make it easier to explain this behaviour. In short, however, it is probably correct behaviour. – nograpes Mar 19 '13 at 15:10
  • Probably no one is likely to agree this is a bug unless you provide reproducible code that generates this behavior. Otherwise, the Bayesians among us will simply say that their prior of "You made a mistake" is probably correct. – joran Mar 19 '13 at 15:11
  • My question is primarily whether this should at all be possible, but I understand that a reproducible example would be helpful. I will try to train a sample model with this behavior, but this occurred with a model that is much too large for SO, unfortunately. I will let you know. – roelandvanbeek Mar 19 '13 at 15:25
  • @joran That behavior is reproducible. I'll add some code for reproducibility. roelandvanbeek - If the code I add doesn't illustrate what you're talking about feel free to remove it. – Dason Mar 22 '13 at 20:27
  • It appears that this behavior is exacerbated by unequal sample sizes in the data used to fit the model. – Dason Mar 22 '13 at 20:35
  • @Dason Interesting. I would vote for either a bug, or we aren't understanding some of the details of the methodology. What happens if you use the `class.weights` argument to match the unbalanced class weights? – joran Mar 22 '13 at 20:38
  • Thank you for your help, @Dason, this behavior is exactly what I meant. – roelandvanbeek Mar 25 '13 at 07:54
  • @joran Different `class.weights` move the boundaries, but the problem persists. – roelandvanbeek Mar 25 '13 at 08:22

1 Answer


If you look at the decision values and votes, they seem to be more in line with the responses:

> predict(out, newdata = testdat, type = "response")
[1] 0 0 1 1
Levels: 0 1
> predict(out, newdata = testdat, type = "decision")
            [,1]
[1,] -0.07077917
[2,] -0.01762016
[3,]  0.02210974
[4,]  0.04762563
> predict(out, newdata = testdat, type = "votes")
     [,1] [,2] [,3] [,4]
[1,]    1    1    0    0
[2,]    0    0    1    1
> predict(out, newdata = testdat, type = "prob")
             0         1
[1,] 0.7198132 0.2801868
[2,] 0.6987129 0.3012871
[3,] 0.6823679 0.3176321
[4,] 0.6716249 0.3283751

The kernlab help pages (?predict.ksvm) link to the paper "Probability Estimates for Multi-class Classification by Pairwise Coupling" by T.-F. Wu, C.-J. Lin, and R. C. Weng.

Section 7.3 explains that the decision-value-based and probability-based predictions can differ:

...We explain why the results by probability-based and decision-value-based methods can be so distinct. For some problems, the parameters selected by δDV are quite different from those by the other five rules. In waveform, at some parameters all probability-based methods gives much higher cross validation accuracy than δDV . We observe, for example, the decision values of validation sets are in [0.73, 0.97] and [0.93, 1.02] for data in two classes; hence, all data in the validation sets are classified as in one class and the error is high. On the contrary, the probability-based methods fit the decision values by a sigmoid function, which can better separate the two classes by cutting at a decision value around 0.95. This observation shed some light on the difference between probability-based and decision-value based methods...

I'm not familiar enough with these methods to understand the issue in depth, but it looks like there are distinct methods for probability-based and decision-value-based prediction, and type = "response" corresponds to a different method than the one used to compute the probabilities.
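
As a rough illustration of that idea (my own sketch, not kernlab's actual implementation), you can fit a Platt-style sigmoid to the decision values yourself and look at where the fitted probability crosses 0.5; if that point is not at a decision value of 0, predictions based on probabilities and on the sign of the decision value will disagree near the boundary:

# Sketch (assumption, not kernlab's internal code): Platt-style scaling fits a
# sigmoid to the decision values, so the 0.5 probability threshold need not sit
# at a decision value of 0, while type = "response" effectively thresholds at 0.
dec   <- predict(out, newdata = newdat, type = "decision")[, 1]
lab   <- as.numeric(newdat$y) - 1                # 0/1 labels
platt <- glm(lab ~ dec, family = binomial)       # logistic fit to decision values
# decision value at which the fitted probability crosses 0.5:
unname(-coef(platt)[1] / coef(platt)[2])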

Jouni Helske
  • Nice detective work! The link sends me to a directory with lots of pdfs in it. Can you be more specific about which one is the actual paper you found? – joran Mar 28 '13 at 17:00
  • Oops, the link was cut off when copy-pasting, I now corrected it to point to straight to the actual paper. – Jouni Helske Mar 28 '13 at 17:07