
I'm using the mlr package's framework to build an SVM model to predict land-cover classes in an image. I used the raster package's predict function, and I also converted the raster to a data frame and then predicted on that data frame using the "learner.model" as input. Both methods gave me realistic results.

These work well:

> predict(raster, mod$learner.model)

or

> xy <- as.data.frame(raster, xy = T)

> C <- predict(mod$learner.model, xy)

However, if I predict on the dataframe derived from the raster without specifying the learner.model, the results are not the same.

> C2 <- predict(mod, newdata=xy)

C2$data$response is not identical to C. Why?


Here is a reproducible example that demonstrates the problem:

> library(mlr)
> library(kernlab)
> x1 <- rnorm(50)
> x2 <- rnorm(50, 3)
> x3 <- rnorm(50, -20, 3)
> C <- sample(c("a","b","c"), 50, T)
> d <- data.frame(x1, x2, x3, C)
> classif <- makeClassifTask(id = "example", data = d, target = "C")
> lrn <- makeLearner("classif.ksvm", predict.type = "prob", fix.factors.prediction = T)
> t <- train(lrn, classif)

 Using automatic sigma estimation (sigest) for RBF or laplace kernel

 > res1 <- predict(t, newdata = data.frame(x2,x1,x3))
 > res1

 Prediction: 50 observations
 predict.type: prob
 threshold: a=0.33,b=0.33,c=0.33
 time: 0.01
      prob.a    prob.b    prob.c response
 1 0.2110131 0.3817773 0.4072095        c
 2 0.1551583 0.4066868 0.4381549        c
 3 0.4305353 0.3092737 0.2601910        a
 4 0.2160050 0.4142465 0.3697485        b
 5 0.1852491 0.3789849 0.4357659        c
 6 0.5879579 0.2269832 0.1850589        a

 > res2 <- predict(t$learner.model, data.frame(x2,x1,x3))
 > res2
  [1] c c a b c a b a c c b c b a c b c a a b c b c c a b b b a a b a c b a c c c
 [39] c a a b c b b b b a b b
 Levels: a b c
 > res1$data$response == res2
  [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE FALSE
 [13]  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE
 [25]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE
 [37]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
 [49]  TRUE FALSE

The predictions are not identical. Following mlr's tutorial page on prediction, I don't see why the results would differ. Thanks for your help.

-----

Update: When I do the same with a random forest model, the two vectors are equal. Is this because SVM is scale dependent and random forest is not?

 > library(randomForest)

 > classif <- makeClassifTask(id = "example", data = d, target = "C")
 > lrn <- makeLearner("classif.randomForest", predict.type = "prob", fix.factors.prediction = T)
 > t <- train(lrn, classif)
 >
 > res1 <- predict(t, newdata = data.frame(x2,x1,x3))
 > res1
 Prediction: 50 observations
 predict.type: prob
 threshold: a=0.33,b=0.33,c=0.33
 time: 0.00
   prob.a prob.b prob.c response
 1  0.654  0.228  0.118        a
 2  0.742  0.090  0.168        a
 3  0.152  0.094  0.754        c
 4  0.092  0.832  0.076        b
 5  0.748  0.100  0.152        a
 6  0.680  0.098  0.222        a
 >
 > res2 <- predict(t$learner.model, data.frame(x2,x1,x3))
 > res2
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26
  a  a  c  b  a  a  a  c  a  b  b  b  b  c  c  a  b  b  a  c  b  a  c  c  b  c
 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
  a  a  b  a  c  c  c  b  c  b  c  a  b  c  c  b  c  b  c  a  c  c  b  b
 Levels: a b c
 >
 > res1$data$response == res2
  [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [16] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [31] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
 [46] TRUE TRUE TRUE TRUE TRUE
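This outcome is expected for a random forest: randomForest's type = "response" is simply the majority vote over trees, i.e. the arg-max of the type = "prob" vote fractions, so no separate probability model is involved. A sketch (reusing the randomForest objects trained just above) that checks this directly, barring exact ties in the votes:

```r
# randomForest: the predicted class is the class with the largest vote
# fraction, so deriving the response from the vote matrix should
# reproduce predict(..., type = "response").
votes <- predict(t$learner.model, data.frame(x2, x1, x3), type = "prob")
resp_from_votes <- factor(colnames(votes)[max.col(votes)], levels = levels(res2))
all(resp_from_votes == res2)
```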

----

Another update: if I change predict.type from "prob" to "response", the two SVM prediction vectors agree with each other. I'm going to look into the differences between these types; I had thought that "prob" gave the same responses but additionally returned probabilities. Maybe that isn't the case?
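That does appear to be the mechanism in kernlab: with prob.model = TRUE, ksvm fits an auxiliary probability model (Platt-style scaling) on top of the SVM, and the arg-max of those probabilities need not match the raw decision-value response. A sketch that exposes the two prediction paths (assuming t is the ksvm model trained in the first example above):

```r
# Compare kernlab's two prediction types on the same data.
nd <- data.frame(x2, x1, x3)
resp  <- predict(t$learner.model, nd, type = "response")       # raw SVM decision
probs <- predict(t$learner.model, nd, type = "probabilities")  # Platt-scaled
# Class implied by the probability matrix:
resp_from_probs <- colnames(probs)[max.col(probs)]
# Disagreements tend to occur near the decision boundary:
table(as.character(resp) == resp_from_probs)
```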

  • Please provide a [reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input so we can test the code to see what might be going on. – MrFlick Jul 31 '15 at 20:20
  • Sure thing, I'll update in a few minutes. – Tedward Jul 31 '15 at 20:21

2 Answers


As you found out, the source of the "error" is that mlr and kernlab have different defaults for the type of predictions.

mlr maintains quite a bit of internal state and performs checks for each learner, with respect to that learner's parameters and to how training and testing are handled. You can get the type of prediction a learner will make with lrn$predict.type, which in your case gives "prob". If you want to know all the gory details, have a look at the implementation of classif.ksvm.

Mixing mlr-wrapped learners and the "raw" learners as you do in the example is not recommended, and it shouldn't be necessary. If you do mix them, things like this will happen -- so when using mlr, use only the mlr constructs to train models, make predictions, etc.

mlr does have tests to make sure that the "raw" and the wrapped learner produce the same output; see e.g. the one for classif.ksvm.
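For example, the prediction step can stay entirely inside mlr; a minimal sketch reusing lrn, classif, and d from the question:

```r
# Train and predict through mlr only -- the wrapped model then applies
# the learner's predict.type ("prob" here) consistently.
mod  <- train(lrn, classif)
pred <- predict(mod, newdata = d[, c("x1", "x2", "x3")])
head(pred$data)      # prob.a, prob.b, prob.c plus the response column
pred$data$response   # the classes mlr derives from the probabilities
```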

  • Thanks for your answer and advice for the future, Lars. The reason I was mixing "raw" and wrapped learners was because I had tuned the model parameters in mlr and wanted to use the raster::predict function, which requires mod$learner.model. I could convert the raster to dataframe and then use mlr's normal predict, but it wouldn't be as efficient. – Tedward Aug 04 '15 at 12:09
  • Ah, I see. You should be able to do that with the wrapped learner directly though, shouldn't you? – Lars Kotthoff Aug 04 '15 at 16:14
  • I briefly tried to use the wrapped learner, but couldn't directly use it in raster's predict function. – Tedward Aug 05 '15 at 13:39
  • Here's the error: Assertion on 'task' failed: Must have class 'Task', but has class 'data.frame'. This happens when I run "raster::predict(r, mod)" instead of "raster::predict(r, mod$learner.model)", where r is a raster object and mod is a trained mlr model. It's not a big deal, because it's easy to add "$learner.model". However, for integrating the mlr and raster packages, it would be nice to know a way where this doesn't need to be specified. – Tedward Aug 05 '15 at 16:36
  • Ah of course, `mlr`'s `predict` expects a `Task` as the second argument. The crux is the definition of `predict.WrappedModel` in predict.R. As a quick and dirty "fix", you could simply swap `task` and `newdata` in the signature of that method and see if it works for you then. – Lars Kotthoff Aug 05 '15 at 16:50
  • Thanks for the help, Lars. Alas, my knowledge is pretty shallow and the fix is beyond my depth, but I hope it's useful to someone in the future. – Tedward Aug 05 '15 at 18:08
  • If you could provide a complete example and make a case for why this should be supported, we may consider changing `mlr` to support this :) – Lars Kotthoff Aug 05 '15 at 18:10
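One workaround for the raster integration discussed in these comments is raster::predict's fun argument, which accepts a custom prediction function. A sketch, not tested against any particular mlr/raster versions; r and mod are the hypothetical raster object and trained mlr model from the comments:

```r
# raster::predict() lets you pass a custom prediction function via `fun`,
# so the wrapped mlr model can be used without reaching into $learner.model.
mlr_predict_fun <- function(model, data, ...) {
  p <- predict(model, newdata = as.data.frame(data))
  as.integer(p$data$response)   # raster layer values must be numeric
}
# landcover <- raster::predict(r, mod, fun = mlr_predict_fun)
```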

The answer lies here:

Why are probabilities and response in ksvm in R not consistent?

In short, ksvm type = "probabilities" gives different results than type = "response".

If I run

 > res2 <- predict(t$learner.model, data.frame(x2,x1,x3), type = "probabilities")
 > res2

then I get the same result as res1 above (type = "response" is the default).
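Internally, mlr's response column for predict.type = "prob" is just the class with the largest (threshold-scaled) probability, and with the default equal thresholds this is a plain row-wise arg-max. A sketch using res1 and the probability-matrix res2 as redefined just above:

```r
# With res2 now a probability matrix, the row-wise arg-max should
# reproduce mlr's response column (default thresholds are equal).
resp_from_probs <- colnames(res2)[max.col(res2)]
all(resp_from_probs == as.character(res1$data$response))
```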

Unfortunately, it seems that classifying an image based on the probabilities doesn't perform as well as using the "response". Perhaps the probabilities are still best used to estimate the certainty of a classification?
