I'm using the randomForest package in R to predict distances between proteins (a random forest regression model) for homology modeling, and I've obtained quite good results. However, I need a confidence level to rank my predicted values and filter out the bad models. Is there any way to calculate such a confidence level, or any other way to measure the certainty of the predictions? Any suggestions or recommendations are highly appreciated.

StupidWolf
DOSMarter
  • 5
    One simple approach would be to simply treat the predictions from each tree in the forest as a sample of predictions, from which you can calculate a mean and standard error, just as if you were calculating a CI for a mean. – joran Jul 23 '13 at 14:25
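A minimal sketch of the per-tree idea from the comment above. It assumes the built-in `mtcars` data as a stand-in (not the protein data from the question) and uses `predict.all = TRUE` from randomForest to get each tree's prediction. The trees are all grown from the same training set, so they are not independent samples; treat the resulting spread as a heuristic ranking score rather than a calibrated interval.

```r
library(randomForest)

set.seed(1)
train <- mtcars[1:24, ]   # stand-in training data (assumption)
test  <- mtcars[25:32, ]  # stand-in new data (assumption)

rf <- randomForest(mpg ~ ., data = train, ntree = 500)

# predict.all = TRUE returns the forest average ($aggregate) and a matrix
# of per-tree predictions ($individual: one row per observation, one column per tree)
p <- predict(rf, newdata = test, predict.all = TRUE)

tree_sd <- apply(p$individual, 1, sd)   # spread of the trees for each observation
tree_se <- tree_sd / sqrt(rf$ntree)     # "standard error of the mean" across trees

res <- data.frame(pred = p$aggregate, tree_sd = tree_sd, tree_se = tree_se)

# Smallest spread first: the predictions the trees agree on most
res[order(res$tree_sd), ]
```

Sorting by `tree_sd` gives a simple way to rank predictions; any cutoff for discarding the "bad" ones would have to be chosen for the data at hand.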

1 Answer

You can obtain a standard error with the jackknife method described in this paper, which is implemented in the ranger package:

library(ranger)
library(mlbench)
data(BostonHousing)

mdl = ranger(medv ~ .,data=BostonHousing[1:400,],keep.inbag = TRUE)

pred = predict(mdl,BostonHousing[401:nrow(BostonHousing),],type="se")

head(cbind(pred$predictions, pred$se))
          [,1]     [,2]
[1,] 10.673356 1.107839
[2,] 11.390374 1.102217
[3,] 12.760511 1.126945
[4,] 10.458128 1.100246
[5,] 10.720076 1.084376
[6,]  9.914648 1.102000
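The two columns above can be combined into an approximate 95% interval and a ranking. The sketch below repeats the same ranger fit so that it runs on its own; the seed, the `ci` data frame, and the choice to sort by `se` are illustrative assumptions, not part of the original answer. The interval describes the sampling variability of the forest's prediction under a normal approximation and does not include the noise in individual observations.

```r
library(ranger)
library(mlbench)
data(BostonHousing)

set.seed(42)  # arbitrary seed, only for reproducibility
train <- BostonHousing[1:400, ]
test  <- BostonHousing[401:nrow(BostonHousing), ]

# keep.inbag = TRUE is required for standard error estimation
mdl  <- ranger(medv ~ ., data = train, keep.inbag = TRUE)
pred <- predict(mdl, test, type = "se")

z <- qnorm(0.975)  # normal quantile, roughly 1.96
ci <- data.frame(
  pred  = pred$predictions,
  se    = pred$se,
  lower = pred$predictions - z * pred$se,
  upper = pred$predictions + z * pred$se
)

# Most certain predictions first
ci <- ci[order(ci$se), ]
head(ci)
```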

An approximate 95% confidence interval can be estimated as the prediction ± 1.96*se. There is also a new package, forestError, that works on randomForest objects:

library(randomForest)
library(forestError)
mdl = randomForest(medv ~ .,data=BostonHousing[1:400,],keep.inbag=TRUE)

err = quantForestError(mdl,BostonHousing[1:400,],BostonHousing[401:nrow(BostonHousing),])

head(err$estimates)
       pred     mspe       bias lower_0.05 upper_0.05
1 10.649734 15.70943 -1.5336411   2.935949   12.59486
2 11.611078 15.16339 -1.4436056   3.897293   13.55621
3 12.603938 20.92701 -0.9590869   4.890153   22.32699
4 10.650549 12.42555 -1.4188440   3.941648   12.49029
5 10.414707 29.08155 -1.1438267   2.700922   31.42272
6  9.720305 19.63286 -1.3469671   2.006520   16.43220
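To rank or filter predictions with these estimates, one option is to sort by the root of `mspe` or by the width of the interval. The sketch below does not call forestError; it copies the six rows printed above into a data frame as literal values. The `rmspe` and `width` columns and the cutoff of 10 are hypothetical choices for illustration.

```r
# The six rows printed above, entered as literal values
est <- data.frame(
  pred       = c(10.649734, 11.611078, 12.603938, 10.650549, 10.414707, 9.720305),
  mspe       = c(15.70943, 15.16339, 20.92701, 12.42555, 29.08155, 19.63286),
  bias       = c(-1.5336411, -1.4436056, -0.9590869, -1.4188440, -1.1438267, -1.3469671),
  lower_0.05 = c(2.935949, 3.897293, 4.890153, 3.941648, 2.700922, 2.006520),
  upper_0.05 = c(12.59486, 13.55621, 22.32699, 12.49029, 31.42272, 16.43220)
)

est$rmspe <- sqrt(est$mspe)                   # error on the scale of the response
est$width <- est$upper_0.05 - est$lower_0.05  # width of the prediction interval

# Rank: smallest estimated error first
ranked <- est[order(est$rmspe), ]

# Filter: keep predictions whose interval is narrower than a chosen cutoff
max_width <- 10  # hypothetical threshold, pick one suited to your data
keep <- est[est$width < max_width, ]

ranked
keep
```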

You can refer to this paper for the actual method used.

StupidWolf