I'm using the randomForest package in R to predict distances between proteins (a regression model in RF) for homology modeling purposes, and I have obtained quite good results. However, I need a confidence level to rank my predicted values and filter out the bad models. Is there any way to calculate such a confidence level, or any other way of measuring the certainty of the predictions?
Any suggestions or recommendations are highly appreciated.

StupidWolf

DOSMarter
One simple approach would be to treat the predictions from each tree in the forest as a sample of predictions, from which you can calculate a mean and standard error, just as if you were calculating a CI for a mean. – joran Jul 23 '13 at 14:25
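The per-tree approach from the comment above can be sketched as follows. This is a minimal illustration, not a calibrated interval: it uses the built-in `mtcars` data as a stand-in for the protein distances (an assumption made only so the example runs anywhere), and it ranks by the standard deviation across trees, since a standard error of the mean would shrink simply by growing more trees. The `predict.all = TRUE` argument of `predict()` for randomForest objects returns each tree's prediction.

```r
library(randomForest)

set.seed(42)
# Stand-in data: mtcars replaces the protein-distance data (assumption)
train <- mtcars[1:24, ]
test  <- mtcars[25:32, ]

rf <- randomForest(mpg ~ ., data = train, ntree = 500)

# predict.all = TRUE returns every tree's prediction plus the forest average
p <- predict(rf, newdata = test, predict.all = TRUE)

tree_preds <- p$individual            # matrix: rows = test cases, columns = trees
tree_sd <- apply(tree_preds, 1, sd)   # disagreement between trees, per test case

# Rank test cases from most to least agreement among trees
ranked <- data.frame(pred = p$aggregate, tree_sd = tree_sd)
ranked <- ranked[order(ranked$tree_sd), ]
head(ranked)
```

Predictions with a large `tree_sd` are those on which the trees disagree most, so they are natural candidates to filter out first.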
1 Answer
Following the jackknife method described in this paper to obtain standard errors, you can use the implementation in the ranger package:
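library(ranger)
library(mlbench)
data(BostonHousing)
# keep.inbag = TRUE stores the in-bag counts needed for standard error estimation
mdl = ranger(medv ~ ., data = BostonHousing[1:400, ], keep.inbag = TRUE)
# type = "se" returns a standard error alongside each prediction
pred = predict(mdl, BostonHousing[401:nrow(BostonHousing), ], type = "se")
head(cbind(pred$predictions, pred$se))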
[,1] [,2]
[1,] 10.673356 1.107839
[2,] 11.390374 1.102217
[3,] 12.760511 1.126945
[4,] 10.458128 1.100246
[5,] 10.720076 1.084376
[6,] 9.914648 1.102000
An approximate 95% confidence interval is the prediction ± 1.96*se. There is also a newer package, forestError, that works on randomForest objects:
library(randomForest)
library(forestError)
# keep.inbag = TRUE is required by forestError
mdl = randomForest(medv ~ ., data = BostonHousing[1:400, ], keep.inbag = TRUE)
# arguments: fitted forest, training data, test data
err = quantForestError(mdl, BostonHousing[1:400, ], BostonHousing[401:nrow(BostonHousing), ])
head(err$estimates)
pred mspe bias lower_0.05 upper_0.05
1 10.649734 15.70943 -1.5336411 2.935949 12.59486
2 11.611078 15.16339 -1.4436056 3.897293 13.55621
3 12.603938 20.92701 -0.9590869 4.890153 22.32699
4 10.650549 12.42555 -1.4188440 3.941648 12.49029
5 10.414707 29.08155 -1.1438267 2.700922 31.42272
6 9.720305 19.63286 -1.3469671 2.006520 16.43220
You can refer to this paper for the actual method used.
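To get from standard errors to the ranking and filtering the question asks for, one option is to build approximate 95% intervals and sort by uncertainty. The sketch below uses only base R and the six prediction/se pairs printed in the ranger output above. The cutoff `max_se` is a hypothetical threshold chosen purely for illustration; a real cutoff would depend on the accuracy the homology models need.

```r
# Values copied from the ranger output shown earlier (first six test rows)
pred <- c(10.673356, 11.390374, 12.760511, 10.458128, 10.720076, 9.914648)
se   <- c(1.107839, 1.102217, 1.126945, 1.100246, 1.084376, 1.102000)

z <- qnorm(0.975)   # about 1.96 for a two-sided 95% interval

res <- data.frame(pred  = pred,
                  se    = se,
                  lower = pred - z * se,
                  upper = pred + z * se)

# Rank from most to least certain (smallest standard error first)
res <- res[order(res$se), ]

# Hypothetical cutoff: keep predictions whose se is below the median se
max_se <- median(se)
keep <- res[res$se < max_se, ]

print(res)
print(keep)
```

These intervals describe the uncertainty of the forest's estimate, not the spread of individual observations, so they are best read as a relative ranking rather than a guarantee on the true distance.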

StupidWolf