I am trying to use caret to build a random forest model for binary classification. I have used the randomForest package directly for this in the past and it worked fine, but with caret my output is the predicted class rather than a probability. With type='prob', it gives this error:

Error in [.data.frame(out, , obsLevels, drop = FALSE) : undefined columns selected

I am using the same syntax (I hope) for both. This is what I used to get with the randomForest package directly:

>fit = randomForest(x = a[,-1], y = as.factor(a[,1]),ntree=120)
>head(predict(fit, newdata = test_data[,-c(1:2)], type = "prob")[,2])
         1          2          3          4          5          6 
0.04166667 0.03333333 0.55833333 0.80000000 0.87500000 0.04166667

Now I am trying to do the same with caret, but predict does not accept type='prob' and gives me this error:

>rf_model<-train(x = a[,-1], y = as.factor(a[,1]),method="rf",ntree=120)
>head(predict(rf_model, test_data[,-c(1:2)], type="prob"))
Error in `[.data.frame`(out, , obsLevels, drop = FALSE) : 
undefined columns selected

When I leave out the type argument, it gives me class labels instead:

>head(predict(rf_model, test_data[,-c(1:2)]))
[1] 0 0 1 1 1 0
Levels: 0 1

How do I get output in probabilities?

I need to fit several other algorithms after this, and I think caret would give me a more uniform interface for that. I am sure I am missing something here, but being new to caret I don't know what.

HoneyBadger

4 Answers

UPDATE: I found the solution through here. Apparently, caret's train does not handle 0 and 1 as class values in the target variable when class probabilities are requested. Changing them to strings ('r' and 's') worked perfectly.

> a$dv<-gsub('0','r',a$dv)
> a$dv<-gsub('1','s',a$dv)
> rf_model<-train(x = a[,-c(1:2)], y = as.factor(a[,2]),method="rf",ntree=120)
> head(predict(rf_model, test_data[,-c(1:2)], type="prob"))
      r           s
1 0.9750000 0.025000000
2 0.9916667 0.008333333
3 0.2583333 0.741666667
4 0.2833333 0.716666667
5 0.1583333 0.841666667
6 1.0000000 0.000000000 
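
A possible explanation, offered as a sketch rather than a confirmed diagnosis: caret returns class probabilities in a data frame whose columns are named after the factor levels, so levels such as "0" and "1", which are not valid R variable names, can lead to the "undefined columns selected" error. The snippet below relabels the levels with make.names() instead of gsub(). The data frame `a_demo` and its columns are made up for illustration, and the caret call is guarded so that the relabelling step still runs where caret and randomForest are not installed.

```r
# Hypothetical toy data standing in for the asker's data frame `a`.
set.seed(1)
a_demo <- data.frame(
  dv = rep(c(0, 1), each = 50),
  x1 = c(rnorm(50, mean = 0), rnorm(50, mean = 2)),
  x2 = runif(100)
)

# make.names() turns the levels "0"/"1" into the valid names "X0"/"X1".
y <- factor(a_demo$dv)
levels(y) <- make.names(levels(y))
print(levels(y))

# Train and predict only if caret and randomForest are available.
if (requireNamespace("caret", quietly = TRUE) &&
    requireNamespace("randomForest", quietly = TRUE)) {
  library(caret)
  rf_demo <- train(x = a_demo[, c("x1", "x2")], y = y,
                   method = "rf", ntree = 120)
  # One probability column per class level (X0 and X1).
  print(head(predict(rf_demo, a_demo[, c("x1", "x2")], type = "prob")))
}
```

Any pair of syntactically valid names would do here; 'r' and 's' above work for the same reason.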
HoneyBadger

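Keep type = "prob", so that the predictions are probabilities:

prd <- predict(rf_model, test_data[,-c(1:2)], type="prob")

and convert them to class labels wherever caret needs them, using the second column (the probability of the second class):

as.factor(as.numeric(prd[, 2] >= .5))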
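
As a minimal sketch of that thresholding step, assuming the probabilities come back as a data frame with one column per class (here the 'r' and 's' labels from the accepted answer): `prd_demo` and `cutoff` are hypothetical names, and the numbers are invented to mimic the shape of the output.

```r
# Hypothetical probability table shaped like the output of
# predict(rf_model, ..., type = "prob") with classes "r" and "s".
prd_demo <- data.frame(r = c(0.975, 0.258, 0.158, 1.000),
                       s = c(0.025, 0.742, 0.842, 0.000))

# Label a row "s" when its probability reaches the cutoff, otherwise "r".
cutoff <- 0.5
pred_class <- factor(ifelse(prd_demo$s >= cutoff, "s", "r"),
                     levels = c("r", "s"))
print(pred_class)
```

Selecting the column by name avoids comparing the whole two-column table against the cutoff, which would return twice as many values as there are rows.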

entropium
  • Thanks for replying! I am not sure what you mean by "Try to keep..." because I already tried it and got error from R (mentioned in the post). – HoneyBadger Nov 24 '15 at 14:40
  • Keep the predictions in probabilities, but factorize them when used in Caret, because Caret applies to 0-1 binary variables. Whichever has probability >= 0.5 is a "1", and < 0.5 is a "0". – entropium Nov 24 '15 at 14:51
  • "Keep the predictions in probabilities"..but that is what I want to get as output here!! My whole goal is to get probability predictions instead of binary (0,1) because type="prob" returns error upon usage in predict command. I am not following what you are trying to say. – HoneyBadger Nov 24 '15 at 22:34

It works fine with caret v6.0-41:

library(caret)
set.seed(1)
rf_model <- train(x = iris[,-5], y = as.factor(iris[,5]), method="rf", ntree=120)
tail(predict(rf_model, iris[, -5], type="prob"))

    setosa  versicolor virginica
145      0 0.000000000 1.0000000
146      0 0.000000000 1.0000000
147      0 0.008333333 0.9916667
148      0 0.000000000 1.0000000
149      0 0.000000000 1.0000000
150      0 0.025000000 0.9750000

R version 3.0.3 (2014-03-06) Platform: x86_64-w64-mingw32/x64 (64-bit)

I think the problem comes from your training data (a[,-1]) and testing data (test_data[,-c(1:2)]) not having exactly the same columns.
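
One way to check this is to compare the column names directly. The sketch below uses two made-up data frames, `train_x` and `test_x`, in place of a[,-1] and test_data[,-c(1:2)].

```r
# Hypothetical stand-ins for the predictors used in training and prediction.
train_x <- data.frame(x1 = 1:3, x2 = 4:6, x3 = 7:9)
test_x  <- data.frame(x1 = 1:3, x3 = 7:9, x4 = 0:2)

# Columns the model was trained on but the test set lacks, and vice versa.
missing_in_test <- setdiff(names(train_x), names(test_x))
extra_in_test   <- setdiff(names(test_x), names(train_x))
print(missing_in_test)
print(extra_in_test)

# TRUE only when both sets have the same names in the same order.
same_columns <- identical(names(train_x), names(test_x))
print(same_columns)
```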

  • Nope, I checked it all. Both have same number and names of columns. "-c(1:2)" in test_data is just data cleaning part which I already removed in dataset 'a'. Your code with iris data is working though. – HoneyBadger Nov 25 '15 at 19:58
  • Yes indeed, it fails whatever numbers you have as classes, I've tried several combinations and got the same error. – Péter Elekes Nov 26 '15 at 07:13

You've probably resolved this long ago, but on the current release of caret, type = "prob" for a 2-level factor outputs 2 columns: the probability of 0 and the probability of 1 (or whatever your 2 levels are).

Jeff J.