I would like to use rfcv to cull the unimportant variables from a data set before creating a final random forest with more trees (please correct and inform me if that's not the way to use this function). For example,

> library(randomForest)
> data(fgl, package = "MASS")
> tst <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)
> tst$error.cv
        9         6         4         3         2         1 
0.2289720 0.2149533 0.2523364 0.2570093 0.3411215 0.5093458

In this case, if I understand the result correctly, it seems that we can remove three variables without negative side effects. However,

> attributes(tst)
$names
[1] "n.var"     "error.cv"  "predicted"

None of these components tells me which three variables can be harmlessly removed from the dataset.

Has QUIT--Anony-Mousse
tresbot
  • Data? Code? You should first look at http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example – dickoa Aug 10 '12 at 21:26
  • I added code, sorry about that. I found that randomForest has a value "importance", and I can change around rfcv a bit to just have it output that as well. I'm still confused, though, as to what purpose rfcv really has if it doesn't output the variables that can potentially be ignored. – tresbot Aug 10 '12 at 22:08

1 Answer

I think the purpose of rfcv is to establish how your accuracy is related to the number of variables you use. This might not seem useful when you have 10 variables, but when you have thousands of variables it is quite handy to understand how much those variables "add" to the predictive power.
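To make that relationship concrete, here is a minimal sketch that plots the cross-validated error against the number of variables for the example in the question. The seed value and the names `cv` and `best_n` are my own choices for illustration, and the exact numbers will vary from run to run.

```r
library(randomForest)
data(fgl, package = "MASS")

set.seed(1)  # arbitrary seed, only so the run can be repeated
cv <- rfcv(trainx = fgl[, -10], trainy = fgl[, 10], scale = "log", step = 0.7)

# n.var and error.cv are parallel vectors: one CV error per subset size
plot(cv$n.var, cv$error.cv, log = "x", type = "o",
     xlab = "Number of variables", ylab = "Cross-validated error")

# Subset size with the lowest cross-validated error in this run
best_n <- cv$n.var[which.min(cv$error.cv)]
best_n
```

Note that this only tells you how many variables to keep, not which ones.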

As you probably found out, this code

library(randomForest)
data(fgl, package = "MASS")
rf <- randomForest(type ~ ., data = fgl)
importance(rf)

gives you the relative importance of each of the variables.
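One way to combine the two results is sketched below: rank the predictors by importance and refit on the top few. This is an illustration under assumptions, not something `rfcv` does for you. The choice of 6 variables comes from the lowest error in the question's output, mean decrease in Gini is just one possible measure, and the names `ranked`, `keep` and `rf_small` as well as `ntree = 1000` are made up for the example.

```r
library(randomForest)
data(fgl, package = "MASS")

set.seed(42)  # importance changes between runs; the seed only makes this repeatable
rf <- randomForest(type ~ ., data = fgl)

# Rank predictors by mean decrease in Gini, most important first
imp <- importance(rf, type = 2)
ranked <- rownames(imp)[order(imp[, 1], decreasing = TRUE)]

# Keep the top 6 predictors (assumed from the rfcv result in the question)
keep <- ranked[1:6]

# Refit the final forest on the reduced set with more trees
rf_small <- randomForest(x = fgl[, keep], y = fgl$type, ntree = 1000)
rf_small
```

Because the ranking is affected by randomness, it may be worth repeating this with a few different seeds and checking whether the same variables end up at the bottom.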

nograpes
  • Hi @nograpes. Staying with this example, if I understand correctly, you suggest removing (or not considering) the three least important variables listed by the `importance(rf)` call. But variable importance in RF, especially with correlated variables, is affected by randomness, so the least important variables might change from one run to another. In addition, the `rfcv` call does not allow setting the `mtry` that you might have already set for the `randomForest` call. Therefore the importance ranking in `rfcv` might differ from the one in `importance(rf)`. – Nemesi Oct 25 '18 at 13:38