I am a beginner in R. I want to use random forest to find out which variables are important for discriminating my data into class A or class B. The problem is that I have 100 samples of A but only 30 samples of B. So I want to randomly select 20 A and 20 B to train my tree, use the remaining 80 A and 10 B to test it, and then loop until I get the best tree. I am really new here and am having a lot of trouble writing this code.
Check the `caret` package for prediction, training, testing etc.: http://caret.r-forge.r-project.org/splitting.html Note in particular that its functions allow creation of balanced train/test sets or folds/bootstrap samples, e.g. `createFolds` and `createResample`, and of course training with randomForest among many other things. – Stephen Henderson Jan 01 '14 at 11:23
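A minimal sketch of the `caret` helpers mentioned above (this assumes `caret` is installed; the labels here are made up to match the question's 100 A / 30 B):

```r
library(caret)

# stand-in class labels: 100 of class A, 30 of class B
y <- factor(rep(c("A", "B"), c(100, 30)))

# five folds, each preserving the A/B class ratio
folds <- createFolds(y, k = 5)

# one class-balanced 50% split: indices of the training rows
in_train <- createDataPartition(y, p = 0.5, list = FALSE)
```

Both helpers stratify on the class labels, so each fold or split keeps roughly the same A/B proportion as the full data.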
Hi, thank you so much for your help, but I still have some questions. `createDataPartition` and `createResample` can sample, say, 50% of both A and B, but I want to sample 20% of A and 80% of B, and I don't know how to do that. I also still don't know how to loop until we have the best tree, e.g. the biggest ROC or the minimum error rate. It would be great if you could add some explanation, to make it easier for a beginner to understand and use. Thank you so much again. – user35815 Jan 02 '14 at 03:11
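The unequal-per-class split from the question can also be done in base R. A sketch, where the data frame `dat` and its `grp` column are made-up names standing in for the asker's data: draw 20 rows per class for training, leaving 80 A / 10 B for testing.

```r
set.seed(1)
# stand-in for the data: 100 rows of class A, 30 rows of class B
dat <- data.frame(x = rnorm(130), grp = factor(rep(c("A", "B"), c(100, 30))))

# draw 20 row indices from each class for the training set
train_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$grp),
                           function(i) sample(i, 20)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

table(train$grp)  # A: 20, B: 20
table(test$grp)   # A: 80, B: 10
```

Wrapping this in a `for` loop (resampling, refitting, and keeping the model with the best test-set error or ROC) would give the repeated train/test scheme the question describes.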
1 Answer
Please consider that you will probably be best off by (a) posting a small sample data set and (b) reading the documentation that supports this function. There are only a few examples listed for `randomForest`, but one of them does exactly what you ask about. I'll re-create the first part of your request exactly:
library(randomForest)  # the models below need this package

data(iris)
table(iris$Species)

# this re-creates your data set: 30 of one class, 100 of the other
iris <- iris[!(iris$Species == "setosa"), ]               # drop setosa
iris <- iris[21:100, ]                                    # keep 30 versicolor, 50 virginica
iris <- rbind(iris[iris$Species == "virginica", ], iris)  # duplicate virginica to get 100
iris <- droplevels(iris)
table(iris$Species)
# versicolor  virginica
#         30        100
# this runs the random forest: sampsize is the parameter you need
(iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE,
                         sampsize = c(30, 30), do.trace = 100))
#         OOB estimate of error rate: 6.92%
# Confusion matrix:
#            versicolor virginica class.error
# versicolor         27         3        0.10
# virginica           6        94        0.06
# no sampsize here: the full, unbalanced data set is used
(iris.rf.full <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE,
                              do.trace = 100))
# yes, the OOB error rate is a bit smaller here
#         OOB estimate of error rate: 3.08%
# Confusion matrix:
#            versicolor virginica class.error
# versicolor         26         4   0.1333333
# virginica           0       100   0.0000000
However, I caution you that these steps ("use the remaining 80 A and 10 B to test my tree, and then loop it until I get the best tree") are not necessary with randomForest: the algorithm tests your data "in place" and loops internally, with the number of iterations controlled by the ntree argument. Try setting the do.trace argument to 5 and watch how the OOB error reacts.
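For example, a sketch on the two-class iris subset used above (assumes the randomForest package is installed):

```r
library(randomForest)

data(iris)
# two-class subset: versicolor vs. virginica
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# do.trace = 5 prints the OOB error every 5 trees while the forest grows
set.seed(42)
rf <- randomForest(Species ~ ., data = iris2, ntree = 50, do.trace = 5)

plot(rf)  # OOB error as a function of the number of trees
```

The printed trace (and the plot) shows the OOB error stabilizing as trees are added, which is the built-in equivalent of the train/test loop the question asks for.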