I am a beginner in R. I want to use random forest to find out which variables are important for discriminating my data into class A or class B. The problem is that I have 100 samples of A but only 30 samples of B. So I want to randomly select 20 A and 20 B to train my tree, use the remaining 80 A and 10 B to test it, and then loop until I get the best tree. I am really new here and am having a lot of trouble writing this code.
Check the `caret` package for prediction, training, testing etc.: http://caret.r-forge.r-project.org/splitting.html Note in particular that its functions allow creation of balanced train/test sets or folds/bootstrap samples, e.g. `createFolds` and `createResample`, and of course training with randomForest among many other things. – Stephen Henderson Jan 01 '14 at 11:23
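A minimal sketch of the `caret` helpers mentioned above (this assumes `caret` is installed; the labels here are made up to match the question's 100 A / 30 B):

```r
library(caret)

# stand-in class labels: 100 of class A, 30 of class B
y <- factor(rep(c("A", "B"), c(100, 30)))

# five folds, each preserving the A/B class ratio
folds <- createFolds(y, k = 5)

# one class-balanced 50% split: indices of the training rows
in_train <- createDataPartition(y, p = 0.5, list = FALSE)
```

Both helpers stratify on the class labels, so each fold or split keeps roughly the same A/B proportion as the full data.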
Hi, thank you so much for your help, but I still have some questions. `createDataPartition` and `createResample` can sample, say, 50% of both A and B, but I want to sample 20% of A and 80% of B, and I don't know how to do that. I also still don't know how to loop until we have the best tree, e.g. the biggest ROC or the minimum error rate. It would be great if you could add some explanation, to make it easier for a beginner to understand and use. Thank you so much again. – user35815 Jan 02 '14 at 03:11
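The unequal-per-class split from the question can also be done in base R. A sketch, where the data frame `dat` and its `grp` column are made-up names standing in for the asker's data: draw 20 rows per class for training, leaving 80 A / 10 B for testing.

```r
set.seed(1)
# stand-in for the data: 100 rows of class A, 30 rows of class B
dat <- data.frame(x = rnorm(130), grp = factor(rep(c("A", "B"), c(100, 30))))

# draw 20 row indices from each class for the training set
train_idx <- unlist(lapply(split(seq_len(nrow(dat)), dat$grp),
                           function(i) sample(i, 20)))
train <- dat[train_idx, ]
test  <- dat[-train_idx, ]

table(train$grp)  # A: 20, B: 20
table(test$grp)   # A: 80, B: 10
```

Wrapping this in a `for` loop (resampling, refitting, and keeping the model with the best test-set error or ROC) would give the repeated train/test scheme the question describes.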
1 Answer
Please consider that you will probably be best off by (a) posting a small sample data set and (b) reading the documentation that supports this function. There are only a few examples listed for `randomForest`, but one of them does exactly what you ask about. I'll re-create the first part of your request exactly:
library(randomForest)  # the models below need this package

data(iris)
table(iris$Species)

# this re-creates your data set: 30 of one class, 100 of the other
iris <- iris[!(iris$Species == "setosa"), ]               # drop setosa
iris <- iris[21:100, ]                                    # keep 30 versicolor, 50 virginica
iris <- rbind(iris[iris$Species == "virginica", ], iris)  # duplicate virginica to get 100
iris <- droplevels(iris)
table(iris$Species)
# versicolor  virginica
#         30        100
# this runs the random forest: sampsize is the parameter you need
(iris.rf <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE,
                         sampsize = c(30, 30), do.trace = 100))
#         OOB estimate of error rate: 6.92%
# Confusion matrix:
#            versicolor virginica class.error
# versicolor         27         3        0.10
# virginica           6        94        0.06
# no sampsize here: the full, unbalanced data set is used
(iris.rf.full <- randomForest(Species ~ ., data = iris, ntree = 500, importance = TRUE,
                              do.trace = 100))
# yes, the OOB error rate is a bit smaller here
#         OOB estimate of error rate: 3.08%
# Confusion matrix:
#            versicolor virginica class.error
# versicolor         26         4   0.1333333
# virginica           0       100   0.0000000
However, I caution you that these steps ("use the remaining 80 A and 10 B to test my tree, and then loop it until I get the best tree") are not necessary with randomForest: the algorithm tests your data "in place" and loops internally, with the number of iterations controlled by the ntree argument. Try setting the do.trace argument to 5 and watch how the OOB error reacts.
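For example, a sketch on the two-class iris subset used above (assumes the randomForest package is installed):

```r
library(randomForest)

data(iris)
# two-class subset: versicolor vs. virginica
iris2 <- droplevels(iris[iris$Species != "setosa", ])

# do.trace = 5 prints the OOB error every 5 trees while the forest grows
set.seed(42)
rf <- randomForest(Species ~ ., data = iris2, ntree = 50, do.trace = 5)

plot(rf)  # OOB error as a function of the number of trees
```

The printed trace (and the plot) shows the OOB error stabilizing as trees are added, which is the built-in equivalent of the train/test loop the question asks for.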