
How can I implement stratified sampling in a randomForest regression in R? I know that the `strata` and `sampsize` parameters are used in randomForest classification problems, but when I pass them in a regression I get this error: Error in { : task 1 failed - "sampsize should be of length one."

My data:

x <- sample(1:10, 100, replace = TRUE)
y <- sample(1:20, 100, replace = TRUE)
Region <- sample(c('N', 'S'), 100, replace = TRUE)

df <- data.frame(x, y, Region)

My code:

library(randomForest)
randomForest(x ~ y, data = df, sampsize = c(30,20), strata = df$Region)

The imbalance between groups in my actual data is far worse than in this example. Thank you.

ecologist1234
  • When asking for help, you should include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. We don't need your entire real dataset; maybe use a built-in dataset to illustrate your problem. – MrFlick Mar 09 '18 at 15:54
  • It looks like your code is correct, based on the documentation... Not sure why it isn't working. Have you considered stratifying outside the model statement? – pyll Mar 09 '18 at 16:26
  • Thanks. Huh. I have considered stratifying the training dataset, but that means using only a tiny fraction of the data from the larger groups (Regions). By sampling for each tree, a greater proportion of data points should enter the full forest (i.e., at least one tree). – ecologist1234 Mar 09 '18 at 16:31
  • You can use `sampsize` on a per-stratum basis only for classification. You are running it as a regression problem. – IRTFM Mar 09 '18 at 19:37
  • Aha. I was starting to suspect that. Interestingly, I did get it to run with regression once I found that I had to convert "Region" from character to factor. But as you say, it doesn't seem to be sampling by stratum. The help file for randomForest also isn't very clear that `strata` only works for classification. – ecologist1234 Mar 09 '18 at 19:55
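
Following the comments above, one possible workaround for the regression case is to do the per-tree stratified draw by hand: sample a fixed number of rows from each Region, grow a single tree on exactly those rows, repeat, and average the tree predictions. This is only a sketch under assumptions, not a built-in feature of randomForest. The helper `stratified_rows`, the names `n_per_stratum` and `n_trees`, and the seed are all hypothetical and introduced here for illustration; the per-stratum sizes 30 and 20 come from the question.

```r
library(randomForest)

set.seed(42)  # assumption: seed added only so the sketch is repeatable

# Same toy data as in the question, with Region stored as a factor
df <- data.frame(
  x = sample(1:10, 100, replace = TRUE),
  y = sample(1:20, 100, replace = TRUE),
  Region = factor(sample(c("N", "S"), 100, replace = TRUE))
)

# Hypothetical helper: draw up to n_per_stratum[[level]] rows,
# without replacement, from each stratum
stratified_rows <- function(strata, n_per_stratum) {
  unlist(lapply(names(n_per_stratum), function(level) {
    rows <- which(strata == level)
    k <- min(n_per_stratum[[level]], length(rows))
    rows[sample.int(length(rows), k)]
  }))
}

n_per_stratum <- c(N = 30, S = 20)  # per-stratum sizes from the question
n_trees <- 100                      # assumption: arbitrary forest size

# Grow one tree per stratified draw. replace = FALSE together with
# sampsize = nrow(sub) makes the tree use exactly the rows in `sub`.
trees <- lapply(seq_len(n_trees), function(i) {
  sub <- df[stratified_rows(df$Region, n_per_stratum), ]
  randomForest(x ~ y, data = sub, ntree = 1,
               replace = FALSE, sampsize = nrow(sub))
})

# A regression forest predicts the mean of its trees' predictions
pred <- rowMeans(sapply(trees, predict, newdata = df))
```

Because every tree sees all rows of its own subsample, the individual fits carry no usable out-of-bag error; if an error estimate is needed, it would have to come from a separate hold-out set or from the rows left out of each draw.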
