15

I'm dealing with random forests for the first time and I'm having some trouble that I can't figure out. When I run the analysis on my whole dataset (about 3000 rows) I don't get any error message. But when I perform the same analysis on a subset of my dataset (about 300 rows) I get an error:

dataset <- read.csv("datasetNA.csv", sep=";", header=T)
names(dataset)
dataset2 <- dataset[complete.cases(dataset$response),]
library(randomForest)
dataset2 <- na.roughfix(dataset2)
data.rforest <- randomForest(dataset2$response ~ dataset2$predictorA + dataset2$predictorB+ dataset2$predictorC + dataset2$predictorD + dataset2$predictorE + dataset2$predictorF + dataset2$predictorG + dataset2$predictorH + dataset2$predictorI, data=dataset2, ntree=100, keep.forest=FALSE, importance=TRUE)

# subset of my original dataset:
groupA<-dataset2[dataset2$order=="groupA",]
data.rforest <- randomForest(groupA$response ~ groupA$predictorA + groupA$predictorB+ groupA$predictorC + groupA$predictorD + groupA$predictorE + groupA$predictorF + groupA$predictorG + groupA$predictorH + groupA$predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

Error in randomForest.default(m, y, ...) : Can't have empty classes in y.

However, my response variable doesn't have any empty classes.

If instead I call randomForest as (a+b+c, y) rather than (y ~ a+b+c), I get this other message:

Error in if (n == 0) stop("data (x) has 0 rows") : 
  argument length zero
Warning messages:
1: In Ops.factor(groupA$responseA + groupA$responseB,  :
  + not meaningful for factors

The second problem is that when I try to impute my data through rfImpute() I get an error:

Error in na.roughfix.default(x) :  roughfix can only deal with numeric data

However, my columns are all either factors or numeric.

Can somebody see where I'm going wrong?

user1842218
  • See [this](http://stackoverflow.com/q/5963269/324364) question for help on adding example data to your question. (Also note the formatting toolbar above the area you're typing in.) – joran Nov 21 '12 at 15:54

9 Answers

23

Based on the discussion in the comments, here's a guess at a potential solution.

The confusion here arises from the fact that the levels of a factor are an attribute of the variable. Those levels will remain the same, no matter what subset you take of the data, no matter how small that subset. This is a feature, not a bug, and a common source of confusion.

If you want to drop unused levels when subsetting, wrap your subset operation in droplevels():

groupA <- droplevels(dataset2[dataset2$order=="groupA",])

I should probably also add that many R users set options(stringsAsFactors = FALSE) when starting a new session (e.g. in their .Rprofile file) to avoid these kinds of hassles. The downside to doing this is that if you share your code with other people frequently, this can cause problems if they haven't altered R's default options.
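To see the behavior concretely, here is a tiny self-contained sketch (toy data, not from the question):

```r
f <- factor(c("a", "a", "b", "c"))   # a factor with three levels
sub <- f[f != "c"]                   # remove every "c" observation

levels(sub)   # still "a" "b" "c" -- the level definition survives subsetting
table(sub)    # "c" now has a count of 0: the "empty class" randomForest complains about

sub <- droplevels(sub)
levels(sub)   # "a" "b" -- the unused level is gone
```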

joran
  • An alternative to droplevels() is simply to call factor() and assign the result back to the same variable: `dataset$Column <- factor(dataset$Column)`. Calling `levels(dataset$Column)` before and after lets you verify that the unused levels have been removed. – rishiehari Apr 01 '14 at 15:05
  • Alternatively, if you need your model to be able to predict that level, you need to ensure your subset includes examples of all levels. – The_Tams Mar 21 '23 at 18:44
8

When factor levels are removed by subsetting, you must reset levels:

levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "X" "Y" "b"
train11$str <- factor(train11$str)
levels(train11$str)
[1] "B" "D" "E" "G" "H" "I" "O" "T" "b"
James A Mohler
4

This happens because you are subsetting your training set before passing the data to your random forest, and subsetting can drop some levels from your response variable. You therefore need to re-create the factor:

dataset2$response <- factor(dataset2$response)

This removes the levels that are no longer present in the data after subsetting.
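A minimal sketch of this fix on toy data (the variable names are invented for illustration):

```r
resp <- factor(c("x", "x", "y", "z"))
sub <- resp[resp != "z"]   # subsetting leaves the unused level "z" behind
nlevels(sub)               # 3

sub <- factor(sub)         # re-create the factor from the values actually present
nlevels(sub)               # 2
```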

3

Try using the function formula before passing it to randomForest:

formula("y ~ a+b+c")

This fixed the problem for me.

Or it might be that randomForest is mistaking one parameter for another.

Try specifying what each parameter is:

randomForest(data = my_data, mtry = my_mtry, ...)
Timothée HENRY
1

randomForest(x = data, y = label, importance = TRUE, ntree = 1000)

Here label is a factor, so use droplevels(label) to remove the levels with zero count before passing it to the randomForest function.

To check the count for each level, use table(label).
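For example, a quick sanity check before fitting (toy label vector, assumed for illustration):

```r
label <- factor(c("A", "A", "B", "C"))[1:3]  # subsetting leaves "C" with no observations
table(label)           # A = 2, B = 1, C = 0 -- the zero-count level would trigger the error

label <- droplevels(label)
table(label)           # only A and B remain
```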

1

When using the formula interface, if there is any outcome variable class for which there is not at least one complete case in your predictor variables, you will get this error.

Omitting NAs in your predictor variables can sometimes result in an entire class being omitted from your data, in which case you would have a factor level defined for which there are no observations.

set.seed(1)
df <- data.frame(y = gl(5, 10),  
                 x1 = factor(rep(c('a', 'b', 'c', 'd', 'e'), 10)), 
                 x2 = runif(50),  x3 = rnorm(50))
df[df$y == 2, "x1"] <- NA
randomForest(as.formula(y ~ x1 + x2 + x3), 
             data = df, ntree = 10, na.action = na.omit)
> Error in randomForest.default(m, y, ...) : Can't have empty classes in y.

Try wrapping the complete.cases() subset in droplevels():

randomForest(as.formula(y ~ x1 + x2 + x3),
             data = droplevels(df[complete.cases(df), ]), 
             ntree = 10, na.action = na.omit)
Mark Egge
0

It seems the problem is in the call statement. If you use the formula interface, then call:

randomForest(response ~ predictorA + predictorB + ... + predictorI, data=groupA, ntree=100, keep.forest=FALSE, importance=TRUE)

But it is more convenient and faster to pass x and y explicitly:

randomForest(y = groupA$response, x = groupA[,c("predictorA", "predictorB", ...)], ntree=100, keep.forest=FALSE, importance=TRUE)

Instead of variable names you can also use their column indices. Try these suggestions.

DrDom
  • I just tried both ways, but I still get the same message: "Error in randomForest.default(m, y, ...) : Can't have empty classes in y" – user1842218 Nov 21 '12 at 15:16
  • @user1842218 I suspect that you are mistaken, and that R is correct, in that by taking a subset of your data you actually have removed all instances of one of the levels of a factor. (Error messages rarely flat out lie.) – joran Nov 21 '12 at 15:18
  • I checked my subset, and it is complete. Every factor I use in the formula has multiple levels. I also saved my subset and opened it again as a new dataset, but nothing changed – user1842218 Nov 21 '12 at 15:23
  • @user1842218, provide minimal sample of the data which causes this error. – DrDom Nov 21 '12 at 15:26
  • 1
    @user1842218 "Every factor I use in the formula consists in multiple levels." The error message is not saying that one of your factors has only one level, it's saying that _one_ of the levels doesn't actually appear. Until you provide a reproducible example demonstrating otherwise, I'm sticking by my belief in R's error message. – joran Nov 21 '12 at 15:31
  • I'm trying to paste a few rows but it gets messy. How can I upload a small dataset? – user1842218 Nov 21 '12 at 15:42
  • Joran, you are right: in my subset the response variable doesn't have one of the levels of the original sample. What is strange is that if I type levels(response) I also get the missing level. How can I tell R not to consider it? – user1842218 Nov 21 '12 at 15:58
0

Just another suggestion to add to the mix: there is a chance that you don't want read.csv() to interpret strings as factors. Try adding colClasses to your read.csv() call to force conversion to characters:

dataset <- read.csv("datasetNA.csv", 
                    sep=";", 
                    header=T,
                    colClasses="character")
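If you read everything in as character, you can then convert each column back deliberately. A sketch on a toy data frame (column names are invented; they stand in for the question's response/predictor columns):

```r
# mimics the result of read.csv(..., colClasses = "character")
dataset <- data.frame(response   = c("yes", "no", "yes"),
                      predictorA = c("1.5", "2.0", "0.7"),
                      stringsAsFactors = FALSE)

dataset$response   <- factor(dataset$response)        # make the response a factor on purpose
dataset$predictorA <- as.numeric(dataset$predictorA)  # make numeric predictors numeric

str(dataset)  # one factor column, one numeric column
```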
Jmoney38
-1

I had the same problem today and solved it. When the response is a factor, randomForest defaults to classification, while my response was really numerical. And when you use a subset as the training dataset, the factor levels in the training data can be restricted compared with the test data.