I'm trying to use the randomForest package in R, but I've encountered a problem where R tells me that there is missing data in the response vector.

> rf_blackcomb_earlyGame <- randomForest(max_cohort ~ ., data=blackcomb_earlyGame[-c(1,2), ])
Error in na.fail.default(list(max_cohort = c(47, 25, 20, 37, 1, 0, 23,  : 
missing values in object

The error message is clear enough. I have seen it before, and in those cases there really were missing data, but this time there are none.

> class(blackcomb_earlyGame$max_cohort)
[1] "numeric"
> which(is.na(blackcomb_earlyGame$max_cohort))
integer(0)

I've tried using na.roughfix to see if that will help, but I get the following error.

Error in na.roughfix.data.frame(list(max_cohort = c(47, 25, 20, 37, 1,  : 
na.roughfix only works for numeric or factor

I've checked every vector to make sure that none of them contain any NAs, and none of them do.

Does anyone have any suggestions?

Brad Davis
  • Can you show the output of `sapply(blackcomb_earlyGame, function(x) any(is.na(x)))`? – Gregor Thomas Jul 05 '16 at 18:04
  • I will try to. My R server crashed, restarting it and reloading the data. – Brad Davis Jul 05 '16 at 18:05
  • Most likely, you have a column of type character. Can you post the output from `str(blackcomb_earlyGame)`? – dww Jul 05 '16 at 21:53
  • Yes, that seems to be the problem. There was a column that I had thought I'd cast to a factor but it was a character. – Brad Davis Jul 05 '16 at 22:59
  • @dww please post this as an answer so OP can accept it as the solution. Also, please consider providing a reproducible example in the future. – Roman Luštrik Jul 06 '16 at 07:26
  • I can't provide a reproducible example in this case because I'm using private proprietary data. Since the problem was w/ the data I couldn't make a toy example that had the same problem. That said, the advice dww provided helped me identify the problem. And the bit of code that Gregor provided has made it much easier for me to find NAs in columns. Previously it was a rather onerous process. – Brad Davis Jul 06 '16 at 17:46
  • And I don't understand why anyone is down voting this question. I had a problem that I couldn't sort out, and someone showed how to identify where the problem came from. – Brad Davis Jul 06 '16 at 17:48
  • I'm guessing that downvotes may be due to not including data to make this a reproducible example. If your data are confidential, they can easily be anonymised (see [here](https://stackoverflow.com/questions/10454973/how-to-create-example-data-set-from-private-data-replacing-variable-names-and-l)). I would recommend adding a few rows of anonymised data as an update to the question, to avoid downvotes. Using the function in the Q I linked to, type `dput(head(anonymiseColumns(blackcomb_earlyGame)))` and paste into your question as an addendum. – dww Jul 06 '16 at 19:14

2 Answers

randomForest can fail because of several kinds of problems in the data: missing values (NA), NaN, Inf or -Inf, and character columns that have not been converted to factors. Each of these fails with a different error message.

Here are some examples of the error messages generated by each of these issues:

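library(randomForest)

# stringsAsFactors = TRUE is set explicitly: it was the default before R 4.0.0, but not since
my.df <- data.frame(a = 1:26, b=letters, c=(1:26)+rnorm(26), stringsAsFactors = TRUE)
rf <- randomForest(a ~ ., data=my.df)
# this works without issues, because b=letters is cast into a factor variable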

my.df$d <- LETTERS    # Now we add a character column
rf <- randomForest(a ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) : 
#   NA/NaN/Inf in foreign function call (arg 1)
# In addition: Warning message:
#   In data.matrix(x) : NAs introduced by coercion

rf <- randomForest(d ~ ., data=my.df)
# Error in y - ymean : non-numeric argument to binary operator
# In addition: Warning message:
#   In mean.default(y) : argument is not numeric or logical: returning NA

my.df$d <- c(NA, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in na.fail.default(list(a = 1:26, b = 1:26, c = c(3.14586293058335,  : 
#   missing values in object

my.df$d <- c(Inf, rnorm(25))
rf <- randomForest(a ~ ., data=my.df)
rf <- randomForest(d ~ ., data=my.df)
# Error in randomForest.default(m, y, ...) : 
#   NA/NaN/Inf in foreign function call (arg 1)

Interestingly, the error message you received, which was caused by a character column in the data frame (see comments), is the one I see when a numeric column contains NA. This suggests either (1) that different versions of randomForest produce different errors, or (2) that the error message depends in more complex ways on the structure of the data. Either way, anyone who receives errors like these should check the data for every issue listed above in order to track down the cause.
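One way to run all of those checks at once is sketched below. The helper `check_columns` and the `toy` data frame are made up for illustration and are not part of randomForest; the sketch only assumes base R. It reports each column's class, the number of NA/NaN values, and the number of Inf/-Inf values, so a stray character column or a non-finite value shows up in a single table.

```r
# Hypothetical helper (not part of randomForest): summarise every column of a
# data frame by class, count of NA/NaN values, and count of Inf/-Inf values.
check_columns <- function(df) {
  data.frame(
    column = names(df),
    class  = vapply(df, function(x) class(x)[1], character(1)),
    # is.na() is TRUE for both NA and NaN
    n_na   = vapply(df, function(x) sum(is.na(x)), integer(1)),
    # is.infinite() is only meaningful for numeric columns
    n_inf  = vapply(df, function(x) {
      if (is.numeric(x)) sum(is.infinite(x)) else 0L
    }, integer(1)),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# Toy data, invented for illustration
toy <- data.frame(
  y  = c(1, 2, 3, 4),
  x1 = c(0.5, NA, Inf, 2),
  x2 = c("a", "b", "a", "b"),
  stringsAsFactors = FALSE
)

report <- check_columns(toy)
print(report)

# Convert any character columns to factors before fitting
is_chr <- vapply(toy, is.character, logical(1))
toy[is_chr] <- lapply(toy[is_chr], factor)
```

Any row of the report with class "character", or with a non-zero `n_na` or `n_inf`, points at a column that needs attention before calling randomForest.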

dww

Perhaps there are Inf or -Inf values?

is.na(c(1, NA, Inf, NaN, -Inf))
#[1] FALSE  TRUE FALSE  TRUE FALSE

is.finite(c(1, NA, Inf, NaN, -Inf))
#[1]  TRUE FALSE FALSE FALSE FALSE
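To apply the same idea to a whole data frame, something like the following sketch could be used. The `toy` data frame is invented for illustration; the code assumes only base R and counts the non-finite values in each numeric column, then lists the rows that contain them.

```r
# Toy data, invented for illustration
toy <- data.frame(
  y = c(1, 2, 3, 4, 5),
  x = c(0.1, Inf, 0.3, NaN, -Inf)
)

# Restrict the check to numeric columns, since is.finite() is FALSE
# for every element of a character vector
num_cols <- vapply(toy, is.numeric, logical(1))

# Number of NA, NaN, Inf or -Inf values in each numeric column
bad <- vapply(toy[num_cols], function(x) sum(!is.finite(x)), integer(1))

# Rows in which at least one numeric column is not finite
bad_rows <- which(Reduce(`|`, lapply(toy[num_cols], function(x) !is.finite(x))))

print(bad)
print(bad_rows)
```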
Robert Hijmans