
I have researched this extensively without finding a solution. I have cleaned my data set as follows:

library("raster")

# Replace NA, NaN and Inf entries in each column with the column mean
impute.mean <- function(x) {
  replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x, na.rm = TRUE))
}
losses <- apply(losses, 2, impute.mean)

# Count the NA, Inf and NaN values left in each column
colSums(is.na(losses))
isinf <- function(x) (NA <- is.infinite(x))
infout <- apply(losses, 2, is.infinite)
colSums(infout)
isnan <- function(x) (NA <- is.nan(x))
nanout <- apply(losses, 2, is.nan)
colSums(nanout)

The problem arises when I run predict:

options(warn=2)
p  <-   predict(default.rf, losses, type="prob", inf.rm = TRUE, na.rm=TRUE, nan.rm=TRUE)

Everything I have read says the cause should be NAs, Infs or NaNs in the data, but I can't find any. I am making the data and the randomForest summary available for sleuthing at [deleted]. The traceback doesn't reveal much (to me, anyway):

4: .C("classForest", mdim = as.integer(mdim), ntest = as.integer(ntest), 
       nclass = as.integer(object$forest$nclass), maxcat = as.integer(maxcat), 
       nrnodes = as.integer(nrnodes), jbt = as.integer(ntree), xts = as.double(x), 
       xbestsplit = as.double(object$forest$xbestsplit), pid = object$forest$pid, 
       cutoff = as.double(cutoff), countts = as.double(countts), 
       treemap = as.integer(aperm(object$forest$treemap, c(2, 1, 
           3))), nodestatus = as.integer(object$forest$nodestatus), 
       cat = as.integer(object$forest$ncat), nodepred = as.integer(object$forest$nodepred), 
       treepred = as.integer(treepred), jet = as.integer(numeric(ntest)), 
       bestvar = as.integer(object$forest$bestvar), nodexts = as.integer(nodexts), 
       ndbigtree = as.integer(object$forest$ndbigtree), predict.all = as.integer(predict.all), 
       prox = as.integer(proximity), proxmatrix = as.double(proxmatrix), 
       nodes = as.integer(nodes), DUP = FALSE, PACKAGE = "randomForest")
3: predict.randomForest(default.rf, losses, type = "prob", inf.rm = TRUE, 
       na.rm = TRUE, nan.rm = TRUE)
2: predict(default.rf, losses, type = "prob", inf.rm = TRUE, na.rm = TRUE, 
       nan.rm = TRUE)
1: predict(default.rf, losses, type = "prob", inf.rm = TRUE, na.rm = TRUE, 
       nan.rm = TRUE)
Elliott
  • Hard to tell without more information about the forest itself (your file contained only the data). But I do wonder where you got the idea that `inf.rm`, `na.rm` or `nan.rm` were arguments for `predict.randomForest`. They certainly aren't in the documentation. – joran Feb 23 '14 at 04:39
  • The zip file contained the RF summary. It is no longer available. The NA, Inf and NaN are forms of missing or uncomputable data that can prevent RF from running. Nate's answer works. – Elliott Feb 24 '14 at 13:12
  • I know perfectly well what NA, Inf and NaN are. I was pointing out that those arguments simply do not exist for that predict function. They are ignored completely. – joran Feb 24 '14 at 14:14
  • @joran problem was they weren't being ignored, thanks – Elliott Feb 24 '14 at 20:46
  • No, that is wrong. Passing non-existent arguments causes them to be passed via `...` which, per the documentation, is ignored. Each of those three arguments are ignored entirely. – joran Feb 24 '14 at 20:51
  • @joran dude, the predict statement *did not run* with the "ignore" statements. it did run once the data was cleaned using the code in the answer. your hostility is way over the top, dude, so chill, please – Elliott Feb 25 '14 at 15:17
  • I don't see how anything I've said could be seen as hostile, but I'm sorry if you've seen it that way. Perhaps we're misunderstanding each other. The predict statement did not run because (as pointed out in the correct answer below) you hadn't completely removed the NAs, NaNs, etc. But the `inf.rm = TRUE, na.rm=TRUE, nan.rm=TRUE` arguments really are ignored, and have no effect at all. That was my only point: you have to remove those values manually; there are no arguments to `predict.randomForest` with those names. – joran Feb 25 '14 at 15:35
  • @joran those with ears hear, dude. thanks for the clarification – Elliott Feb 26 '14 at 15:34
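
The point made in the comments about `...` can be illustrated with a minimal sketch. The function `predict_sketch` below is a hypothetical stand-in, not randomForest's code; it only assumes a signature that ends in `...` and never reads what is passed there.

```r
# Hypothetical stand-in for a predict method; this is NOT randomForest's code.
predict_sketch <- function(object, newdata, ...) {
  # Arguments captured by `...` are never read here, so they change nothing.
  nrow(newdata)
}

newdata <- data.frame(x = c(1, NA, Inf))

plain <- predict_sketch(NULL, newdata)
extra <- predict_sketch(NULL, newdata, na.rm = TRUE, inf.rm = TRUE, nan.rm = TRUE)

# Both calls behave identically: the extra arguments are silently swallowed.
identical(plain, extra)
```

Under that assumption, the unknown arguments neither raise an error nor clean the data, so the NA and Inf values still reach the underlying code.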

2 Answers


Your code is not entirely reproducible (it never actually runs the randomForest algorithm), but the problem is that you are not replacing Inf values with the column means. The na.rm = TRUE argument in the call to mean() inside your impute.mean function does exactly what it says: it removes NA values (and not Inf ones). The mean of a column that contains Inf is therefore itself Inf, so the entries you meant to impute are overwritten with Inf.

You can see this, for example, by:

impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x, na.rm = TRUE))
losses <- apply(losses, 2, impute.mean)
sum( apply( losses, 2, function(.) sum(is.infinite(.))) )
# [1] 696

To get rid of infinite values, use:

impute.mean <- function(x) replace(x, is.na(x) | is.nan(x) | is.infinite(x), mean(x[!is.na(x) & !is.nan(x) & !is.infinite(x)]))
losses <- apply(losses, 2, impute.mean)
sum(apply( losses, 2, function(.) sum(is.infinite(.)) ))
# [1] 0
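
The same effect can be reproduced without the original data. The sketch below uses a small hypothetical matrix `toy` and the helper names `impute_naive` and `impute_finite` (neither is from the question); it relies on `is.finite()`, which is FALSE for NA, NaN and Inf alike, as a shorter way to write the three separate checks.

```r
# Toy matrix (hypothetical data): each column has one problem value.
toy <- cbind(a = c(1, 2, Inf, 4), b = c(10, NA, 30, 40))

# Original approach: the replacement value is mean(x, na.rm = TRUE),
# which is itself Inf whenever the column contains Inf.
impute_naive <- function(x) replace(x, !is.finite(x), mean(x, na.rm = TRUE))

# Variant of the fix: average only the finite entries.
impute_finite <- function(x) replace(x, !is.finite(x), mean(x[is.finite(x)]))

naive <- apply(toy, 2, impute_naive)
fixed <- apply(toy, 2, impute_finite)

sum(is.infinite(naive))  # the Inf in column a survives
sum(!is.finite(fixed))   # nothing non-finite remains
```

Note that a column with no finite values at all would still yield NaN, so this variant assumes every column has at least one usable entry.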
Nate Pope

One cause of the error message

NA/NaN/Inf in foreign function call (arg X)

when training a randomForest is having character-class variables in your data.frame. If the error comes with the warning

NAs introduced by coercion

check that all of your character variables have been converted to factors.

Example

set.seed(1)
dat <- data.frame(
  a = runif(100),
  b = rpois(100, 10),
  c = rep(c("a","b"), 50),  # 100 values, matching the other columns
  stringsAsFactors = FALSE
)

library(randomForest)
randomForest(a ~ ., data = dat)

Yields:

Error in randomForest.default(m, y, ...) : 
  NA/NaN/Inf in foreign function call (arg 1)
In addition: Warning message:
In data.matrix(x) : NAs introduced by coercion

But switch it to stringsAsFactors = TRUE and it runs.
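
If the data frame already exists, or is read in under R 4.0 or later (where stringsAsFactors defaults to FALSE), the character columns can be converted after the fact. This is a minimal base-R sketch on a hypothetical data frame shaped like the example; the variable name `is_chr` is illustrative.

```r
# Hypothetical data frame mirroring the example, with one character column.
dat <- data.frame(
  a = runif(100),
  b = rpois(100, 10),
  c = rep(c("a", "b"), 50),
  stringsAsFactors = FALSE
)

# Find the character columns and convert each one to a factor.
is_chr <- vapply(dat, is.character, logical(1))
dat[is_chr] <- lapply(dat[is_chr], factor)

vapply(dat, class, character(1))  # column c should now be a factor
```

Only the character columns are touched, so numeric predictors keep their types.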

Sam Firke