
While using the predict function in R to get predictions from a Random Forest model, I mistakenly specified the training data as newdata, as follows:

RF1pred <- predict(RF1, newdata=TrainS1, type = "class")

Used like this, I get extremely high accuracy and AUC, which I am sure is not right, but I couldn't find a good explanation for it. This thread is the closest I got, but I can't say I fully understand the explanation there.

If someone could elaborate, I will be grateful.

Thank you!

EDIT: Important to note: I get sensible accuracy and AUC if I run the prediction without specifying a dataset at all, like so:

RF1pred <- predict(RF1, type = "class")

If a new dataset is not explicitly specified, isn't the training data used for prediction? If so, shouldn't I get the same results from both lines of code?

EDIT2: Here is sample code with random data that illustrates the point. When predicting without specifying newdata, the AUC is 0.4893. When newdata=train is explicitly specified, the AUC is 0.7125.

# Generate sample data
set.seed(15)
train <- data.frame(x1=sample(0:1, 100, replace=T), x2=rpois(100,10), y=sample(0:1, 100, replace=T))

# Build random forest (x1 is numeric 0/1 here, so this fits a regression forest; y is unused in this toy model)
library(randomForest)
model <- randomForest(x1 ~ x2, data=train)
pred1 <- predict(model)                  # no newdata: out-of-bag (OOB) predictions
pred2 <- predict(model, newdata = train) # newdata=train: trees applied to their own training rows

# Calculate AUC
library(ROCR)
ROCRpred1 <- prediction(pred1, train$x1)
AUC <- as.numeric(performance(ROCRpred1, "auc")@y.values)
AUC  # 0.4893
ROCRpred2 <- prediction(pred2, train$x1)
AUC <- as.numeric(performance(ROCRpred2, "auc")@y.values)
AUC  # 0.7125
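
For what it's worth, the first vector appears to be exactly the OOB predictions that the fitted object already stores. A quick check of this (a sketch relying on the documented predicted component of a randomForest fit):

# Sketch: predict() without newdata should return the stored out-of-bag predictions
all.equal(as.numeric(pred1), as.numeric(model$predicted))  # expected TRUE
# pred2, in contrast, comes from re-running the trees on the same rows they were grown from
summary(pred1 - pred2)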
DGenchev
  • I think that previous question does answer yours. You get such high accuracy because you are applying the derived algorithm to the data from which it was derived. In other words, you are running an in-sample test of model fit. – ulfelder Jul 23 '15 at 18:10
  • OK, I should have mentioned that I get normal results when I skip the newdata option. If newdata is not explicitly specified, isn't the algorithm again applied to the same (training) data? – DGenchev Jul 23 '15 at 18:17
  • What package/function are you using to run Random Forests? – ulfelder Jul 23 '15 at 18:19
  • I am using the randomForest package. – DGenchev Jul 23 '15 at 18:24
  • Are you doing regression or classification? – ulfelder Jul 23 '15 at 18:33
  • In this case it's classification. – DGenchev Jul 23 '15 at 19:00
  • I've used `randomForest` for prediction on classification problems before, so I have a couple of ideas about what might be going on, but it's hard to say without a reproducible example. Even if you can't share the data you're using, it would help a lot to share all of the code you're using, and including the data or some facsimile thereof would be ideal. – ulfelder Jul 23 '15 at 19:23
  • FWIW, I just ran this exercise on a classification example I have on my hard drive and got identical results from `predict(rf, type = "class")` and `predict(rf, newdata = subdat, type = "class")` following `rf <- randomForest(f.rf, data = subdat, na.action = na.exclude, ntree = 1000, mtry = 3, cutoff = c(0.2,0.8))`. So I'm guessing there's a bug somewhere else in your process. – ulfelder Jul 23 '15 at 19:40
  • `RF1 <- randomForest(y ~ ., data = Train)`; `RF1pred <- predict(RF1, type="class")`. Alternatively, `RF1pred <- predict(RF1, newdata = Train, type="class")`. – DGenchev Jul 23 '15 at 20:33
  • What do you get if you compare those two vectors returned by `predict()`? If they don't include NAs, you can just use `sum(RF1pred.1 == RF1pred.2) / length(RF1pred.1)`. That should equal 1. If they do include NAs, you could use `sum(RF1pred.1 == RF1pred.2, na.rm=TRUE) / (length(RF1pred.1) - sum(is.na(RF1pred.1)))`. – ulfelder Jul 23 '15 at 20:49
  • `sum(RF1pred.1 == RF1pred.2) / length(RF1pred.1)` yields 0. The two vectors look quite different. I will assemble the data tomorrow and share it with you. – DGenchev Jul 23 '15 at 21:16
  • Yeah, that's bizarre. – ulfelder Jul 23 '15 at 21:28

1 Answer

If you look at the documentation for predict.randomForest, you will see that if you do not supply a new data set, you get the out-of-bag (OOB) predictions, i.e. the OOB performance of the model. Since the OOB performance is theoretically related to the performance of your model on a different data set, the results will be much more realistic (although still not a substitute for a real, independently collected validation set).
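
As an illustration (a minimal sketch, assuming a Train data frame with a factor outcome y, analogous to the setup in the question; the object names are hypothetical), the two calls give very different confusion matrices:

library(randomForest)

# Classification forest, mirroring RF1 <- randomForest(y ~ ., data = Train) from the question
RF1 <- randomForest(y ~ ., data = Train)                    # y must be a factor for classification

oob_pred  <- predict(RF1, type = "class")                   # no newdata: out-of-bag class predictions
self_pred <- predict(RF1, newdata = Train, type = "class")  # trees re-applied to their own training rows

table(OOB = oob_pred, truth = Train$y)        # realistic error, close to RF1's reported OOB error rate
table(InSample = self_pred, truth = Train$y)  # overly optimistic, often near-perfect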

rocrat44