
Please consider this minimal reproducible example of a random forest regression estimate:

library(randomForest)

# fix missing data
airquality <- na.roughfix(airquality)

set.seed(123)
#fit the random forest model
rf_fit <- randomForest(formula = Ozone ~ .,  data = airquality)

#define new observation
new <- data.frame(Solar.R=250, Wind=8, Temp=70, Month=5, Day=5)

set.seed(123)
#use predict all on new observation
rf_predict<-predict(rf_fit, newdata=new, predict.all = TRUE)

rf_predict$aggregate

library(tidyverse)

predict_mean <- rf_predict$individual %>% 
  as_tibble() %>% 
  rowwise() %>% 
  transmute(avg = mean(V1:V500))

predict_mean

I was expecting rf_predict$aggregate and predict_mean to give the same value.

Where and why am I wrong about this assumption?

My final objective is to get a confidence interval of the predicted value.

maxbre

1 Answer


I believe your code needs to include a c_across() call for the calculation to be performed correctly.

The ?c_across documentation tells us:

c_across() is designed to work with rowwise() to make it easy to perform row-wise aggregations.

predict_mean <- rf_predict$individual %>% 
  as_tibble() %>% 
  rowwise() %>% 
  transmute(avg = mean(c_across(V1:V500)))

>predict_mean
[1] 30.5

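An answer to a previous question points out that mean() can't handle a data.frame. In your code, though, V1:V500 inside mean() is not read as a column selection: under rowwise(), V1 and V500 are the single values in the current row, so `:` builds a numeric sequence running from the first value to the last, and mean() averages that sequence rather than the 500 tree predictions. c_across() lets V1:V500 act as a column selection, so the values in the row are presented to mean() as a vector.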
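To make the difference between `mean(a:d)` and `mean(c_across(a:d))` concrete, here is a minimal sketch on a toy tibble. The tibble `toy` and its values are made up for illustration and are not taken from the random forest output. In the first row the two results agree only because the values happen to be 1, 2, 3, 4; in the second row the middle columns differ and the results diverge.

```r
library(dplyr)
library(tibble)

# Hypothetical toy data: two rows with the same first and last value
toy <- tibble(a = c(1, 1), b = c(2, 10), c = c(3, 20), d = c(4, 4))

res <- toy %>%
  rowwise() %>%
  transmute(
    colon  = mean(a:d),            # mean of the sequence from a to d; b and c are ignored
    across = mean(c_across(a:d))   # mean of the row's values in columns a through d
  ) %>%
  ungroup()

res
```

Row 1 gives 2.5 in both columns, while row 2 gives 2.5 for `colon` and 8.75 for `across`. This is probably why the small `a:d` example appears to work without c_across().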

xilliam
  • thank you for pointing out what I must consider my "big blunder"; but on second thought I'm still a little bit confused by this... `library(tidyverse) t <- tibble(a=1, b=2, c=3, d=4) # why this is working here t %>% rowwise() %>% transmute(avg=mean(a:d)) t %>% rowwise() %>% transmute(avg=mean(c_across(a:d))) # to be noted that with c_across # you can use the tidy select semantics everything(), which is a quite handy feature t %>% rowwise() %>% transmute(avg=mean(c_across(cols = everything())))` – maxbre Jan 05 '23 at 10:36
  • Interesting. I've added some explanation of how I understand the issue. – xilliam Jan 05 '23 at 12:08
  • I guess `rowMeans()` is the appropriate base function to be used instead of `mean`; `t %>% rowwise() %>% transmute(avg=rowMeans(.))` but still, remains (to me at least) the issue (my doubt!) why in my previous example this IS NOT failing: `t %>% rowwise() %>% transmute(avg=mean(c_across(a:d)))` – maxbre Jan 05 '23 at 14:01
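The rowMeans() idea from the comments can also be applied directly to the matrix of per-tree predictions, without any rowwise() step. The sketch below uses a simulated matrix named `individual` as a stand-in for rf_predict$individual (an assumption, so that it runs without fitting a model). The quantiles only describe the spread of the individual tree predictions, which is a rough proxy and not a formal confidence interval.

```r
# Simulated stand-in for rf_predict$individual:
# 1 new observation (row) x 500 trees (columns); values are made up
set.seed(123)
individual <- matrix(rnorm(500, mean = 30, sd = 8), nrow = 1)

# One mean per observation, averaging over all trees
agg <- rowMeans(individual)

# Empirical 2.5% and 97.5% quantiles of the per-tree predictions, one row per observation
spread <- t(apply(individual, 1, quantile, probs = c(0.025, 0.975)))

agg
spread
```

With the real object, `rowMeans(rf_predict$individual)` should match `rf_predict$aggregate` for a regression forest, since the aggregate is the average over trees.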