I performed some tests and it appears that the R2 and RMSE returned by pls::R2 and pls::RMSEP are not the mean of the per-fold statistics. They are calculated from all held-out samples at once, by pooling the predictions from all 10 CV folds and comparing them to the observed outcomes.
Here is an example:
library(pls)
Fit a model with the built-in yarn data set:
data(yarn)
pls.fa <- plsr(density ~ NIR,
data = yarn,
ncomp = 6,
scale = TRUE,
validation = "CV",
segments = 10,
jackknife = TRUE)
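Before comparing anything, it helps to see where the cross-validated predictions live. The snippet below is just a quick inspection sketch; it assumes the usual structure of the validation component (a prediction array of dimension observations x responses x components, plus the list of CV segments), which is how the pls documentation describes it:
# cross-validated predictions stored on the fitted object
dim(pls.fa$validation$pred)        # observations x responses x components
# row indices held out in each CV fold
str(pls.fa$validation$segments)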
I will use the equivalent caret functions for comparison.
The following code returns the CV RMSE obtained with 1 to 6 components:
pls::RMSEP(pls.fa, ncomp = 1:6, estimate = "CV", intercept = FALSE)
#output
1 comps 2 comps 3 comps 4 comps 5 comps 6 comps
8.4692 2.5553 1.9430 1.0151 0.7399 0.5801
To extract the RMSE in the form of a numeric vector:
unlist(lapply(1:6, function(x) pls::RMSEP(pls.fa,
ncomp = 1:6,
estimate = "CV",
intercept = FALSE)$val[,,x]))
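As a side note, this calls RMSEP six times inside lapply; if I read the structure of the returned val array correctly (a 1 x 1 x 6 array here), a single call plus drop() should give the same values as a named vector. A sketch, not verified beyond this data set:
# same values with one call, dropping the singleton dimensions of $val
drop(pls::RMSEP(pls.fa, ncomp = 1:6, estimate = "CV", intercept = FALSE)$val)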
Let's compare the output with caret::RMSE applied to all the pooled CV predictions at once:
all.equal(
unlist(lapply(1:6, function(x) caret::RMSE(pls.fa$validation$pred[,,x],
yarn$density))),
unlist(lapply(1:6, function(x) pls::RMSEP(pls.fa,
ncomp = 1:6,
estimate = "CV",
intercept = FALSE)$val[,,x])))
#output
TRUE
So the CV RMSEP is indeed calculated using all holdout predictions at once.
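To see how this differs from a per-fold average, both quantities can be recomputed by hand from the stored CV predictions. This is only a sketch, assuming validation$segments holds the row indices held out in each fold (as documented for mvrCv):
# pooled RMSE over all holdout predictions, 6-component model
pred6 <- pls.fa$validation$pred[, 1, 6]
sqrt(mean((pred6 - yarn$density)^2))

# mean of the per-fold RMSEs for the same number of components
fold.rmse <- sapply(pls.fa$validation$segments, function(idx)
  sqrt(mean((pred6[idx] - yarn$density[idx])^2)))
mean(fold.rmse)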
The same holds for the R2:
all.equal(
unlist(lapply(1:6, function(x) caret::R2(pls.fa$validation$pred[,,x],
yarn$density,
formula = "traditional"))),
unlist(lapply(1:6, function(x) pls::R2(pls.fa,
ncomp = 1:6,
estimate = "CV",
intercept = FALSE)$val[,,x])))
#output
TRUE
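For completeness, the "traditional" R2 that both functions agree on here is, as far as I can tell, the familiar 1 - SS_res/SS_tot computed from the pooled CV predictions. A quick hand-rolled check for the 6-component model:
# traditional R2 from the pooled CV predictions, 6-component model
pred6 <- pls.fa$validation$pred[, 1, 6]
1 - sum((yarn$density - pred6)^2) / sum((yarn$density - mean(yarn$density))^2)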
EDIT: to answer the question in the comments:
Which way is better: to average the RMSE over the folds, or to pool all the predictions from the folds and calculate one RMSE?
In my opinion either way is fine; one just needs to be consistent in the computation when comparing models. Consider the following example:
set.seed(1)
true <- rnorm(100)
fold <- sample(1:10, size = 100, replace = TRUE)
pred <- rnorm(100)
z <- data.frame(true, pred, fold)
library(tidyverse)
z %>%
group_by(fold) %>%
summarise(rmse = caret::RMSE(true, pred)) %>%
pull(rmse) %>%
mean
#output
1.479923
z %>%
summarise(rmse = caret::RMSE(true, pred)) %>%
pull(rmse)
#output
1.441471
Here averaging over the folds gives a more pessimistic result than pooling all predictions and calculating one RMSE.
Using the same code with set.seed(2):
averaging over the folds: 1.442483
pooling all predictions: 1.500432
Now averaging over the folds is more optimistic.
So neither approach is consistently more optimistic than the other.
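If you want to convince yourself that the difference can go either way, a quick simulation over many seeds does the job. This is just a sketch of that check, not part of the original comparison:
# difference between fold-averaged and pooled RMSE over many random seeds
diffs <- sapply(1:100, function(s) {
  set.seed(s)
  true <- rnorm(100)
  fold <- sample(1:10, size = 100, replace = TRUE)
  pred <- rnorm(100)
  per.fold <- mean(sapply(split(seq_along(true), fold), function(idx)
    caret::RMSE(pred[idx], true[idx])))
  pooled <- caret::RMSE(pred, true)
  per.fold - pooled
})
mean(diffs > 0)  # fraction of seeds where fold-averaging is more pessimistic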