
I am doing partial least squares regression with the pls R package (Mevik, 2007). The model with 10-fold cross-validation is as follows:

pls.fa <- plsr(FA ~ ., ncomp = xcomp, scale = TRUE, validation = "CV", segments = 10, jackknife = TRUE, data = train)

I can then print out accuracy measures such as R2 or RMSE using:

R2(pls.fa,ncomp=1:xcomp)

where xcomp is the optimal number of components. The R2 results, for example, look like this:

(Intercept)      1 comps      2 comps      3 comps      4 comps      5 comps      6 comps      7 comps      8 comps      9 comps  
  -0.009828     0.551053     0.570584     0.574790     0.580414     0.583354     0.585812     0.580690     0.581536     0.595441  
   10 comps  
   0.596096  

My question is: what is the R2 produced by this cross-validation? Is it the mean over the 10 folds?

Thanks

1 Answer


I performed some tests and it appears the R2 and RMSE returned by pls::R2 and pls::RMSEP are not the mean statistics of the 10 folds. They are calculated on all held-out samples at once: the predictions from all 10 CV folds are pooled and compared to the observed outcomes.

Here is an example:

library(pls)

Fit a model with the built-in yarn data set:

data(yarn)
pls.fa <- plsr(density ~ NIR,
               data = yarn,
               ncomp = 6,
               scale = TRUE,
               validation = "CV",
               segments = 10,
               jackknife = TRUE)

I will use the equivalent caret functions for comparison.

The following code returns the cross-validated RMSE obtained with 1 to 6 components:

pls::RMSEP(pls.fa, ncomp = 1:6, estimate = "CV", intercept = FALSE) 
#output
1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  
 8.4692   2.5553   1.9430   1.0151   0.7399   0.5801  

To extract the RMSE in the form of a numeric vector:

unlist(lapply(1:6, function(x) pls::RMSEP(pls.fa,
                                          ncomp = 1:6,
                                          estimate = "CV",
                                          intercept = FALSE)$val[,,x]))

Let's compare the output with caret::RMSE computed on all the pooled CV predictions:

all.equal(
  unlist(lapply(1:6, function(x) caret::RMSE(pls.fa$validation$pred[,,x],
                                             yarn$density))),
  unlist(lapply(1:6, function(x) pls::RMSEP(pls.fa,
                                            ncomp = 1:6,
                                            estimate = "CV",
                                            intercept = FALSE)$val[,,x])))
#output  
TRUE

So RMSEP is calculated using all holdout predictions pooled together.
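
For reference, the same pooled RMSE can also be reproduced by hand from the cross-validated predictions stored in pls.fa$validation$pred (a minimal sketch, not the internal pls code); it should match the RMSEP values above for the same fitted object:

# pooled CV RMSE per number of components, computed directly
sapply(1:6, function(x)
  sqrt(mean((yarn$density - pls.fa$validation$pred[, , x])^2)))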

Equivalently, for R2:

all.equal(
  unlist(lapply(1:6, function(x) caret::R2(pls.fa$validation$pred[,,x],
                                           yarn$density,
                                           form = "traditional"))),
  unlist(lapply(1:6, function(x) pls::R2(pls.fa,
                                         ncomp = 1:6,
                                         estimate = "CV",
                                         intercept = FALSE)$val[,,x])))
#output  
TRUE
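
The pooled "traditional" R2 can likewise be written out by hand, assuming the usual 1 - SSE/SST definition (a sketch under that assumption, again using the stored CV predictions):

# pooled CV R2 per number of components: 1 - SSE/SST on the CV predictions
sapply(1:6, function(x) {
  pred <- pls.fa$validation$pred[, , x]
  1 - sum((yarn$density - pred)^2) / sum((yarn$density - mean(yarn$density))^2)
})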

EDIT: to answer the question from the comments:

Which way is better: to average the RMSE over the folds, or to pool all predictions from the folds and calculate one RMSE?

In my opinion either way is fine; one just needs to be consistent in the computation when comparing models. Consider the following example:

set.seed(1)
true <- rnorm(100)
fold <- sample(1:10, size = 100, replace = TRUE)
pred <- rnorm(100)

z <- data.frame(true, pred, fold)

library(tidyverse)

z %>%
  group_by(fold) %>%
  summarise(rmse = caret::RMSE(true, pred)) %>%
  pull(rmse) %>%
  mean
#output
 1.479923
    
z %>%
  summarise(rmse = caret::RMSE(true, pred)) %>%
  pull(rmse) 
#output
1.441471

Here averaging over the folds gives a more pessimistic result than pooling all predictions and calculating one RMSE.

Using the same code with set.seed(2):

averaging over the folds: 1.442483; pooling all predictions: 1.500432

Now averaging over the folds is more optimistic.

So neither way is consistently more optimistic than the other.
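
To see this more generally, one can repeat the toy simulation over several seeds and compare the two aggregations side by side (a small sketch in base R, independent of pls):

# RMSE averaged over folds vs. RMSE on the pooled predictions, per seed
compare_once <- function(seed) {
  set.seed(seed)
  true <- rnorm(100)
  fold <- sample(1:10, size = 100, replace = TRUE)
  pred <- rnorm(100)
  per_fold <- sapply(split(seq_along(true), fold),
                     function(i) sqrt(mean((true[i] - pred[i])^2)))
  c(averaged = mean(per_fold), pooled = sqrt(mean((true - pred)^2)))
}
t(sapply(1:5, compare_once))

Which column is larger changes from seed to seed, as the two seeds above already show.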

  • It is amazingly useful info for me, thanks very much @missuse. Does this mean that each time the model does cross-validation, it saves the predicted values, and after all 10 folds it pulls them together and compares them with the observed values? – Phuong Ho Apr 23 '18 at 07:16
  • @Phuong Ho Glad to help. Yes, this is correct: all the predictions from CV are pooled and compared to the observed values. Edited the answer to be more understandable. However, since the output pls object contains the indexes of the observations in the folds, you may calculate any metric you desire with customized functions (see the sketch after these comments). – missuse Apr 23 '18 at 08:40
  • @missuse: would you consider calculating R2 as in the method above (i.e. using all holdout CV predictions and their actual values) better than calculating R2 for each CV fold and averaging them? I found the all-holdout one produces a higher R2 than averaging the R2 of all 10 folds. – hn.phuong Apr 23 '18 at 09:54
  • @hn.phuong I usually calculate R2 for each fold and average them, simply because this way provides some estimate of the variance. But I do not think that is important; what matters more is to be consistent when comparing models. Either way works fine. – missuse Apr 23 '18 at 10:05
  • @missuse: I am asking this as I am doing the same model using h2o.deeplearning, where I averaged the R2 of each fold, and certainly we can look at the variance as well. However, as I mentioned in the previous comment, I found that this mean R2 is smaller than the holdout R2, which might lead to overly optimistic results. – hn.phuong Apr 23 '18 at 10:18
  • Check edit in which I show one way of calculation is not always more optimistic than the other. Generally I think 10-fold CV is optimistic and that it should be performed only when the data is small. I prefer 2-5 fold CV with 3-5 repetitions. – missuse Apr 23 '18 at 10:37
  • Thanks a lot, I have just read that. It is helpful. But according to this post (https://stackoverflow.com/questions/43613443/difference-between-cross-val-score-and-cross-val-predict) and my own experience, when we increase the number of folds, pooling all predictions is often more optimistic. – hn.phuong Apr 23 '18 at 10:40
  • @hn.phuong Generally I agree with the post you linked; however, this depends a lot on how much data you have, the structure of the data (for instance, most of the outliers can end up in one fold), and how you sampled for the folds (stratified, with blocking...). In general I tend to try several things in CV and compare the performance returned by CV to the performance on a separate validation set. I am most content when they are close. Additionally, two established R packages used for ML, `mlr` and `caret`, both tend to compute the metrics over each fold and average by default. – missuse Apr 23 '18 at 11:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/169587/discussion-between-missuse-and-hn-phuong). – missuse Apr 23 '18 at 11:46
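
Following up on the comment above about custom per-fold metrics: a possible sketch using the yarn model from the answer, assuming the fold indices are available in pls.fa$validation$segments (the list of cross-validation segments kept by the pls cross-validation object):

# per-fold RMSE for a chosen number of components (here 6), then averaged
nc <- 6
fold_rmse <- sapply(pls.fa$validation$segments, function(idx)
  sqrt(mean((yarn$density[idx] - pls.fa$validation$pred[idx, , nc])^2)))
mean(fold_rmse)  # fold-averaged RMSE, to compare with the pooled RMSEP above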