
I am referring to the scheme described on page 11 of this paper: https://arxiv.org/pdf/1904.10890.pdf.


Instead of having a test set and a large training set broken down into 5 folds, you take the whole dataset and create 6 folds. Each of the 6 folds is then treated as the "test set" for a model trained on the other 5 folds. The idea is that you end up using the whole dataset as a test set. Plus, you get 6 sets of performance metrics instead of just one.

I don't know much about tidymodels besides {recipes} (which I love). Will tidymodels let me do something like this, or should I just use {rsample} to create the 6 folds and then build a custom approach?
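For illustration, here is roughly what I would build by hand with {rsample}: create the folds from the full data and loop over them, fitting on the analysis set and scoring on the assessment set. This is only a sketch; the linear model, formula, and metric are placeholders.

library(rsample)
library(yardstick)

folds <- vfold_cv(mtcars, v = 6)

# One model per fold: train on the other 5 folds, test on the held-out fold
results <- lapply(folds$splits, function(split) {
  fit <- lm(mpg ~ ., data = analysis(split))    # training set = the other 5 folds
  holdout <- assessment(split)                  # "test set" = the held-out fold
  rmse_vec(holdout$mpg, predict(fit, holdout))  # one metric value per fold
})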

Zoltan

2 Answers


Can't comment yet, so I have to post this as a full answer, sorry. Anyway: the tidymodels intro itself refers to {rsample} as part of the tidymodels framework, so I guess that's the package for the task. Then, putting each resample's data into one row (by taking advantage of "list columns", i.e. a column holding a list or data frame per cell) might help keep things compact. Example: Fit a different model for each row of a list-columns data frame
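A minimal sketch of that list-column pattern, assuming a 6-fold split of mtcars and a plain linear model (both are just placeholders):

library(rsample)
library(dplyr)
library(purrr)

folds <- vfold_cv(mtcars, v = 6)

# Store one fitted model per row, next to its split, in a list column
folds <- folds %>%
  mutate(model = map(splits, ~ lm(mpg ~ ., data = analysis(.x))))

folds  # a tibble with columns: splits, id, model (a list column of lm fits)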

  • Thanks! I understand that rsample can create k folds, with one row per fold that defines the "analysis" set (the records not in the held-out fold) and the "assessment" set (the records in the held-out fold). The approach described in the (very good) link you provided creates one model for each of the folds. The cross-validation scheme I described actually needs to create k-1 models for each of the folds. I know how I would do this with for loops, but I'm wondering how easy it would be with tidymodels. – Zoltan Mar 01 '22 at 12:33
  • [`tune_grid()`](https://tune.tidymodels.org/reference/tune_grid.html) does what you're describing: it will fit each of the tuning combinations against each resampled data set (see the sketch after this comment). – Mark Rieke Mar 01 '22 at 12:49
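To make that concrete, here is a minimal tune_grid() sketch. The penalized regression model, formula, and grid size are illustrative assumptions, not from the thread, and the glmnet engine must be installed:

library(tidymodels)

folds <- vfold_cv(mtcars, v = 6)

# A model with one tunable hyperparameter (illustrative)
spec <- linear_reg(penalty = tune()) %>%
  set_engine("glmnet")

# Fits every candidate penalty on every fold's analysis set,
# scoring each fit on the corresponding assessment set
res <- tune_grid(spec, mpg ~ ., resamples = folds, grid = 5)
collect_metrics(res)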

Yes, if I am understanding you correctly:

library(rsample)

car_split <- initial_split(mtcars)
cars_train <- training(car_split)
cars_test <- testing(car_split)

## CV folds for training set
vfold_cv(cars_train, v = 6)
#> #  6-fold cross-validation 
#> # A tibble: 6 × 2
#>   splits         id   
#>   <list>         <chr>
#> 1 <split [20/4]> Fold1
#> 2 <split [20/4]> Fold2
#> 3 <split [20/4]> Fold3
#> 4 <split [20/4]> Fold4
#> 5 <split [20/4]> Fold5
#> 6 <split [20/4]> Fold6


## CV folds for whole original data set
vfold_cv(mtcars, v = 6)
#> #  6-fold cross-validation 
#> # A tibble: 6 × 2
#>   splits         id   
#>   <list>         <chr>
#> 1 <split [26/6]> Fold1
#> 2 <split [26/6]> Fold2
#> 3 <split [27/5]> Fold3
#> 4 <split [27/5]> Fold4
#> 5 <split [27/5]> Fold5
#> 6 <split [27/5]> Fold6

Created on 2022-03-11 by the reprex package (v2.0.1)

You can create CV folds from any dataset you want, either a training set or the whole original data set. You might check out this chapter of our book for more on "spending your data budget".
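Once you have the folds, fitting and scoring a model across all of them is one call to fit_resamples(); a sketch, assuming a plain linear model (any parsnip model and formula would work):

library(tidymodels)

folds <- vfold_cv(mtcars, v = 6)

# Fit on each analysis set (5 folds), evaluate on each assessment set (1 fold)
res <- fit_resamples(linear_reg(), mpg ~ ., resamples = folds)

collect_metrics(res)                      # metrics averaged over the 6 folds
collect_metrics(res, summarize = FALSE)   # one set of metrics per fold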

Julia Silge
  • Thanks for replying! I don't think this is what they/I mean. The way they use the data, 100% of it ends up being considered in a test set. Say we have 100 records in total. Instead of doing 80 training and 20 test, then splitting the 80 training records into 4 folds of 20, they split the 100 records into 5 folds of 20. Each fold of 20 is then considered a "test set", and the other 80 records form the training set for that test set. – Zoltan May 10 '22 at 13:59
  • I believe that is [just regular cross-validation](https://www.tmwr.org/resampling.html#cv) with the whole dataset as what you are splitting (not a training set). If we are misunderstanding each other, it might help to create a diagram or use a small dataset to demonstrate how you would like to split your data. – Julia Silge May 10 '22 at 17:49
  • @JuliaSilge coming back to this as more people might stumble upon it, and because I was also confused by the tidymodels example on nested CV. I assume there are different understandings of how nested CV should be performed, but what Zoltan describes is, from my perspective, a common understanding of nested CV, not regular cross-validation. The idea is that we repeat the test/training split multiple times until all of the data has been used, which ensures that performance estimates on the test set are not skewed by one particular sample. – Felix Phl Aug 12 '22 at 13:14
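For readers landing here looking for nested CV specifically: {rsample} has a nested_cv() helper; a minimal sketch (the fold counts are arbitrary):

library(rsample)

# Outer folds play the "test set" role; inner folds are for tuning
nested_cv(mtcars, outside = vfold_cv(v = 6), inside = vfold_cv(v = 5))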