I'm trying to understand why the rsample::bootstraps function apparently stores the entire data set for each bootstrap sample. I expected the function to store the data set just once, along with the bootstrap indices for each resample. The output below shows the basic structure, which is repeated for each resample:
> set.seed(1)
> test <- rsample::bootstraps(mtcars[, 1:3], times = 2)
> str(test)
bootstraps [2 × 2] (S3: bootstraps/rset/tbl_df/tbl/data.frame)
$ splits:List of 2
..$ :List of 4
.. ..$ data :'data.frame': 32 obs. of 3 variables:
.. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
.. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
.. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
.. ..$ in_id : int [1:32] 25 4 7 1 2 29 23 11 14 18 ...
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Bootstrap1"
.. ..- attr(*, "class")= chr [1:2] "rsplit" "boot_split"
..$ :List of 4
.. ..$ data :'data.frame': 32 obs. of 3 variables:
.. .. ..$ mpg : num [1:32] 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ...
.. .. ..$ cyl : num [1:32] 6 6 4 6 8 6 8 4 4 6 ...
.. .. ..$ disp: num [1:32] 160 160 108 258 360 ...
.. ..$ in_id : int [1:32] 25 12 15 1 20 3 6 10 10 6 ...
.. ..$ out_id: logi NA
.. ..$ id : tibble [1 × 1] (S3: tbl_df/tbl/data.frame)
.. .. ..$ id: chr "Bootstrap2"
.. ..- attr(*, "class")= chr [1:2] "rsplit" "boot_split"
$ id : chr [1:2] "Bootstrap1" "Bootstrap2"
- attr(*, "times")= num 2
- attr(*, "apparent")= logi FALSE
- attr(*, "strata")= logi FALSE
The $data item appears to be repeated for every resample, while the resample indices, which do vary, are stored in in_id. The obvious cost is that the size of the object grows in proportion to the data size times the number of resamples. With a single resample, object.size(test) reports 7800 bytes; with 200 resamples it reports 1236824 bytes.
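For reference, here is a minimal sketch of how the size comparison can be reproduced. The object names one and many are only illustrative, and the exact byte counts may differ between R and rsample versions. The lobstr::obj_size() call is an optional cross-check that was not part of the measurements above and assumes the lobstr package is installed. Unlike object.size(), it counts objects shared between list elements only once.

```r
library(rsample)

set.seed(1)
# Illustrative objects: the same data resampled once and 200 times
one  <- bootstraps(mtcars[, 1:3], times = 1)
many <- bootstraps(mtcars[, 1:3], times = 200)

# Sizes as reported by utils::object.size(), converted to plain numbers
size_one  <- as.numeric(object.size(one))
size_many <- as.numeric(object.size(many))
print(c(one = size_one, many = size_many))

# Each split carries a `data` element; check that two of them hold the same contents
same_data <- identical(many$splits[[1]]$data, many$splits[[2]]$data)
print(same_data)

# Optional cross-check (assumption: lobstr is installed)
if (requireNamespace("lobstr", quietly = TRUE)) {
  print(lobstr::obj_size(many))
}
```

If the two reported sizes for many differ a lot, that would suggest the repeated $data entries are not all separate copies in memory.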