0

I am running a model using tidymodels, where split the data by group and run regressions on each individual dataframe. This works well. However, now I also need to bootstrap my results. I'm not sure how to build this into my existing code.

My original code looks something like this:

library(dplyr)

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- data.frame(year=year, group=group, value=value, female=female, smoker=smoker)

# cut the dataset into list
table_list <- dta %>%
  group_by(year, group) %>%
  group_split()

# fit model per subgroup
model_list <- lapply(table_list, function(x) glm(smoker ~ female, data=x,
                                                 family=binomial(link="probit")))
# predict
pred_list <- lapply(model_list, function(x) predict.glm(x, type = "response"))

I would like to bootstrap with replacement to obtain the bootstrapped predicted values. My gut feeling is that I should split the dataset further by creating random samples when I create the table_list. But how exactly do I do that?

Thanks for your help.

camille
  • 16,432
  • 18
  • 38
  • 60
Stata_user
  • 562
  • 3
  • 14
  • Did you consider the [`boot` package](https://www.rdocumentation.org/packages/boot/versions/1.3-28/topics/boot)? – dario Oct 29 '21 at 12:59
  • Does this answer your question? [Bootstrap resampling and tidy regression models with grouped/nested data](https://stackoverflow.com/questions/68946020/bootstrap-resampling-and-tidy-regression-models-with-grouped-nested-data) – dario Oct 29 '21 at 13:00
  • @dario: This is a great start, but at the end I need the actual predicted values. Thanks. – Stata_user Nov 01 '21 at 15:42
  • 1
    *"This is a great start, but at the end I need the actual predicted values"* .. I don't understand why you think the answers presented in the two links above would not give predictions, that's kind of the reason for any bootstrap procedure ;). The packages just help you with the sampling (by the way, fyi:, bootstrapping is **by definition** sampling with replacement ;) ) – dario Nov 01 '21 at 16:28
  • @dario: I'm not sure we are talking about the same thing. I need to calculate the predicted(fitted) values for each bootstrapped sample. I'm sure I could get there using the links above, but I just don't know how to go about it and there is little in the documentation on how to predict using the boot package. – Stata_user Nov 01 '21 at 19:00

1 Answers1

1

This is fairly complex, with the grouping and the bootstrapping, so I would probably approach it like this, using map() two layers deep:

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip

year <- rep(2014:2018, length.out=10000)
group <- sample(c(0,1,2,3,4,5,6), replace=TRUE, size=10000)
value <- sample(10000, replace=T)
female <- sample(c(0,1), replace=TRUE, size=10000)
smoker <- sample(c(0,1), replace=TRUE, size=10000)
dta <- tibble(year=year, group=group, value=value, female=female, smoker=smoker)


glm_boot_mods <- 
  dta %>%
  nest(data = c(-year, -group)) %>%
  mutate(boots = map(
    data,  
    ~ bootstraps(., times = 20) %>%
      mutate(model = map(.$splits, ~ glm(smoker ~ female, data = analysis(.x),
                                         family = binomial(link = "probit"))),
             preds = map2(model, .$splits, ~predict(.x, newdata = assessment(.y))))
    ))


glm_boot_mods
#> # A tibble: 35 × 4
#>     year group data               boots                
#>    <int> <dbl> <list>             <list>               
#>  1  2014     1 <tibble [288 × 3]> <bootstraps [20 × 4]>
#>  2  2015     4 <tibble [273 × 3]> <bootstraps [20 × 4]>
#>  3  2016     3 <tibble [301 × 3]> <bootstraps [20 × 4]>
#>  4  2017     2 <tibble [282 × 3]> <bootstraps [20 × 4]>
#>  5  2018     0 <tibble [276 × 3]> <bootstraps [20 × 4]>
#>  6  2014     3 <tibble [279 × 3]> <bootstraps [20 × 4]>
#>  7  2016     2 <tibble [314 × 3]> <bootstraps [20 × 4]>
#>  8  2018     1 <tibble [296 × 3]> <bootstraps [20 × 4]>
#>  9  2014     0 <tibble [304 × 3]> <bootstraps [20 × 4]>
#> 10  2015     6 <tibble [288 × 3]> <bootstraps [20 × 4]>
#> # … with 25 more rows

The first map() creates the bootstrap resamples for each grouping, and then we go one layer deeper and for each resample fit a model and predict for the heldout observations for that resample. You can see what that looks like inside here for the first group:

glm_boot_mods %>%
  head(1) %>% 
  pull(boots)
#> [[1]]
#> # Bootstrap sampling 
#> # A tibble: 20 × 4
#>    splits            id          model  preds      
#>    <list>            <chr>       <list> <list>     
#>  1 <split [288/111]> Bootstrap01 <glm>  <dbl [111]>
#>  2 <split [288/93]>  Bootstrap02 <glm>  <dbl [93]> 
#>  3 <split [288/103]> Bootstrap03 <glm>  <dbl [103]>
#>  4 <split [288/106]> Bootstrap04 <glm>  <dbl [106]>
#>  5 <split [288/109]> Bootstrap05 <glm>  <dbl [109]>
#>  6 <split [288/109]> Bootstrap06 <glm>  <dbl [109]>
#>  7 <split [288/92]>  Bootstrap07 <glm>  <dbl [92]> 
#>  8 <split [288/111]> Bootstrap08 <glm>  <dbl [111]>
#>  9 <split [288/99]>  Bootstrap09 <glm>  <dbl [99]> 
#> 10 <split [288/111]> Bootstrap10 <glm>  <dbl [111]>
#> 11 <split [288/102]> Bootstrap11 <glm>  <dbl [102]>
#> 12 <split [288/104]> Bootstrap12 <glm>  <dbl [104]>
#> 13 <split [288/115]> Bootstrap13 <glm>  <dbl [115]>
#> 14 <split [288/111]> Bootstrap14 <glm>  <dbl [111]>
#> 15 <split [288/108]> Bootstrap15 <glm>  <dbl [108]>
#> 16 <split [288/110]> Bootstrap16 <glm>  <dbl [110]>
#> 17 <split [288/110]> Bootstrap17 <glm>  <dbl [110]>
#> 18 <split [288/111]> Bootstrap18 <glm>  <dbl [111]>
#> 19 <split [288/103]> Bootstrap19 <glm>  <dbl [103]>
#> 20 <split [288/109]> Bootstrap20 <glm>  <dbl [109]>

Created on 2021-11-02 by the reprex package (v2.0.1)

Notice that there are predictions for the heldout observations for each resample. Depending on what you want to do, you can use unnest() on the columns of glm_boot_mods you need to handle next.

Julia Silge
  • 10,848
  • 2
  • 40
  • 48