Block Bootstrapping using Tidymodels

Question

I have a monthly (Jan - Dec) data set for weather and crop yield. This data is collected for multiple years (2002 - 2019). My aim is to obtain bootstrapped slope coefficient of the affect of temperature in each month on yield gap. In bootstrapping, I want to block the year information in a way that the function should randomly sample data from a specific year in each bootstrap rather than choosing rows from mixed years.

I read some blogs and tried different methods but I am not confident about those. I tried to disect the bootstrapped splits to ensure if I am doing it correctly but I was not.

Here is the starting code:

# Load libraries
library(readxl)
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(reprex)

# data 
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

ww_wt %>% 
  select(Year, Month, gap, temp) %>%
  head()
#> # A tibble: 6 x 4
#>    Year Month       gap  temp
#>   <dbl> <chr>     <dbl> <dbl>
#> 1  2002 September 0.282 13.6 
#> 2  2002 October   0.282 13.3 
#> 3  2002 November  0.282  7.07
#> 4  2002 December  0.282  3.44
#> 5  2002 January   0.282  5.61
#> 6  2002 February  0.282  6.93

# Bootstrapping
set.seed(123)

boots <- ww_wt %>% 
  ungroup() %>% 
  select(Year, Month, gap, temp) %>%
  nest(data = -c(Month)) %>% 
  mutate(boots = map(data, ~bootstraps(.x, times = 100, apparent = FALSE))) %>%
  unnest(boots) %>% 
  mutate(model = map(splits, ~lm(gap ~ temp, data = analysis(.))),
         coefs = map(model, tidy))

^{Created on 2022-01-04 by the reprex package (v2.0.1)}

I am nesting Months because I want to get slopes for each month separately. Also, the data for each Year has different sample size n because of different number of locations each year.

Can you edit this question to create a [reprex](https://www.tidyverse.org/help/), a [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), so we can provide more effective help? — Julia Silge, Jan 01 '22 at 18:45
One idea I have based on what you have shared is to use `strata` argument for `bootstraps()`, and stratify on your year variable (you may want to convert it from numeric to a factor so the stratification is within each year, rather than within quartiles of the year). You might also check out the [sliding resampling options](https://rsample.tidymodels.org/reference/slide-resampling.html), or the [CV options in timetk](https://business-science.github.io/timetk/reference/time_series_cv.html). — Julia Silge, Jan 01 '22 at 18:45
@JuliaSilge, I have edited the question and added the rendered reprex. — Mohsin Ramay, Jan 02 '22 at 21:21
That data that you're loading is from your own C:\ drive i.e. it can't be imported by anyone except you, making it not reproducible. Can you upload your data to a site, say, Github? To read data from Github, [here's an example](https://juliasilge.com/blog/spice-girls/#explore-data) — Desmond, Jan 04 '22 at 06:58
@Desmond, thanks for pointing it out. I have added the github link for the data and it should work now. Before that, I had added a OneDrive link for this data but, indeed, it was not a good approach. I have fixed it now. — Mohsin Ramay, Jan 04 '22 at 23:32
I'm not sure I understand what you are trying to do very well. You want to create bootstrap resamples for each single month/year combination? There are not many observations in each individual month. For example, there are only 9 observations in September 2002. — Julia Silge, Jan 06 '22 at 22:29
@JuliaSilge, I want to create bootstrap resamples for each single month. The resampling should be done in a way that rather than sampling random observations, it should sample data for a whole year and then rearrange it. For instance, I have ordered data from 2002, 2003, 2004,...,2020. But when resampled, lets say, it can be sampled and arranged as 2005, 2012, 2009,...2003 etc., for one bootstrap. Does this help? — Mohsin Ramay, Jan 07 '22 at 00:36
@BenBolker Yes. Here is the updated code: ` ww_wt %>% select(Year, Month, gap, temp) %>% group_bootstraps(group = "Year", times = 10, apparent = FALSE) %>% mutate(splits = map(splits, analysis)) %>% unnest(splits) ` — Mohsin Ramay, Mar 07 '23 at 01:45

score 3 · Accepted Answer · answered Jan 08 '22 at 23:03

Thanks for providing these hints Julia. I think I have figured this out by just adding some extra lines of code.

library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#>   method                   from   
#>   required_pkgs.model_spec parsnip
library(viridis)
#> Loading required package: viridisLite
#> 
#> Attaching package: 'viridis'
#> The following object is masked from 'package:scales':
#> 
#>     viridis_pal

ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.

set.seed(123)

# Block bootstrapping
boots <- ww_wt %>% 
  ungroup() %>% 
  select(Year, Month, gap, temp) %>%
  pivot_wider(names_from = Month, values_from = temp, values_fn = mean) %>% 
  bootstraps(times = 10, apparent = FALSE) %>% 
  mutate(splits = map(splits, analysis)) %>%
  unnest(splits) %>% 
  group_by(id) %>% 
  mutate(row = row_number()) %>% 
  pivot_longer(names_to = "Month", values_to = "temp", cols = September:August)

# Bootstraps
boots %>%
  group_by(id) %>% 
  ggplot(aes(x = id, y = row, fill = Year)) +
  geom_tile() +
  scale_fill_viridis(option = "B", direction = 1) +
  labs(x = NULL) +
  facet_wrap(~Month) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.4))

The above plots clearly indicates that, within each bootstrap, data from each Year is being randomly sampled with replacement.

^{Created on 2022-01-08 by the reprex package (v2.0.1)}

score 2 · Answer 2 · answered Jan 08 '22 at 04:19

We don't currently have support for grouped or blocked bootstrapping; we are tracking interest in more group-based methods here.

If you want to create a resampling scheme that holds out whole groups of data, you might check out group_vfold_cv() (maybe together with nested_cv()?) to see if it fits your needs in the meantime. It results in a resampling scheme that looks like this:

library(tidyverse)
library(tidymodels)

ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> ── Column specification ────────────────────────────────────────────────────────
#> Delimiter: ","
#> chr   (3): ID, Location, Month
#> dbl  (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date  (1): Date
#> 
#> ℹ Use `spec()` to retrieve the full column specification for this data.
#> ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

set.seed(123)
folds <-
  ww_wt %>%
  filter(Month == "September") %>%
  select(Year, Month, gap, temp) %>%
  group_vfold_cv(
    group = Year,
    v = 7
  )

folds
#> # Group 7-fold cross-validation 
#> # A tibble: 7 × 2
#>   splits           id       
#>   <list>           <chr>    
#> 1 <split [135/26]> Resample1
#> 2 <split [137/24]> Resample2
#> 3 <split [133/28]> Resample3
#> 4 <split [132/29]> Resample4
#> 5 <split [142/19]> Resample5
#> 6 <split [144/17]> Resample6
#> 7 <split [143/18]> Resample7

tidy(folds) %>%
  ggplot(aes(x = Resample, y = Row, fill = Data)) +
  geom_tile() + scale_fill_brewer()

^{Created on 2022-01-07 by the reprex package (v2.0.1)}

You can bump up v higher if you want, and you can nest by Month first to do this for each Month.

Hello! I am trying to find some methods to create resamples for time series data and chanced upon your answer to this post. Could you briefly tell me what's the difference between group v-fold cv and block bootstrap? I couldn't really find any information regarding group v-fold cv on the web but from your answer, it seems to me like they are pretty much the same? — Ethan Mark, Mar 16 '23 at 11:55
For time series, I think [you will want to use these kinds of methods](https://rsample.tidymodels.org/reference/slide-resampling.html). That is a bit different from grouped resampling, [which is discussed pretty nicely here](https://rsample.tidymodels.org/articles/Common_Patterns.html#grouped-resampling). — Julia Silge, Mar 17 '23 at 18:17

Block Bootstrapping using Tidymodels

2 Answers2