I have a monthly (Jan - Dec) data set for weather and crop yield. This data is collected for multiple years (2002 - 2019). My aim is to obtain bootstrapped slope coefficient of the affect of temperature in each month on yield gap. In bootstrapping, I want to block the year information in a way that the function should randomly sample data from a specific year in each bootstrap rather than choosing rows from mixed years.
I read some blogs and tried different methods but I am not confident about those. I tried to disect the bootstrapped splits to ensure if I am doing it correctly but I was not.
Here is the starting code:
# Load libraries
library(readxl)
library(tidyverse)
library(tidymodels)
#> Registered S3 method overwritten by 'tune':
#> method from
#> required_pkgs.model_spec parsnip
library(reprex)
# data
ww_wt <- read_csv("https://raw.githubusercontent.com/MohsinRamay/yieldgap/main/ww_wt.csv")
#> New names:
#> * `` -> ...1
#> Rows: 1924 Columns: 20
#> -- Column specification --------------------------------------------------------
#> Delimiter: ","
#> chr (3): ID, Location, Month
#> dbl (16): ...1, Year, Latitude, Longitude, YieldTrt, YieldUntrt, Mildew, Ye...
#> date (1): Date
#>
#> i Use `spec()` to retrieve the full column specification for this data.
#> i Specify the column types or set `show_col_types = FALSE` to quiet this message.
ww_wt %>%
select(Year, Month, gap, temp) %>%
head()
#> # A tibble: 6 x 4
#> Year Month gap temp
#> <dbl> <chr> <dbl> <dbl>
#> 1 2002 September 0.282 13.6
#> 2 2002 October 0.282 13.3
#> 3 2002 November 0.282 7.07
#> 4 2002 December 0.282 3.44
#> 5 2002 January 0.282 5.61
#> 6 2002 February 0.282 6.93
# Bootstrapping
set.seed(123)
boots <- ww_wt %>%
ungroup() %>%
select(Year, Month, gap, temp) %>%
nest(data = -c(Month)) %>%
mutate(boots = map(data, ~bootstraps(.x, times = 100, apparent = FALSE))) %>%
unnest(boots) %>%
mutate(model = map(splits, ~lm(gap ~ temp, data = analysis(.))),
coefs = map(model, tidy))
Created on 2022-01-04 by the reprex package (v2.0.1)
I am nesting Months
because I want to get slopes for each month separately. Also, the data for each Year has different sample size n
because of different number of locations each year.