How to bootstrap dataset in R which is blocked by a factor?

Question

I want to perform bootstrap on this data set. Notice that the data has two factors: replicate and level, and two variables high.density and low.density that need to be regressed. I want to perform a bootstrap on this data-set but the replacements can occur only within the nested factor of replicate and level.

replicate level high.density low.density
    1     low    14          36
    1     low    54          31
    1     mid    82          10
    1     mid    24          NA
    2     low    12          28
    2     low    11          45
    2     mid    12          17
    2     mid    NA          24
    2      up    40          10
    2      up    NA           5
    2      up    20           2

For instance, in replicate/ level: 1/low the low.density 31 and 36 can be interchanged (or high.density interchanged) so the head of that dataset may look like:

replicate level high.density low.density
    1     low    14          31
    1     low    54          36
    1     mid    82          10
    1     mid    24          NA

I then want to estimate the linear regression (glm) from this dataset. I would appreciate any feedback on trying to achieve this.

##DATA FRAME (credits: caldwellst)

    df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), level = c("low", "low", "mid", "mid", "low", "low", "mid", "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df","tbl_df","tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(cols = list(replicate = structure(list(), class = c("collector_double", "collector")), level = structure(list(), class = c("collector_character","collector")), high.density = structure(list(), class = c("collector_double","collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", "collector")), skip = 1L), class = "col_spec"))
    
    df$replicate <- as.factor(as.numeric(df$replicate))
    df$level <- as.factor(as.character(df$level)

)

@Dion A plain old GLM. I tried the method you suggested but it doesn't seem to be working. — Rspacer, Jan 20 '22 at 14:11

Dion Groothof · Accepted Answer · 2022-01-20T21:36:10.987

We may exploit split and do the sampling according to unique combinations of replicate and level. We could repeat this process B times.

df_shuffle <- function(DF) {
  my_split <- split(DF, f = ~ DF$replicate + DF$level)
  shuffle <- lapply(my_split, function(x) {
    nrX <- nrow(x)
    cbind(x[, c('replicate', 'level')],
          high.density = x[sample(seq_len(nrX), replace = TRUE), 'high.density'],
          low.density = x[sample(seq_len(nrX), replace = TRUE), 'low.density'])
  })
  DF_new <- do.call(rbind, shuffle)
  rownames(DF_new) <- NULL
  return(DF_new)
}

B <- 1000L
df_list <- replicate(B, df_shuffle(df), simplify = FALSE)
# ---------------------------------------------------
> df_list[[B]]
   replicate level high.density low.density
1          1   low           54          36
2          1   low           54          36
3          2   low           12          45
4          2   low           12          28
5          1   mid           24          10
6          1   mid           82          10
7          2   mid           NA          17
8          2   mid           12          17
9          2    up           20          10
10         2    up           40          10
11         2    up           20           5

Because the original data contains missing observations, we either have to multiply impute them or opt to lisewise delete them. For now, let's perform the latter option.

# listwise delete missing observations
df_list <- lapply(df_list, function(x) x[complete.cases(x), ])

Finally, we perform a linear regression on each shuffled dataset and store the B coefficients in out.

row_bind <- function(x) data.frame(do.call(rbind, x))
out <- row_bind(
  lapply(df_list, function(x) lm(high.density ~ low.density, data = x)$coef)
)

## out <- row_bind(
##   lapply(df_list, function(x) glm(replicate ~ low.density, data = x,
##                            family = binomial())$coef)
## )

# -------------------------------------------------------------------
> dim(out)
[1] 1000    2

Output

> head(out)
  X.Intercept. low.density
1     13.74881   0.2804738
2     20.01074  -0.2095672
3     30.26643  -0.2946373
4     29.19541  -0.2752761
5     37.76273  -0.4555651
6     37.72250  -0.1548349

The code required to create this image can be found here.

I am unable to run the df_shuffle bit of the code and I'm using R (on R studio, it shouldn't matter). Also, the NA's are pretty important to the dataset. I have attached the original data in the answer now. — Rspacer, Jan 20 '22 at 14:10
After the new combinations has been generated I'm okay deleting the rows NAs for the individual datasets/ lists — Rspacer, Jan 20 '22 at 14:12
Unfortunately I am unable to run your script. I tried on an online R forum https://rdrr.io/snippets/ and it didn't work there either `Error: unexpected input in "df_shuffle <- \"` Are you sourcing the script or doing anything else? — Rspacer, Jan 20 '22 at 16:35
This almost certainly is due to the fact that you have an older version of `R`, in which case you should use `function(x)` instead of `\(x)`. I edited the post accordingly. Give it another try. — Dion Groothof, Jan 20 '22 at 16:53
Thank you it did! I am still not sure how to get the p-value/ coefficients after the bootstrapping. Maybe that's for a separate question. This was super helpful though. Thank you. — Rspacer, Jan 21 '22 at 12:28

score 0 · Answer 2 · answered Jan 20 '22 at 13:17

Here's a solution using dplyr, purrr, and tidyr. First nest the numeric data, and then sample each of the unique combinations of replicate and level in the data. Then within those, bootstrap the unique values of the densities and then unnest for final data frame.

# library(tidyverse)
library(dplyr)
library(tidyr)
library(purrr)

df %>%
  nest(data = ends_with("density")) %>%
  slice_sample(n = 500, replace = TRUE) %>%
  mutate(data = map(data, ~summarize(.x, across(.fns = sample, size = 1)))) %>%
  unnest(cols = data)
#> # A tibble: 500 × 4
#>    replicate level high.density low.density
#>        <dbl> <chr>        <dbl>       <dbl>
#>  1         1 low             54          31
#>  2         2 mid             12          24
#>  3         1 mid             24          10
#>  4         2 up              20           2
#>  5         2 mid             12          24
#>  6         2 mid             12          24
#>  7         1 mid             82          10
#>  8         2 up              NA           2
#>  9         1 low             14          36
#> 10         2 mid             12          17
#> # … with 490 more rows

Data

df <- structure(list(replicate = c(1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2), 
    level = c("low", "low", "mid", "mid", "low", "low", "mid", 
    "mid", "up", "up", "up"), high.density = c(14, 54, 82, 24, 
    12, 11, 12, NA, 40, NA, 20), low.density = c(36, 31, 10, 
    NA, 28, 45, 17, 24, 10, 5, 2)), class = c("spec_tbl_df", 
"tbl_df", "tbl", "data.frame"), row.names = c(NA, -11L), spec = structure(list(
    cols = list(replicate = structure(list(), class = c("collector_double", 
    "collector")), level = structure(list(), class = c("collector_character", 
    "collector")), high.density = structure(list(), class = c("collector_double", 
    "collector")), low.density = structure(list(), class = c("collector_double", 
    "collector"))), default = structure(list(), class = c("collector_guess", 
    "collector")), skip = 1L), class = "col_spec"))

Thank you. I tried the solution but I get this error `Error: Problem with `mutate()` column `data`. ℹ `data = map(data, ~summarize(.x, across(.fns = sample, size = 1)))`. x argument "FUN" is missing, with no default` — Rspacer, Jan 20 '22 at 13:23

How to bootstrap dataset in R which is blocked by a factor?

2 Answers2

Data

Linked