
I have a grouped dataset and I am interested in summarising a column of counts (number of ___). To calculate the standard error for the summary, I want to bootstrap within groups and calculate the standard deviation of the medians. I am struggling to figure out how to code this manually (resampling with replacement, not functions like boot()) without using for loops, i.e., I am hoping for a purely tidyverse solution. If there is a way other than using *apply(), that would be preferred. Wrapping the whole process into a function would be great---either to be used in a pipeline with, say, summarise(), or as a standalone function that can be applied to the grouped data.

An ad hoc dataset can be mtcars, which I have grouped by gear. I am now interested in summarising the hp column using the median and also obtaining confidence intervals for it. I have already attempted a bunch of solutions suggested by slightly related threads on SO, like replicate()+across(), map()/pmap(), etc., but couldn't get them to work for my specific case.

library(tidyverse)

data <- mtcars %>% 
  select(gear, hp) %>% 
  group_by(gear)
> data
# A tibble: 32 x 2
# Groups:   gear [3]
    gear    hp
   <dbl> <dbl>
 1     4   110
 2     4   110
 3     4    93
 4     3   110
 5     3   175
 6     3   105
 7     3   245
 8     4    62
 9     4    95
10     4   123
# ... with 22 more rows

I am hoping for a way to integrate the bootstrap results with the simple summarisation as another column (SEs per group):

data2 <- data %>% 
  summarise(hp = median(hp))

While it may not make much sense to summarise horsepower by number of gears, and the distribution of hp might not be a typical Poisson, I think the coding solution for this example will apply to my specific case nonetheless.

EDIT 1

The solution need not be a clean and robust function. It can be just the lines of code required to obtain the bootstrapped SE value in each group for this specific case. The desired output is just the data2 object, where hp is the column of medians and hpse is the column of SEs.


    data2 <- data %>% 
      summarise(hp = median(hp),
                ### hpse = workingcode()
                )

If it is not possible to do this directly inside the summarise() call, it should at least be possible to join the values to data2 afterwards.
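
For concreteness, a rough sketch of that fallback route (the inline base-R bootstrap here is only a placeholder, and boot_reps is an illustrative name, not something settled on):

boot_reps <- 1000

se_tbl <- data %>% 
  summarise(hpse = sd(replicate(boot_reps, 
                                median(sample(hp, length(hp), replace = TRUE)))))

data2 <- data %>% 
  summarise(hp = median(hp)) %>% 
  left_join(se_tbl, by = "gear")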

Related threads

  • Using boot()
  • Using *apply()
  • Using for loop
  • Others

  • I'm confused about saying you don't want `*apply()` functions but don't mind the purrr `map()` family. They are mostly equivalent - a solution using base can easily be updated to a tidyverse style. Furthermore, a solution would likely involve something like `map(1:B, ...)` which is the same as a for loop. – kybazzi Jan 06 '22 at 06:44
  • @kybazzi Yes, I tried the `map()` family but couldn't figure out how to get it working for my case, so I am looking for other methods. The main reason I don't want `*apply()` functions is because of the different data object types that such a method involves, but I am currently considering it as one option. I wasn't clear enough in the post, but I am open to these methods if they can be adapted to a tidyverse style, but `for` loops I certainly wish to avoid (if possible). (Will edit post to make this clear.) – Karthik Thrikkadeeri Jan 06 '22 at 06:59
  • An additional point for clarification is what exactly the solution should look like. As it stands this question seems a little too vague or open-ended - it can be a whole project to build a clean bootstrapping function that can be applied to a pipeline. Can you show an example of the desired output? – kybazzi Jan 06 '22 at 07:01
  • @kybazzi I have edited post with the clarifications. I hope it helps. The desired output is simply the column of SEs calculated for every group using sampling within the group with replacement. Do let me know in case of further queries. Thanks! – Karthik Thrikkadeeri Jan 06 '22 at 07:20
  • Thanks - your edit did clarify it. I've left you an answer and am happy to respond to address any questions. – kybazzi Jan 08 '22 at 22:43

2 Answers


First we can make a bootstrap function:

boot_fn = function(x, fn = median, B = 1000) {
  1:B %>%
    # For each iteration, generate a sample of x with replacement
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # Obtain the fn estimate for each bootstrap sample
    map_dbl(fn) %>%
    # Obtain the standard error
    sd()
}

Note how I gave the parameter fn a default value of median; this lets you pass any summary function that returns a single number into boot_fn().
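
For instance (just an illustration, not part of the question's requirements), the same function can bootstrap the SE of other statistics:

boot_fn(mtcars$hp, fn = mean)  # bootstrap SE of the mean
boot_fn(mtcars$hp, fn = IQR)   # bootstrap SE of the interquartile range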

Now we can use the function as you originally asked:

mtcars %>% 
  group_by(gear) %>%
  summarise(
    hp_median = median(hp), 
    se = boot_fn(hp, fn = median)
  )

# A tibble: 3 x 3
   gear hp_median    se
  <dbl>     <dbl> <dbl>
1     3       180  13.2
2     4        94  15.2
3     5       175  70.3

This works because the data is grouped: for each group, a separate x is sent to boot_fn(). In this case, three different values of x were passed, each containing the hp values for one value of gear.

This is easy to confirm if we just add a cat() statement in our function:

boot_fn = function(x, fn = median, B = 1000, verbose = FALSE) {
  if (verbose) cat("Hello, x is ", x, "\n")
  1:B %>%
    # For each iteration, generate a sample of x with replacement
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # Obtain the fn estimate for each bootstrap sample
    map_dbl(fn) %>%
    # Obtain the standard error
    sd()
}

data %>%
  summarise(
    hp_median = median(hp), 
    se = boot_fn(hp, fn = median, verbose = TRUE)
  )

Output:

Hello, x is 110 175 105 245 180 180 180 205 215 230 97 150 150 245 175 
Hello, x is 110 110 93 62 95 123 123 66 52 65 66 109 
Hello, x is 91 113 264 175 335 
# A tibble: 3 x 3
   gear hp_median    se
  <dbl>     <dbl> <dbl>
1     3       180  13.5
2     4        94  14.9
3     5       175  69.6

This function may break down when used on real-world data (due to things like NAs), but this is a good start.
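
For example, a minimal NA-tolerant variant (boot_fn_na is a hypothetical name, and it simply assumes that dropping missing values is acceptable) could look like:

boot_fn_na = function(x, fn = median, B = 1000) {
  # Drop missing values up front so every bootstrap sample is complete
  x <- x[!is.na(x)]
  1:B %>%
    map(~ x[sample(seq_along(x), replace = TRUE)]) %>%
    map_dbl(fn) %>%
    sd()
}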

kybazzi
  • Thanks for this, it works for the current purpose. Is there any particular reason you used `map()` in the first step and not `map_dbl()`? (The latter can take vector inputs too, right?) – Karthik Thrikkadeeri Jan 10 '22 at 07:21
  • Additionally, I know this is going beyond the scope of my original question, but is it possible to briefly compare the performance of your solution with something like: `boot_se <- function(x, fn = median, B = 1000){replicate(B, do.call("fn", list(sample(x, n(), replace = T))), simplify = F) %>% unlist() %>% sd()}`? – Karthik Thrikkadeeri Jan 10 '22 at 07:25
  • Regarding `map()` vs. `map_dbl()`: I recommend you update the pipeline to `map()` and run until the end of just that line to see the difference. `map_dbl()` returns a numeric vector - each element represents the median of the bootstrap sample from 1 to `B`. Using `map()` would instead give a list, but again each element being the median of the bootstrap samples. In this case, we want to use `map_dbl()` to feed into `sd()`. – kybazzi Jan 10 '22 at 15:00
  • Regarding performance - you could try doing that yourself and I'll be happy to answer any questions if things don't work. In that code you sent, you probably need to use `sample(x, length(x), replace = TRUE)` since `n()` isn't available outside a dplyr pipeline. Here's an example of how to benchmark two functions: https://stackoverflow.com/questions/22515525/is-it-possible-to-speed-up-my-function-for-creating-a-correlation-matrix#22518603 – kybazzi Jan 10 '22 at 15:40
  • Actually, I meant the opposite, i.e., using `map_dbl()` in both steps. It returns a vector but it can also take a vector input (so shouldn't be an issue in the second step) at least from what I can understand from the documentation. However, I tried modifying the function as such and I am getting an error: `x Result 1 must be a single double, not a double vector of length 15` (this is with my actual data so length is different). I don't understand why this is happening, because the function is supposed to be able to take either lists or vectors as input. – Karthik Thrikkadeeri Jan 11 '22 at 01:08
  • I used `n()` because my current purpose is to use it solely within a pipeline. I checked performance with `system.time()` rather than `microbenchmark()` because my actual data is large and replications would take time. Your function is quicker. Could this be because the `1:B` is similar to how loops work, while `replicate()` is similar to `*apply()`? And lastly, any consideration aside from speed (and, in my case, applicability outside a pipeline) that may influence the choice of function? – Karthik Thrikkadeeri Jan 11 '22 at 01:59
  • Regarding `map()` vs. `map_dbl()`: you have to use `map()` there because each replication from 1 to `B` needs to have its own sample of `x`. This can only be stored in a list, which is the output of `map()`. On the other hand, `map_dbl()` should be used when each replication's output is a single number. That's why it's used in the next line - each replication from 1 to `B` calculates a median from its own sample, and we save the results in a single vector given by `map_dbl()`. Let me know if that's still unclear (see the short illustration after this comment thread). – kybazzi Jan 11 '22 at 02:50
  • In terms of the best solution, speed-wise functions like `boot()` which allow for parallel processing are probably best. Other considerations really depend on your specific use-case. – kybazzi Jan 11 '22 at 02:54
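
A minimal illustration of the map() vs map_dbl() distinction discussed in these comments (the vector x and the number of replications here are made up for the demo):

x <- c(110, 110, 93, 175, 105)

# map() keeps each resample as its own vector, so the result is a list
resamples <- map(1:3, ~ x[sample(seq_along(x), replace = TRUE)])
str(resamples)

# map_dbl() requires each result to be a single number, which suits the medians
map_dbl(resamples, median)

# Using map_dbl() for the resampling step instead would fail with the error the
# OP saw: "Result 1 must be a single double, not a double vector of length 5"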

An alternative to @kybazzi's solution that fits in the pipeline workflow is this:

boot_se <- function(x, fn = median, B = 100){
  # For each of the B replications, resample x with replacement and apply fn;
  # note that n() is only available inside a dplyr verb such as summarise()
  replicate(B,
            do.call("fn", list(sample(x, n(), replace = T))),
            simplify = F) %>% 
    # Collapse the list of estimates and take their SD (the bootstrap SE)
    unlist() %>% 
    sd()
}

It seems to be slower at times:


boot_fn = function(x, fn = median, B = 100) {
  1:B %>%
    # For each iteration, generate a sample of x with replacement
    map(~ x[sample(1:length(x), replace = TRUE)]) %>%
    # Obtain the fn estimate for each bootstrap sample
    map_dbl(fn) %>%
    # Obtain the standard error
    sd()
}


data <- mtcars %>% 
  select(gear, hp) %>% 
  group_by(gear)

data2 <- data %>% 
  summarise(hpmed = median(hp),
            hpse = boot_se(hp))

data3 <- data %>% 
  summarise(hpmed = median(hp),
            hpse = boot_fn(hp))

#######################################

library(microbenchmark)

microbenchmark((data %>% 
                 summarise(hpmed = median(hp),
                           hpse = boot_fn(hp))),
               (data %>% 
                  summarise(hpmed = median(hp),
                            hpse = boot_se(hp))))

# Output:

Unit: milliseconds
                                                          expr     min       lq     mean   median       uq     max neval
  (data %>% summarise(hpmed = median(hp), hpse = boot_fn(hp))) 14.5737 15.63690 22.23120 16.78140 25.85675 91.4154   100
  (data %>% summarise(hpmed = median(hp), hpse = boot_se(hp))) 20.6675 21.64715 29.15338 22.68525 32.01430 87.6299   100

#######################################

(Note that this second benchmark only compares looking up the already-computed data2 and data3 objects, hence the nanosecond timings below; it does not re-run the bootstrap itself.)

microbenchmark(data2, data3, times = 1000)

# Output:

Unit: nanoseconds
  expr min    lq   mean median  uq  max neval
 data2   0 100.0 95.986    101 101 3501  1000
 data3   0   1.5 92.318    101 101 2700  1000