I have a grouped dataset and I am interested in summarising a column of counts (number of ___). To calculate the standard error for the summary, I want to bootstrap within groups and calculate the standard deviation of the medians. I am struggling to figure out how to code this manually (i.e., resampling with replacement myself rather than using functions like boot()) without using for loops (i.e., I am hoping for a purely tidyverse solution). If there is a way other than using *apply(), that would be preferred. Wrapping the whole process into a function would be great, either to be used in a pipeline with, say, summarise(), or as a standalone function that can be applied to the grouped data.
An ad hoc dataset can be mtcars, which I have grouped by gear. I am now interested in summarising the hp column using the median and also obtaining confidence intervals for the same. I have already attempted a bunch of solutions suggested by slightly related threads on SO, like replicate() + across(), map()/pmap(), etc., but couldn't get them to work for my specific case.
library(tidyverse)

data <- mtcars %>%
  select(gear, hp) %>%
  group_by(gear)
> data
# A tibble: 32 x 2
# Groups:   gear [3]
    gear    hp
   <dbl> <dbl>
 1     4   110
 2     4   110
 3     4    93
 4     3   110
 5     3   175
 6     3   105
 7     3   245
 8     4    62
 9     4    95
10     4   123
# ... with 22 more rows
I am hoping for a way to integrate the bootstrap results with the simple summarisation as another column (SEs per group):
data2 <- data %>%
  summarise(hp = median(hp))
While it may not make much sense to summarise horsepower by number of gears, and the distribution of hp might not be a typical Poisson, I think the coding solution for this example will apply to my specific case nonetheless.
EDIT 1
The solution need not be a clean and robust function. It can be just the lines of code required to obtain the bootstrapped SE value in each group for this specific case. The desired output is just the data2 object, where hp is the column of medians and hpse is the column of SEs.
data2 <- data %>%
  summarise(hp = median(hp),
            ### hpse = workingcode()
            )
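For illustration, one way the placeholder could be filled in is with a small helper that resamples a vector with replacement, takes the median of each resample, and returns the standard deviation of those medians. This is only a sketch: the name boot_se_median and the default of 1000 resamples are hypothetical choices, not functions or settings from any existing package.

```r
library(dplyr)
library(purrr)

# Hypothetical helper (not from any package): bootstrap SE of the median.
# n_boot = 1000 is an arbitrary choice made for this sketch.
boot_se_median <- function(x, n_boot = 1000) {
  boot_medians <- map_dbl(seq_len(n_boot), function(i) {
    # Resample indices with replacement, keeping the original sample size
    median(x[sample.int(length(x), replace = TRUE)])
  })
  # SD of the bootstrap medians is the bootstrap estimate of the SE
  sd(boot_medians)
}

set.seed(123)  # makes the resampling reproducible

data2 <- mtcars %>%
  select(gear, hp) %>%
  group_by(gear) %>%
  summarise(
    # hpse comes first: once hp is reassigned below, `hp` would refer
    # to the single group median rather than the original column
    hpse = boot_se_median(hp),
    hp = median(hp)
  ) %>%
  select(gear, hp, hpse)
```

Because summarise() evaluates its arguments per group, the helper only ever sees the hp values of one gear group at a time, which is what makes the resampling happen within groups.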
If it is not possible to do it directly this way inside the summarise() call, it must at least be possible to later join the values to data2.
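A rough sketch of the join route, assuming dplyr >= 1.0.0 so that slice_sample() is available: on a grouped data frame, slice_sample(prop = 1, replace = TRUE) draws a same-sized resample within each group, so repeating it and summarising gives one bootstrap median per group per replicate. The names n_boot and boot_se are hypothetical, and 500 replicates is an arbitrary number chosen to keep the example quick.

```r
library(dplyr)
library(purrr)

set.seed(123)   # makes the resampling reproducible
n_boot <- 500   # assumed number of bootstrap replicates

data <- mtcars %>%
  select(gear, hp) %>%
  group_by(gear)

# One row per group per replicate, each holding a bootstrap median
boot_se <- map(seq_len(n_boot), function(i) {
  data %>%
    slice_sample(prop = 1, replace = TRUE) %>%  # resamples within each gear group
    summarise(med = median(hp))
}) %>%
  bind_rows() %>%
  group_by(gear) %>%
  summarise(hpse = sd(med))  # SD of bootstrap medians = bootstrap SE

# Plain medians, with the bootstrap SEs joined on afterwards
data2 <- data %>%
  summarise(hp = median(hp)) %>%
  left_join(boot_se, by = "gear")
```

This version resamples whole data frames, so it is slower than working on a single vector, but it avoids writing any indexing by hand.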
Related threads
Using boot()
How to perform a bootstrap and find 95% confidence interval for the median of a dataset
Bootsrapping a statistic in a nested data column and retrieve results in tidy format