bootstrapped confidence intervals with dplyr

Question

Looking at the answer here How can I estimate bootstrapped intervals? This question was asked on the ggplot2 list as well.

library(dplyr)
mtcars %>%
group_by(vs) %>%
summarise(mean.mpg = mean(mpg, na.rm = TRUE),
    sd.mpg = sd(mpg, na.rm = TRUE),
    n.mpg = n()) %>%
mutate(se.mpg = sd.mpg / sqrt(n.mpg),
 lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
 upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

Ben Bolker · Answer 1 · 2019-03-20T00:13:53.717

7

The Hmisc package has a function smean.cl.boot to compute simple bootstrap confidence intervals easily. The hardest part (IMO) is incorporating the multiple outputs of this result (the function returns a 3-element numeric vector) into a dplyr workflow (see dplyr::mutate to add multiple values)

library(Hmisc)  ## optional if using Hmisc:: below
library(dplyr)
mtcars %>%
  group_by(vs) %>%
  do(data.frame(rbind(Hmisc::smean.cl.boot(.$mpg))))

The new columns are labeled just Mean, Lower, Upper, but an additional setNames call would fix that ...

If doing a lot of this,

bootf <- function(x,var="mpg") {
    newstuff <- rbind(Hmisc::smean.cl.boot(x[[var]])) %>%
         data.frame %>%
         setNames(paste(var,c("mean","lwr","upr"),sep="_"))
    return(newstuff)
}
mtcars %>% group_by(vs) %>% do(bootf(.))
mtcars %>% group_by(cyl) %>% do(bootf(.))

edited Mar 20 '19 at 00:13

answered Jul 24 '16 at 18:16

Ben Bolker

211,554
25
370
453

@eipi10; thanks, it works well, however, is there a way to see the dataframe created by sample? – Salvador Jul 25 '16 at 04:05
@shayaa; your code seems to work well, however, if I try to group by cyl for example, I get nan's on the final summarized dataframe. – Salvador Jul 25 '16 at 04:07
BenBolker; Nice and compact code but for some reason I am getting 32 records instead of just the summarized final 3. I am playing around with it but no luck yet. I added another group_by(cyl) at the end but it crashes. – Salvador Jul 25 '16 at 04:09
@Salvador see my updated answer. (For future reference, it's better to place comments beneath the answer to which they apply, rather than putting all your comments under one answer.) – eipi10 Jul 25 '16 at 17:09
Thanks Ben, that does what I need as well. – Salvador Jul 25 '16 at 19:58
While I appreciate the sentiment, StackOverflow deprecates [using comments to say "thank you"](http://meta.stackoverflow.com/questions/258004/should-thank-you-comments-be-flagged?lq=1); if this answer was useful you can upvote it (if you have sufficient reputation), and in any case if it answers your question satisfactorily you are encouraged to click the check-mark to accept it. – Ben Bolker Jul 25 '16 at 19:59
`smean.cl.boot` appears to have been moved to the `Hmisc` package. – Axeman Jun 07 '17 at 09:47

eipi10 · Answer 2 · 2017-10-13T00:52:05.497

ORIGINAL ANSWER: Bootstrapping a single column

The code below includes a simple bootstrapping function plus some additional code to return an informative data frame:

my_boot = function(x, times=1000) {

   # Get column name from input object
   var = deparse(substitute(x))
   var = gsub("^\\.\\$","", var)

  # Bootstrap 95% CI
  cis = quantile(replicate(times, mean(sample(x, replace=TRUE))), probs=c(0.025,0.975))

  # Return data frame of results
  data.frame(var, n=length(x), mean=mean(x), lower.ci=cis[1], upper.ci=cis[2])
}

mtcars %>%
  group_by(vs) %>%
  do(my_boot(.$mpg))

     vs    var     n     mean lower.ci upper.ci
  <dbl> <fctr> <int>    <dbl>    <dbl>    <dbl>
1     0    mpg    18 16.61667 15.14972 18.06139
2     1    mpg    14 24.55714 22.36357 26.80750

UPDATE: Bootstrapping any selection of columns

Based on your comments, here is an updated method to get bootsrapped confidence intervals for any selection of columns:

library(reshape2)
library(tidyr)

my_boot = function(x, times=1000) {

  # Bootstrap 95% CI
  cis = quantile(replicate(times, mean(sample(x, replace=TRUE))), probs=c(0.025,0.975))

  # Return results as a data frame
  data.frame(mean=mean(x), lower.ci=cis[1], upper.ci=cis[2])
}

mtcars %>%
  group_by(vs) %>%
  do(as.data.frame(apply(., 2, my_boot))) %>% 
  melt(id.var="vs") %>%
  separate(variable, sep="\\.", extra="merge", into=c("col","stat")) %>%
  dcast(vs + col ~ stat, value.var="value")

   vs  col    lower.ci        mean    upper.ci
1   0   am   0.1111111   0.3333333   0.5555556
2   0 carb   3.0000000   3.6111111   4.2777778
3   0  cyl   6.8888889   7.4444444   7.8888889
4   0 disp 262.3205556 307.1500000 352.4481944
5   0 drat   3.1877639   3.3922222   3.6011528
6   0 gear   3.2222222   3.5555556   3.9444444
7   0   hp 164.0500000 189.7222222 218.5625000
8   0  mpg  14.9552778  16.6166667  18.3225000
9   0 qsec  16.1888750  16.6938889  17.1744583
10  0   vs   0.0000000   0.0000000   0.0000000
11  0   wt   3.2929569   3.6885556   4.0880069
12  1   am   0.2142857   0.5000000   0.7857143
13  1 carb   1.2857143   1.7857143   2.3571429
14  1  cyl   4.1428571   4.5714286   5.0000000
15  1 disp 105.5703571 132.4571429 161.4657143
16  1 drat   3.5992143   3.8592857   4.1100000
17  1 gear   3.5714286   3.8571429   4.1428571
18  1   hp  79.7125000  91.3571429 103.2142857
19  1  mpg  21.8498214  24.5571429  27.3289286
20  1 qsec  18.7263036  19.3335714  20.0665893
21  1   vs   1.0000000   1.0000000   1.0000000
22  1   wt   2.2367000   2.6112857   2.9745571

Other updates to answer questions in the comments

UPDATE: To answer your comment to me in @BenBolker's answer: If you want the results returned by sample, you can do this:

boot.dat = replicate(1000, sample(mtcars$mpg[mtcars$vs==1], replace=TRUE))

This will return a matrix with 1000 columns, each of which will be a separate bootstrap sample of mtcars$mpg for vs==1. You could also do:

boot.by.vs = sapply(split(mtcars, mtcars$vs), function(df) {
   replicate(1000, sample(df$mpg, replace=TRUE))
}, simplify=FALSE)

This will return a list where the first list element is the matrix of bootstrap samples for vs==0 and the second is for vs==1.

UPDATE 2: To answer your second comment, here's how to bootstrap the whole data frame (and assuming you want to save all the copies, rather than summarise them. The code below returns a list of 1000 bootstrapped versions of mtcars1. This list will be huge if you have a lot of data, so you'll probably just want to keep summary results, like column means, for each bootstrap sample.

boot.df = lapply(1:1000, function(i) mtcars[sample(1:nrow(mtcars), replace=TRUE), ])

Thanks eipy10; it didn't allow me to put the comment right below your answer because the lack of privileges. No it allows me :)...Thanks for your reply. — Salvador, Jul 25 '16 at 18:14
eipi10; What if I want to bootstrap the whole dataframe?, say I want to bootstrap mtcars by cyl all at once and then use the output to estimate confidence intervals from that sample. — Salvador, Jul 25 '16 at 18:25
Thanks; it returns a list of n replications. Once I rbind all the lists together, I could get mean, sd, standard error and CI's correct? — Salvador, Jul 25 '16 at 18:54
Yes, but for bootstrapping, you want to take mean of each replication first, then combine them and find the 0.025 and 0.975 quantiles to get the 95%CI. — eipi10, Jul 25 '16 at 18:59

shayaa · Answer 3 · 2016-07-24T19:33:48.747

2

Using your code from above,

data.frame(boot=1:1000) %>%
  group_by(boot) %>% 
  do(sample_n(mtcars, nrow(mtcars), replace=TRUE)) %>%
  group_by(boot, vs) %>%
dplyr::summarise(mean.mpg = mean(mpg, na.rm = TRUE),
                 sd.mpg = sd(mpg, na.rm = TRUE),
                 n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.1 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.1 / 2), n.mpg - 1) * se.mpg) %>% 
    group_by(vs) %>% summarise_each(funs(mean), vars = -boot)

The answer is

# A tibble: 2 x 7
     vs mean.mpg   sd.mpg n.mpg   se.mpg lower.ci.mpg upper.ci.mpg
  <dbl>    <dbl>    <dbl> <dbl>    <dbl>        <dbl>        <dbl>
1     0 16.62142 3.679562 17.97 0.876537     15.09220     18.15063
2     1 24.53193 5.125643 14.03 1.388702     22.05722     27.00663

edited Jul 24 '16 at 19:33

answered Jul 24 '16 at 18:02

shayaa

2,787
13
19

I think you need to `group_by` `vs` before resampling. Also, it looks like your confidence intervals are too narrow. It looks like that's because you've calculated the CI using the classical CI based on all the bootstrap samples together, rather than getting the quantiles of the vector of means of the individual bootstrap resamples. – eipi10 Jul 24 '16 at 18:16
Edited. Actually @eipi, thanks for this. I am new to SO, but sometimes I wonder if it encourages people to just hack instead of looking at their question or solution for sensibility. – shayaa Jul 24 '16 at 19:33
fixing this was actually a bit of a challenge. I could've written by own bootstrap function, but I wanted to use the same code provided by the OP. – shayaa Jul 24 '16 at 19:35
1

Nevermind Shayaa. the dplyr::summarise(mean.mpg = mean(mpg, na.rm = TRUE), sd.mpg = sd(mpg, na.rm = TRUE), n.mpg = n()) was producing NaN's and therefore carrying it down to the summary. I just na.omit=TRUE right below that line of code. – Salvador Jul 25 '16 at 18:16

score 0 · Answer 4 · answered May 23 '23 at 01:12

0

This works for me:

data.frame(g = gl(2, 10, 20), x = rnorm(20)) %>% 
    dplyr::group_by(g) %>% 
    dplyr::summarize(result = Hmisc::smean.cl.boot(x) %>% t %>% as.data.frame) %>% 
    tidyr::unnest(result)

answered May 23 '23 at 01:12

ade

146
1
3

bootstrapped confidence intervals with dplyr

4 Answers4

ORIGINAL ANSWER: Bootstrapping a single column

UPDATE: Bootstrapping any selection of columns

Other updates to answer questions in the comments