16

Last time I asked how it was possible to calculate the average score per measurement occasion (week) for a variable (procras) that has been measured repeatedly for multiple respondents. So my (simplified) dataset in long format looks for example like the following (here two students, and 5 time points, no grouping variable):

studentID  week   procras
   1        0     1.4
   1        6     1.2
   1        16    1.6
   1        28    NA
   1        40    3.8
   2        0     1.4
   2        6     1.8
   2        16    2.0
   2        28    2.5
   2        40    2.8

Using dplyr I would get the average score per measurement occasion

mean_data <- group_by(DataRlong, week)%>% summarise(procras = mean(procras, na.rm = TRUE))

Looking like this e.g.:

Source: local data frame [5 x 2]
        occ  procras
      (dbl)    (dbl)
    1     0 1.993141
    2     6 2.124020
    3    16 2.251548
    4    28 2.469658
    5    40 2.617903

With ggplot2 I could now plot the average change over time, and by easily adjusting the group_data() of dplyr I could also get means per sub groups (for instance, the average score per occasion for men and women). Now I would like to add a column to the mean_data table which includes the length for the 95%-CIs around the average score per occasion.

http://www.cookbook-r.com/Graphs/Plotting_means_and_error_bars_(ggplot2)/ explains how to get and plot CIs, but this approach seems to become problematic as soon as I wanted to do this for any subgroup, right? So is there a way to let dplyr also include the CI (based on group size, ect.) automatically in the mean_data? After that it should be fairly easy to plot the new values as CIs into the graphs I hope. Thank you.

Rasul89
  • 588
  • 2
  • 5
  • 14

6 Answers6

30

You could do it manually using mutate a few extra functions in summarise

library(dplyr)
mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg,
         upper.ci.mpg = mean.mpg + qt(1 - (0.05 / 2), n.mpg - 1) * se.mpg)

#> Source: local data frame [2 x 7]
#> 
#>      vs mean.mpg   sd.mpg n.mpg    se.mpg lower.ci.mpg upper.ci.mpg
#>   (dbl)    (dbl)    (dbl) (int)     (dbl)        (dbl)        (dbl)
#> 1     0 16.61667 3.860699    18 0.9099756     14.69679     18.53655
#> 2     1 24.55714 5.378978    14 1.4375924     21.45141     27.66287
sboysel
  • 631
  • 7
  • 17
  • Thank you, that almost worked perfectly for me and I'm also able to plot the CI with ggplot. The only problem I have is, that n.mpg = n()) always gives me the same number, the total number of participants (n=566), regardless of whether they are missing or not. Due to the longitudinal design, drop out occurred, so using the total n makes the CI inaccurate as SE and the df are wrong. I tried to fix that by subtracting 'sum(as.numeric(is.na(DataRlong$procras)))' from the n() argument, but that would subtract the total number of missing cases across all occasion. – Rasul89 Mar 12 '16 at 14:03
  • How could I tell r only to subtract cases from n that were missing at the respective measurement occasion? – Rasul89 Mar 12 '16 at 14:04
  • There may be a better way to do this, but I have defined my own function to count the number of complete observations in the past. You could define a function `nobs <- function(x) length(x[!is.na(x)])` and replace `n()` with `nobs(procras)`. – sboysel Mar 12 '16 at 20:06
14

I use the ci command from the gmodels package:

library(gmodels)
your_db %>% group_by(gouping_variable1, grouping_variable2, ...)
        %>% summarise(mean = ci(variable_of_interest)[1], 
                      lowCI = ci(variable_of_interest)[2],
                      hiCI = ci(variable_of_interest)[3], 
                      sd = ci (variable_of_interest)[4])
carfisma
  • 307
  • 3
  • 7
4

If you want to use the versatility of the boot package, I found this blog post useful (the code below is inspired from there)

library(dplyr)
library(tidyr)
library(purrr)
library(boot)

set.seed(321)
mtcars %>%
  group_by(vs) %>%
  nest() %>% 
  mutate(boot_res = map(data,
                        ~ boot(data = .$mpg,
                               statistic = function(x, i) mean(x[i]),
                               R = 1000)),
         boot_res_ci = map(boot_res, boot.ci, type = "perc"),
         mean = map(boot_res_ci, ~ .$t0),
         lower_ci = map(boot_res_ci, ~ .$percent[[4]]),
         upper_ci = map(boot_res_ci, ~ .$percent[[5]]),
         n =  map(data, nrow)) %>% 
  select(-data, -boot_res, -boot_res_ci) %>% 
  unnest(cols = c(n, mean, lower_ci, upper_ci)) %>% 
  ungroup()
#> # A tibble: 2 x 5
#>      vs  mean lower_ci upper_ci     n
#>   <dbl> <dbl>    <dbl>    <dbl> <int>
#> 1     0  16.6     15.0     18.3    18
#> 2     1  24.6     22.1     27.3    14

Created on 2020-01-22 by the reprex package (v0.3.0)

Some explanation of the code:

When nesting with nest(), a list column (called by default data) is created, which contains 2 data frames, being the 2 subsets of the entire mtcars grouped by vs (which contains 2 unique values, 0 and 1). Then, using mutate() and map(), we create the list column boot_res by applying the function boot() from the boot package to the list column data. Then boot_res_ci list column is created by applying the boot.ci() function to boot_res list column and so on. With select(), we drop the list columns that are not needed anymore, fallowed by unnesting and ungrouping the final results.

The code is unfortunately not easy to navigate but it serves the purpose of another example.

Using broom::tidy()

Just realized that the package broom has an implementation of a method to deal with boot() outputs as pointed out here. This makes the code a bit less verbose and the output even a bit more complete, including the bias and the standard error of the statistic (the mean here):

library(dplyr)
library(tidyr)
library(purrr)
library(broom)
library(boot)

set.seed(321)
mtcars %>%
  group_by(vs) %>%
  nest() %>% 
  mutate(boot_res = map(data,
                        ~ boot(data = .$mpg,
                               statistic = function(x, i) mean(x[i]),
                               R = 1000)),
         boot_tidy = map(boot_res, tidy, conf.int = TRUE, conf.method = "perc"),
         n = map(data, nrow)) %>% 
  select(-data, -boot_res) %>% 
  unnest(cols = -vs) %>% 
  ungroup()
#> # A tibble: 2 x 7
#>      vs statistic    bias std.error conf.low conf.high     n
#>   <dbl>     <dbl>   <dbl>     <dbl>    <dbl>     <dbl> <int>
#> 1     0      16.6 -0.0115     0.843     15.0      18.3    18
#> 2     1      24.6 -0.0382     1.36      22.1      27.3    14

Created on 2020-01-22 by the reprex package (v0.3.0)

data.table concise syntax

Note, however, that, I got to a more concise syntax by using the data.table package instead of dplyr:

library(data.table)
library(magrittr)
library(boot)
library(broom)

mtcars <- mtcars %>% copy %>% setDT

set.seed(321)
mtcars[, c(n = .N,
           boot(data = mpg,
                statistic = function(x, i) mean(x[i]),
                R = 1000) %>% 
             tidy(conf.int = TRUE, conf.method = "perc")),
       by = vs]
#>    vs  n statistic        bias std.error conf.low conf.high
#> 1:  0 18  16.61667 -0.01149444 0.8425817 15.03917  18.26653
#> 2:  1 14  24.55714 -0.03822857 1.3633112 22.06429  27.32839

Created on 2020-01-23 by the reprex package (v0.3.0)

Multiple variables at once with data.table

library(data.table)
library(magrittr)
library(boot)
library(broom)

mtcars <- mtcars %>% copy %>% setDT

# Specify here the variables for which you want CIs
variables <- c("mpg", "disp") 

# Function to get the CI stats, will be applied to each column of a subset of
# data (.SD)
get_ci <- function(varb, ...){
  boot(data = varb,
       statistic = function(x, i) mean(x[i]),
       R = 1000) %>% 
    tidy(conf.int = TRUE, ...)
}

set.seed(321)
mtcars[, c(n = .N,
           lapply(.SD, get_ci) %>% 
             rbindlist(idcol = "varb")),
       by = vs, .SDcols = variables]
#>    vs  n varb statistic        bias  std.error  conf.low conf.high
#> 1:  0 18  mpg  16.61667 -0.01149444  0.8425817  15.03917  18.26653
#> 2:  0 18 disp 307.15000 -1.49692222 23.1501247 261.18766 353.04416
#> 3:  1 14  mpg  24.55714 -0.03215714  1.3800432  21.86628  27.50551
#> 4:  1 14 disp 132.45714  0.32994286 14.9070552 104.45798 163.57344

Created on 2020-01-23 by the reprex package (v0.3.0)

Valentin_Ștefan
  • 6,130
  • 2
  • 45
  • 68
2

Update tidyr 1.0.0

All the solutions given by @Valentin are viable but I wanted to hint to a new alternative which is more readable for some of you. It replaces all the summarise solutions by a relatively new [tidyr 1.0.0][1] function called unnest_wider. With that, you can simplify the code to the following:

mtcars %>% 
  nest(data = -"vs") %>%
  mutate(ci = map(data, ~ MeanCI(.x$mpg, method = "boot", R = 1000))) %>% 
  unnest_wider(ci)

which gives:

# A tibble: 2 x 5
     vs data                mean lwr.ci upr.ci
  <dbl> <list>             <dbl>  <dbl>  <dbl>
1     0 <tibble [18 × 10]>  16.6   14.7   18.5
2     1 <tibble [14 × 10]>  24.6   22.0   27.1

Calculating confidence intervals without bootstrapping is even simpler with:

mtcars %>% 
  nest(data = -"vs") %>%
  mutate(ci = map(data, ~ MeanCI(.x$mpg))) %>% 
  unnest_wider(ci)
Agile Bean
  • 6,437
  • 1
  • 45
  • 53
0

For normal distribution:

library(dplyr)
mtcars %>%
  group_by(vs) %>%
  summarise(mean.mpg = mean(mpg, na.rm = TRUE),
            sd.mpg = sd(mpg, na.rm = TRUE),
            n.mpg = n()) %>%
  mutate(se.mpg = sd.mpg / sqrt(n.mpg),
         lower.ci.mpg = mean.mpg - qnorm(0.975) * se.mpg,
         upper.ci.mpg = mean.mpg + qnorm(0.975) * se.mpg)
SysEng
  • 43
  • 8
0

Adding an answer in case others, as I did, found this post useful but are still looking for tweaks.

Here's an alternative manual solution based on @sboysel's response and using his 'nobs' function. This solution is useful if you want to summarise across groups within your data and across more than one variable (change across() to suit your data - here, it's coded for variables starting with a particular string):

output1 <- your_data_frame %>% 
  dplyr::group_by(your_grouping_variable) %>% 
  dplyr::summarise(across(starts_with("your_string"),
                          .fns = list(
                            mean = ~mean(.x, na.rm = TRUE), 
                            sd = ~sd(.x, na.rm = TRUE), 
                            se = ~sd(.x, na.rm = TRUE)/sqrt(length(.x)),
                            n = ~nobs(.x),
                            ci_l = ~mean(.x, na.rm = TRUE) - (1.96 * sd(.x, na.rm = TRUE)/sqrt(nobs(.x))),
                            ci_u = ~mean(.x, na.rm = TRUE) + (1.96 * sd(.x, na.rm = TRUE)/sqrt(nobs(.x))))))

Alternatively, use ci (as @carfisma says) from the gmodels package for more succinct code:

output2 <- your_data_frame%>% 
  dplyr::group_by(your_grouping_variable) %>% 
  dplyr::summarise(across(starts_with("your_string"),
                          .fns = list(
                            mean = ~ci(.x, na.rm=TRUE)[1],
                            se = ~ci(.x, na.rm=TRUE)[4],
                            n = ~nobs(.x),
                            ci_l = ~ci(.x, na.rm=TRUE)[2], # confidence level default is 0.95
                            ci_u = ~ci(.x, na.rm=TRUE)[3])))

N.B. note that the 4th element of the ci() output is std err, not sd as perhaps suggested in carfisma's solution.

Used dplyr version 1.0.10 and gmodels 2.18.1.1

user39683
  • 25
  • 5