Denominator when using two variables in group_by in R tidyverse

Question

I want to calculate the mean and standard deviation of contacts for twenty types of hospital services in two arms of a trial. So far I have done this using group_by(arm, service). This gives the average among the people who use that service in that arm. What my boss wants instead is the total for each service divided by the number of people in that arm.

So, if there are 100 cardiology contacts and 30 patients in each arm, but only 10 of them attend a cardiology appointment, the calculation should be 100/30 rather than 100/10. The only way I can think of doing it is splitting the arms out into separate datasets, so that I would only need to group by service, which solves the problem.

An example of what this looks like:

rep_prob <- tibble(id = 1:6, arm = c(1,1,1,0,0,0), service = c(1,1,2,1,2,2), contacts = c(21,3,14, 2,5,10)) %>% 
  group_by(arm, service) %>% 
  summarise(mean = mean(contacts), sd = sd(contacts))

Which gives results that look like this:

arm  service  mean   sd
0     1        2.0   NaN
0     2        7.5   3.535534
1     1        12.0  12.727922
1     2        14.0  NaN

Instead, I want the mean and SD of each service calculated over the arm as a whole, not over the subgroup defined by arm and service.

This is apparently very easy in Stata and I am the only person in my department who uses R. For all my other results tables I am only slicing my table by one variable and so using group_by(arm) and then summarising works.

Please make your question [reproducible](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) to avoid it being erroneously marked as a duplicate. — NelsonGon, Jun 05 '19 at 15:35
I'm also having a very difficult time understanding what exactly you're trying to do. A more fully fleshed out example with some example data and code, rather than just describing it in words, might help. — joran, Jun 05 '19 at 15:36
May be you need to `dat %>% group_by(service) %>% mutate(n = n()) %>% group_by(arm, add = TRUE) %>% mutate(n1 = n/n())` — akrun, Jun 05 '19 at 15:37
no, my bad: after re-reading the question you're right... though it is a bit weird — Pierre Gramme, Jun 05 '19 at 16:50
I've added in some example code and hopefully clarified some things. I'll try akrun's code but I'm not sure that it works with summarise, and as I need to do more than just a mean I would like to use the stats functions within R to do the calculations if possible. — Sarah Roberts, Jun 07 '19 at 09:45

Joris C. · Answer 1 · 2019-06-07T19:40:12.903

Perhaps what you are after is along the lines of:

library(tidyverse)

dat <- tibble(
    id = 1:6, 
    arm = c(1,1,1,0,0,0), 
    service = c(1,1,2,1,2,2), 
    contacts = c(21,3,14, 2,5,10)
) 

rep_prob <- dat %>% 
    group_by(arm, service) %>% 
    # total contacts for each arm/service combination
    mutate(sum = sum(contacts)) %>%
    group_by(arm) %>%
    # divide by the number of rows (patients) in the arm
    mutate(mean = sum / n()) %>%
    ungroup()

This divides the total contacts for each combination of arm and service by the number of observations in that arm. The definition of the sd would depend on how the observations are centered (i.e. how the sample mean is defined per group).
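One possible way to make the SD concrete is sketched below. It rests on an assumption that is not stated in the question: a patient with no row for a service is treated as having had zero contacts with it. Under that assumption, `tidyr::complete()` can add the missing patient/service rows with `contacts = 0`, after which the usual `mean()` and `sd()` inside `summarise()` are taken over everyone in the arm. The names `dat_full` and `arm_stats` are only illustrative.

```r
library(dplyr)
library(tidyr)

dat <- tibble(
  id = 1:6,
  arm = c(1, 1, 1, 0, 0, 0),
  service = c(1, 1, 2, 1, 2, 2),
  contacts = c(21, 3, 14, 2, 5, 10)
)

# Assumption: a missing patient/service row means zero contacts.
# nesting(id, arm) keeps each patient in their own arm while
# every patient is crossed with every observed service.
dat_full <- dat %>%
  complete(nesting(id, arm), service, fill = list(contacts = 0))

# Mean and SD are now taken over all patients in the arm,
# including those who never used the service.
arm_stats <- dat_full %>%
  group_by(arm, service) %>%
  summarise(mean = mean(contacts), sd = sd(contacts), .groups = "drop")

arm_stats
```

With the example data each arm has three patients, so the means become 2/3 and 5 for arm 0 and 8 and 14/3 for arm 1. Whether zero-filling is appropriate depends on how the trial data were recorded, so treat this as a sketch rather than a definitive method.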

NB: splitting dat into separate datasets by the variable arm and grouping by service would give the same results as grouping by both arm and service directly, which is probably not what you have in mind.

Edit: if you prefer to use summarise, you could also rearrange expressions as:

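rep_prob <- dat %>% 
   group_by(arm) %>% 
   # scale each observation by the number of rows (patients) in its arm
   mutate(contacts_scaled = contacts / n()) %>%
   # .add = TRUE keeps the arm grouping (dplyr >= 1.0.0; older versions use add = TRUE)
   group_by(service, .add = TRUE) %>%
   summarise(mean = sum(contacts_scaled)) %>%
   ungroup()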
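If real data contain more than one row per patient, the number of rows in an arm is no longer the number of people in it. A hedged alternative, assuming `id` identifies a patient, is to compute the arm sizes separately with `n_distinct(id)` and join them back on. The names `arm_sizes`, `n_arm` and `per_arm_means` are made up for this sketch.

```r
library(dplyr)

dat <- tibble(
  id = 1:6,
  arm = c(1, 1, 1, 0, 0, 0),
  service = c(1, 1, 2, 1, 2, 2),
  contacts = c(21, 3, 14, 2, 5, 10)
)

# Denominator: number of distinct patients in each arm
# (assumes `id` uniquely identifies a patient).
arm_sizes <- dat %>%
  group_by(arm) %>%
  summarise(n_arm = n_distinct(id), .groups = "drop")

# Numerator: total contacts per arm and service,
# then divide by the size of the whole arm.
per_arm_means <- dat %>%
  group_by(arm, service) %>%
  summarise(total = sum(contacts), .groups = "drop") %>%
  left_join(arm_sizes, by = "arm") %>%
  mutate(mean = total / n_arm)

per_arm_means
```

For the example data this gives 24 / 3 = 8 for service 1 in arm 1, rather than the 12 obtained when dividing by the two patients who used that service.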
