How can I calculate a group-level statistic omitting a given sub-group nested within the group?

Question

I have long data that is students nested within classrooms. I would like to calculate various class-level statistics for each student about the classroom that they study in, but exclude the student's own data in this calculation.

A simple example would be as below:

 df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

As shown above, I six students in two classrooms, each of which has one or more observations of value. It's easy to get the student-level average with:

df %>% 
  group_by(class_id, student_id) %>% 
  summarize(value = mean(value))

or to add a classroom-level average with:

df %>% 
  group_by(class_id) %>% 
  mutate(class_avg = mean(value))

but I can't figure out how to tell dplyr to "leave out" a given group in the higher-level group level calculation. This is similar to the question asked here, but that calculates the mean of all groups except for the given group. I'm not sure how to modify this with dplyr to get what I want.

Thanks for your help.

Edit: After @akrun's request, the expected output is below (using a slightly modified version of @jared_mamrot's answer). As you can see, the class_mean_othstudents variable takes the value of the mean of the students in each class except for the given student. Jared's solution works but is a very manual approach and would only apply to getting a mean value. I am wondering if there is a dplyr way to do this more generally.

set.seed(123)

df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

df %>% 
  group_by(class_id, student_id) %>%
  summarize(student_mean = mean(value)) %>% 
  mutate(class_mean_othstudents = 
           (sum(student_mean) - student_mean)/(n() - 1)
  )

`summarise()` has grouped output by 'class_id'. You can override using the `.groups` argument.
# A tibble: 6 x 4
# Groups:   class_id [2]
  class_id student_id student_mean class_mean_othstudents
  <chr>         <dbl>        <dbl>                  <dbl>
1 a                 1       0.256                  0.907 
2 a                 2       0.0999                 0.986 
3 a                 3       1.72                   0.178 
4 b                 4      -0.402                  0.195 
5 b                 5       0.0305                -0.0211
6 b                 6       0.360                 -0.186

I am confused about how you want to compute `class_avg`. Your sample code `df %>% group_by(class_id) %>% mutate(class_avg = mean(value))` returns the *overall mean* of all samples within a class but the accepted answer as well as the expected result use the *mean of means*. *Overall mean* and *mean of means* are different as explained [here](https://stats.stackexchange.com/q/133138/113809), e.g. So, please, let us know what definition you prefer. Thank you. — Uwe, Jul 13 '21 at 16:28
Thanks @Uwe, this is a good point for clarification. I could see some use cases where you would want the overall mean (it would allow you to weight students with more observed data more heavily), but for me, I was definitely more interested in the means of students, and hence the means of means. — jsizzle, Jul 13 '21 at 19:18

akrun · Accepted Answer · 2021-07-13T01:51:01.107

2

Based on the update, we may loop over the row_number(), get the 'student_mean' values that are not from the current row, get the mean

library(dplyr)
library(purrr)
df %>% 
  group_by(class_id, student_id) %>%
  summarize(student_mean = mean(value), .groups = 'drop_last') %>% 
  mutate(class_mean_othstudents = map_dbl(row_number(), ~ 
          mean(student_mean[-.x]))) %>%
  ungroup

-output

# A tibble: 6 x 4
  class_id student_id student_mean class_mean_othstudents
  <chr>         <dbl>        <dbl>                  <dbl>
1 a                 1       0.256                  0.907 
2 a                 2       0.0999                 0.986 
3 a                 3       1.72                   0.178 
4 b                 4      -0.402                  0.195 
5 b                 5       0.0305                -0.0211
6 b                 6       0.360                 -0.186

edited Jul 13 '21 at 01:51

answered Jul 12 '21 at 23:23

akrun

874,273
37
540
662

I don't think this quite answers my question. Here the class_avg variable is the average of all the students in the class, rather than all the students *except* for the given student. So class_avg for student_id 1 should be the mean of student_id 2 and 3, class_avg for student 2 should be the mean of student_id 1 and 3, etc. – jsizzle Jul 13 '21 at 01:08
@jsizzle you can check the update. I didnt understand your comment about `manual` approach with the other solution – akrun Jul 13 '21 at 01:51
1

Thanks so much, this is super helpful. My comment about a "manual" approach was in reference to @jared_mamrot (very useful!) comment. This relies on a hand calculation of the mean and is not a general solution that would easily allow a different . Your answer is more general because it allows us to call the "mean" function directly on student_mean[-.x] and calculate other statistics by calling a different function there. – jsizzle Jul 13 '21 at 02:48

jared_mamrot · Answer 2 · 2021-07-13T00:21:58.917

Depending on the statistics you want for each classroom you could calculate them 'manually', e.g. classroom_mean = sum(x) / n; classroom_mean_excluding_the_student_in_question = sum(x) - x / n - 1

E.g.

library(tidyverse)
set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

df %>% 
  group_by(class_id, student_id)  %>% 
  summarise(student_mean = mean(value)) %>% 
  mutate(class_mean_exc_this_student = (
    sum(student_mean) - student_mean)/(n() - 1)
    )
#> `summarise()` has grouped output by 'class_id'. You can override using the `.groups` argument.
#> # A tibble: 6 x 4
#> # Groups:   class_id [2]
#>   class_id student_id student_mean class_mean_exc_this_student
#>   <chr>         <dbl>        <dbl>                       <dbl>
#> 1 a                 1       0.256                       0.907 
#> 2 a                 2       0.0999                      0.986 
#> 3 a                 3       1.72                        0.178 
#> 4 b                 4      -0.402                       0.195 
#> 5 b                 5       0.0305                     -0.0211
#> 6 b                 6       0.360                      -0.186

^{Created on 2021-07-13 by the reprex package (v2.0.0)}

This definitely works for the "mean" case and is a clever workaround. I think is very similar to this post (https://stackoverflow.com/questions/35858876/calculate-group-mean-while-excluding-current-observation-using-dplyr). It is not a general solution for an unspecified function. Does that exist? — jsizzle, Jul 13 '21 at 01:41

Uwe · Answer 3 · 2021-07-13T17:04:06.197

Here is a solution which computes student_mean and class_mean_othstudents bottom up as overall mean. The result differs from the other answers posted so far which use the mean of means to compute class_mean_othstudents:

library(data.table)
setDT(df)[, lapply(unique(student_id), 
                   \(sid) .(student_id = sid, 
                            student_mean = mean(value[student_id == sid]),
                            class_mean_othstudents = mean(value[student_id != sid]))) |>
            rbindlist(), 
          by = .(class_id)]

   class_id student_id student_mean class_mean_othstudents
1:        a          1   0.25601839              0.6382870
2:        a          2   0.09989806              0.6207800
3:        a          3   1.71506499              0.1935703
4:        b          4  -0.40207251              0.1128452
5:        b          5   0.03052233             -0.1481104
6:        b          6   0.35981383             -0.1425156

For the sake of completeness and for comparison with the other answers here is the version which is using means of means:

library(data.table)
setDT(df)[, .(student_mean = mean(value)), by = .(class_id, student_id)][
  , class_student_mean := 
    .SD[, sapply(student_id, \(sid) mean(student_mean[student_id != sid])), 
        by = class_id]$V1][]

   class_id student_id student_mean class_student_mean
1:        a          1   0.25601839         0.90748153
2:        a          2   0.09989806         0.98554169
3:        a          3   1.71506499         0.17795823
4:        b          4  -0.40207251         0.19516808
5:        b          5   0.03052233        -0.02112934
6:        b          6   0.35981383        -0.18577509

This result is in line with the other two answers which are based on means of means

Data

Note that the same seed as in Jared's answer and in OP's edit ist used.

set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

How can I calculate a group-level statistic omitting a given sub-group nested within the group?

3 Answers3

Data