2

I have long data that is students nested within classrooms. I would like to calculate various class-level statistics for each student about the classroom that they study in, but exclude the student's own data in this calculation.

A simple example would be as below:

 df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

As shown above, I six students in two classrooms, each of which has one or more observations of value. It's easy to get the student-level average with:

df %>% 
  group_by(class_id, student_id) %>% 
  summarize(value = mean(value))

or to add a classroom-level average with:

df %>% 
  group_by(class_id) %>% 
  mutate(class_avg = mean(value))

but I can't figure out how to tell dplyr to "leave out" a given group in the higher-level group level calculation. This is similar to the question asked here, but that calculates the mean of all groups except for the given group. I'm not sure how to modify this with dplyr to get what I want.

Thanks for your help.

Edit: After @akrun's request, the expected output is below (using a slightly modified version of @jared_mamrot's answer). As you can see, the class_mean_othstudents variable takes the value of the mean of the students in each class except for the given student. Jared's solution works but is a very manual approach and would only apply to getting a mean value. I am wondering if there is a dplyr way to do this more generally.

set.seed(123)

df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

df %>% 
  group_by(class_id, student_id) %>%
  summarize(student_mean = mean(value)) %>% 
  mutate(class_mean_othstudents = 
           (sum(student_mean) - student_mean)/(n() - 1)
  )

`summarise()` has grouped output by 'class_id'. You can override using the `.groups` argument.
# A tibble: 6 x 4
# Groups:   class_id [2]
  class_id student_id student_mean class_mean_othstudents
  <chr>         <dbl>        <dbl>                  <dbl>
1 a                 1       0.256                  0.907 
2 a                 2       0.0999                 0.986 
3 a                 3       1.72                   0.178 
4 b                 4      -0.402                  0.195 
5 b                 5       0.0305                -0.0211
6 b                 6       0.360                 -0.186 
jsizzle
  • 78
  • 8
  • 1
    I am confused about how you want to compute `class_avg`. Your sample code `df %>% group_by(class_id) %>% mutate(class_avg = mean(value))` returns the *overall mean* of all samples within a class but the accepted answer as well as the expected result use the *mean of means*. *Overall mean* and *mean of means* are different as explained [here](https://stats.stackexchange.com/q/133138/113809), e.g. So, please, let us know what definition you prefer. Thank you. – Uwe Jul 13 '21 at 16:28
  • Thanks @Uwe, this is a good point for clarification. I could see some use cases where you would want the overall mean (it would allow you to weight students with more observed data more heavily), but for me, I was definitely more interested in the means of students, and hence the means of means. – jsizzle Jul 13 '21 at 19:18

3 Answers3

2

Based on the update, we may loop over the row_number(), get the 'student_mean' values that are not from the current row, get the mean

library(dplyr)
library(purrr)
df %>% 
  group_by(class_id, student_id) %>%
  summarize(student_mean = mean(value), .groups = 'drop_last') %>% 
  mutate(class_mean_othstudents = map_dbl(row_number(), ~ 
          mean(student_mean[-.x]))) %>%
  ungroup

-output

# A tibble: 6 x 4
  class_id student_id student_mean class_mean_othstudents
  <chr>         <dbl>        <dbl>                  <dbl>
1 a                 1       0.256                  0.907 
2 a                 2       0.0999                 0.986 
3 a                 3       1.72                   0.178 
4 b                 4      -0.402                  0.195 
5 b                 5       0.0305                -0.0211
6 b                 6       0.360                 -0.186 
akrun
  • 874,273
  • 37
  • 540
  • 662
  • I don't think this quite answers my question. Here the class_avg variable is the average of all the students in the class, rather than all the students *except* for the given student. So class_avg for student_id 1 should be the mean of student_id 2 and 3, class_avg for student 2 should be the mean of student_id 1 and 3, etc. – jsizzle Jul 13 '21 at 01:08
  • @jsizzle you can check the update. I didnt understand your comment about `manual` approach with the other solution – akrun Jul 13 '21 at 01:51
  • 1
    Thanks so much, this is super helpful. My comment about a "manual" approach was in reference to @jared_mamrot (very useful!) comment. This relies on a hand calculation of the mean and is not a general solution that would easily allow a different . Your answer is more general because it allows us to call the "mean" function directly on student_mean[-.x] and calculate other statistics by calling a different function there. – jsizzle Jul 13 '21 at 02:48
2

Depending on the statistics you want for each classroom you could calculate them 'manually', e.g. classroom_mean = sum(x) / n; classroom_mean_excluding_the_student_in_question = sum(x) - x / n - 1

E.g.

library(tidyverse)
set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)

df %>% 
  group_by(class_id, student_id)  %>% 
  summarise(student_mean = mean(value)) %>% 
  mutate(class_mean_exc_this_student = (
    sum(student_mean) - student_mean)/(n() - 1)
    )
#> `summarise()` has grouped output by 'class_id'. You can override using the `.groups` argument.
#> # A tibble: 6 x 4
#> # Groups:   class_id [2]
#>   class_id student_id student_mean class_mean_exc_this_student
#>   <chr>         <dbl>        <dbl>                       <dbl>
#> 1 a                 1       0.256                       0.907 
#> 2 a                 2       0.0999                      0.986 
#> 3 a                 3       1.72                        0.178 
#> 4 b                 4      -0.402                       0.195 
#> 5 b                 5       0.0305                     -0.0211
#> 6 b                 6       0.360                      -0.186  

Created on 2021-07-13 by the reprex package (v2.0.0)

jared_mamrot
  • 22,354
  • 4
  • 21
  • 46
  • This definitely works for the "mean" case and is a clever workaround. I think is very similar to this post (https://stackoverflow.com/questions/35858876/calculate-group-mean-while-excluding-current-observation-using-dplyr). It is not a general solution for an unspecified function. Does that exist? – jsizzle Jul 13 '21 at 01:41
1

Here is a solution which computes student_mean and class_mean_othstudents bottom up as overall mean. The result differs from the other answers posted so far which use the mean of means to compute class_mean_othstudents:

library(data.table)
setDT(df)[, lapply(unique(student_id), 
                   \(sid) .(student_id = sid, 
                            student_mean = mean(value[student_id == sid]),
                            class_mean_othstudents = mean(value[student_id != sid]))) |>
            rbindlist(), 
          by = .(class_id)]
   class_id student_id student_mean class_mean_othstudents
1:        a          1   0.25601839              0.6382870
2:        a          2   0.09989806              0.6207800
3:        a          3   1.71506499              0.1935703
4:        b          4  -0.40207251              0.1128452
5:        b          5   0.03052233             -0.1481104
6:        b          6   0.35981383             -0.1425156

For the sake of completeness and for comparison with the other answers here is the version which is using means of means:

library(data.table)
setDT(df)[, .(student_mean = mean(value)), by = .(class_id, student_id)][
  , class_student_mean := 
    .SD[, sapply(student_id, \(sid) mean(student_mean[student_id != sid])), 
        by = class_id]$V1][]
   class_id student_id student_mean class_student_mean
1:        a          1   0.25601839         0.90748153
2:        a          2   0.09989806         0.98554169
3:        a          3   1.71506499         0.17795823
4:        b          4  -0.40207251         0.19516808
5:        b          5   0.03052233        -0.02112934
6:        b          6   0.35981383        -0.18577509

This result is in line with the other two answers which are based on means of means

Data

Note that the same seed as in Jared's answer and in OP's edit ist used.

set.seed(123)
df <- data.frame(
  class_id = c(rep("a", 6), rep("b", 6)),
  student_id = c(rep(1, 3), rep(2, 2), rep(3, 1), rep(4, 2), rep(5, 3), rep(6, 1)),
  value = rnorm(12)
)
Uwe
  • 41,420
  • 11
  • 90
  • 134