mean by group, excluding selected rows

Question

I'll take this old post as reference. So, the modified dataset looks like the following:

df <- data.frame(dive = factor(sample(c("dive1","dive2","dive3","dive4"), 14, replace=TRUE)),
                 speed = runif(14)
                 )
> df
     dive       speed
1  dive1 0.627296799
2  dive1 0.288594538
3  dive4 0.598177856
4  dive2 0.371158436
5  dive2 0.827468739
6  dive3 0.485977449
7  dive2 0.151295215
8  dive4 0.773988372
9  dive2 0.567155356
10 dive1 0.008585884
11 dive4 0.433648371
12 dive2 0.759196515
13 dive2 0.641193241
14 dive3 0.089451537

I would like to modify the column speed so that it contains the per-group mean (the same value in every row of the group) for dive1 and dive2, while leaving the rows of the other two groups unchanged.

I tried using if (together with group_by and summarise), but the result is not what I want: I get a warning message and only 4 rows.

df2 <- if(!(df$dive %in% c("dive3", "dive4"))){
  summarise(group_by(df, dive), speed = mean(speed))
} 

Warning message:
In if (!(df$dive %in% c("dive3", "dive4"))) { :
  the condition has length > 1 and only the first element will be used

> df2
# A tibble: 4 x 2
  dive  speed
  <fct> <dbl>
1 dive1 0.860
2 dive2 0.460
3 dive3 0.277
4 dive4 0.330

(1) It's a warning, not an error (though it is an error from R 4.2.0 onward). (2) `if` always requires a condition of exactly length 1 (see https://stackoverflow.com/q/14170778/3358272, https://stackoverflow.com/q/10374932/3358272), but `df$dive %in% ...` returns a vector of length `nrow(df)`. If you intend that determination to be made per group, within the `group_by` and `summarize`, then see the last code block in my answer. If that doesn't work, please add your expected output (as a literal frame object) to your question. Thanks! – r2evans, Apr 19 '23 at 15:31
It's common (and easy) to mistake one for the other, though some warnings _should_ be errors (and vice versa, though more rarely). – r2evans, Apr 19 '23 at 18:05
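
To illustrate the point about condition length, here is a minimal base-R sketch. The small `dive` vector is made up for illustration and is not the data from the question. `%in%` returns one logical value per element, so the result has to be collapsed (for example with `any()` or `all()`) before `if` can use it.

```r
# Hypothetical toy vector, not the data from the question
dive <- factor(c("dive1", "dive1", "dive3", "dive2"))

# %in% is vectorised: it yields one TRUE/FALSE per element
keep <- !(dive %in% c("dive3", "dive4"))
keep          # TRUE TRUE FALSE TRUE
length(keep)  # 4, not 1

# `if` needs a single TRUE/FALSE, so reduce the vector first
if (any(keep))  message("at least one row is outside dive3/dive4")
if (!all(keep)) message("at least one row is in dive3/dive4")
```

Which reduction is appropriate depends on the question being asked of the data; for a per-group decision, the test belongs inside the grouped operation instead.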

r2evans · Accepted Answer · 2023-04-19T15:28:47.660

library(dplyr)

df %>%
  group_by(dive) %>%
  mutate(speed = if (first(dive) %in% c("dive1", "dive2")) mean(speed) else speed) %>%
  ungroup()
# # A tibble: 14 × 2
#    dive   speed
#    <fct>  <dbl>
#  1 dive4 0.548 
#  2 dive3 0.156 
#  3 dive4 0.207 
#  4 dive3 0.148 
#  5 dive4 0.886 
#  6 dive1 0.498 
#  7 dive3 0.690 
#  8 dive1 0.498 
#  9 dive4 0.0968
# 10 dive3 0.596 
# 11 dive2 0.447 
# 12 dive2 0.447 
# 13 dive3 0.859 
# 14 dive3 0.663

or, perhaps a little shorter, using the `.by` argument (available in dplyr 1.1.0 and later):

df %>%
  mutate(speed = if (first(dive) %in% c("dive1", "dive2")) mean(speed) else speed,
         .by = dive)

If I misunderstood, and you instead want to reduce each of the two groups to a single row while keeping the other groups as-is (not reduced), then perhaps:

df %>%
  filter(dive %in% c("dive1", "dive2")) %>%
  summarize(speed = mean(speed), .by = dive) %>%
  bind_rows(filter(df, !dive %in% c("dive1", "dive2")))
#     dive     speed
# 1  dive1 0.4983562
# 2  dive2 0.4470575
# 3  dive4 0.5477776
# 4  dive3 0.1558491
# 5  dive4 0.2068528
# 6  dive3 0.1479428
# 7  dive4 0.8858552
# 8  dive3 0.6896862
# 9  dive4 0.0967569
# 10 dive3 0.5961494
# 11 dive3 0.8593978
# 12 dive3 0.6634452

edited Apr 19 '23 at 15:28

answered Apr 19 '23 at 15:04

r2evans


Exactly this! I was indeed looking for a way to use `if` and `mutate`, but I could not work out how to do it! – Matteo Bulgarelli Apr 19 '23 at 16:34
Why do I need `first()`? – Matteo Bulgarelli Apr 19 '23 at 17:11
I see, because `if` takes conditions of length 1 only. This is quite counterintuitive, though. – Matteo Bulgarelli Apr 19 '23 at 17:29
You _could_ certainly change `if (first(dive) %in% ...) ... else ...` into `ifelse(dive %in% c(..), mean(speed), speed)`. If you prefer that, go with it. Using `if` is more efficient here: we know with certainty that `dive` is invariant within a group, so it is enough to check one value instead of all of them. – r2evans Apr 19 '23 at 18:07
Ohhh this is a great explanation, I'll remember this. Very very useful! – Matteo Bulgarelli Apr 19 '23 at 18:30
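
For readers comparing the two variants discussed in the comments above, here is a small sketch that runs both side by side. The `toy` data frame and the `targets` vector are hypothetical names made up for illustration; only `if (first(dive) %in% ...)` and `ifelse(dive %in% ..., mean(speed), speed)` come from the thread.

```r
library(dplyr)

# Hypothetical toy data, not the data from the question
toy <- data.frame(
  dive  = c("dive1", "dive1", "dive2", "dive3", "dive3"),
  speed = c(1, 3, 5, 2, 4)
)
targets <- c("dive1", "dive2")

# Scalar test: one check per group; relies on dive being constant within a group
with_if <- toy %>%
  group_by(dive) %>%
  mutate(speed = if (first(dive) %in% targets) mean(speed) else speed) %>%
  ungroup()

# Vectorised test: one check per row; ifelse() chooses element-wise
with_ifelse <- toy %>%
  group_by(dive) %>%
  mutate(speed = ifelse(dive %in% targets, mean(speed), speed)) %>%
  ungroup()

with_if$speed      # 2 2 5 2 4
with_ifelse$speed  # 2 2 5 2 4
```

Both give the same column here. The `if` form evaluates the membership test once per group, whereas `ifelse()` evaluates it for every row.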
