When I use my own function in a group_by() and summarize() chain, it incorrectly returns the same result for each group

Question

I'm sure I'm missing something about how grouping works. When I use my own function within a summarize statement (after grouping), I get the same result for each group, which is wrong. I also don't get any errors or warnings; it just silently gives me the wrong answer.

My goal is to get this custom function to play nice with group_by.

Here is the code:

library(dplyr)

#data
transect <- data.frame(acronym  = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
                       quad_id = c(1, 1, 1, 1, 2, 2, 2))
#scores
c_scores <- data.frame(acronym  = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
                       c = c(5, 6, 6, 10))

#custom fun
my_fun <- function(data, scores){
  join <- left_join(data, scores, by = "acronym")
  mean <- mean(join$c)
  return(mean)
}

#this works
my_fun(transect, c_scores)

#this also works
transect %>% my_fun(., c_scores)

#this doesn't...
transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(., scores = c_scores))

this is my result:

quad_id	mean_c
1	6.29
2	6.29

this is what I want:

quad_id	mean_c
1	6.75
2	5.67

akrun · Accepted Answer · 2022-10-27T18:02:17.600

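We may pass cur_data() to the function instead of `.`. Inside summarise(), `.` is still the magrittr placeholder for the whole data frame coming out of the pipe, not the subset of rows in the current group, so my_fun() sees all rows every time. cur_data() returns only the current group's data.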

library(dplyr)
transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = my_fun(cur_data(), scores = c_scores))

-output

# A tibble: 2 × 2
  quad_id mean_c
    <dbl>  <dbl>
1       1   6.75
2       2   5.67
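
To see the difference directly, here is a minimal sketch that counts rows rather than computing a mean. It reuses the transect data from the question; the column names rows_in_dot and rows_in_group are only illustrative.

```r
library(dplyr)

# Same example data as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)

row_counts <- transect %>%
  group_by(quad_id) %>%
  summarise(
    rows_in_dot   = nrow(.),  # `.` is the full piped data frame, so 7 for every group
    rows_in_group = n()       # n() sees only the current group: 4, then 3
  )

row_counts
```

Because `.` always holds all 7 rows, any function that receives it computes the same overall value for each group.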

If we want a message when the function receives a grouped data frame, we can check for that with is_grouped_df()

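my_fun2 <- function(data, scores) {
  # Tell the caller when a grouped data frame was passed in directly
  if (dplyr::is_grouped_df(data)) {
    message("data is grouped, so use cur_data() as data")
  }

  # Look up each acronym's score and average the scores
  left_join(data, scores, by = "acronym") %>%
    pull(c) %>%
    mean()
}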

-testing

 > transect %>%
 +   group_by(quad_id) %>%
 +   summarise(mean_c = my_fun2(., scores = c_scores))
 data is grouped, so use cur_data() as data
 data is grouped, so use cur_data() as data
 # A tibble: 2 × 2
   quad_id mean_c
     <dbl>  <dbl>
 1       1   6.29
 2       2   6.29
 > transect %>%
 +   group_by(quad_id) %>%
 +   summarise(mean_c = my_fun2(cur_data(), scores = c_scores))
 # A tibble: 2 × 2
   quad_id mean_c
     <dbl>  <dbl>
 1       1   6.75
 2       2   5.67

Note that the message is repeated because summarise() evaluates the expression once per group (twice here), and each time `.` is still the whole grouped data frame. If we call the function outside summarise(), it runs once, so the message is printed once

> transect %>% 
    group_by(quad_id) %>% 
    my_fun2(., c_scores)
data is grouped, so use cur_data() as data
[1] 6.285714

If we want a single function, we may also do

my_fun3 <- function(data, scores, grps = NULL) {
  data <- left_join(data, scores, by = "acronym")

  # Group only when the caller supplied grouping column names
  if (!missing(grps)) {
    data <- data %>%
      group_by(across(all_of(grps)))
  }

  data %>%
    summarise(mean_c = mean(c, na.rm = TRUE))
}

-testing

>  my_fun3(transect, c_scores, "quad_id")
# A tibble: 2 × 2
  quad_id mean_c
    <dbl>  <dbl>
1       1   6.75
2       2   5.67
> 
> my_fun3(transect, c_scores)
    mean_c
1 6.285714

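Or we can drop the if/missing() check by using any_of() in group_by(). With the default grps = NULL, any_of() selects no columns, so no grouping is applied and a single overall mean is returned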

my_fun3 <- function(data, scores, grps = NULL) {
  left_join(data, scores, by = "acronym") %>%
    group_by(across(any_of(grps))) %>%  # no columns selected when grps is NULL
    summarise(mean_c = mean(c, na.rm = TRUE))
}

-testing

> my_fun3(transect, c_scores, "quad_id")
# A tibble: 2 × 2
  quad_id mean_c
    <dbl>  <dbl>
1       1   6.75
2       2   5.67
> my_fun3(transect, c_scores)
# A tibble: 1 × 1
  mean_c
   <dbl>
1   6.29
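
In dplyr 1.1.0 and later, cur_data() is deprecated in favour of pick(). The following is a minimal sketch of the same summarise() call, assuming dplyr >= 1.1.0 is installed. pick(everything()) selects the current group's non-grouping columns, which here is just acronym. my_fun() is the question's function with its local variables renamed so they don't mask mean().

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where pick() is available

# Same example data as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)

my_fun <- function(data, scores) {
  joined <- left_join(data, scores, by = "acronym")
  mean(joined$c)
}

res_pick <- transect %>%
  group_by(quad_id) %>%
  # pick(everything()) is evaluated per group, unlike `.`
  summarise(mean_c = my_fun(pick(everything()), scores = c_scores))

res_pick
```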

Thanks @akrun! This is super helpful! I was wondering if there is anything I can change inside the function to make it 1. behave more like a typical group_by friendly function or 2. create a warning when it's used with group_by? My actual function is a lot more complicated and will be used by other people...it seems like mistakes could easily be made. Also, I'm new to stack, let me know if this should be a different question. — ifoxfoot, Oct 27 '22 at 17:15
@ifoxfoot your conditions are not clear. If you want to check for grouped, you can use `dplyr::is_grouped_df(data)` assuming the data is grouped — akrun, Oct 27 '22 at 17:24
Okay, thanks, that helps with condition 2. To rephrase condition 1, what can I do to my function to make it work without `cur_data()`? Is it even possible? Like how `n()` and `mean()` work inside summarize without `cur_data()`. I don't understand the mechanism that makes them different from my function...I'm pretty sure I'm just missing something about how `group_by()` works. Or maybe how `.` works — ifoxfoot, Oct 27 '22 at 17:51
@ifoxfoot Why can't you use `cur_data()` in all cases i.e. if it is not grouped, it will be the full dataset or if it is grouped, then only the grouped data? — akrun, Oct 27 '22 at 17:54
@ifoxfoot Also, assuming that you are creating a functional flow, can't you make a wrapper function that groups or not and calls my_fun2 in summarise? In that case you may just need `my_main_fun(transect, c_scores, grps = "quad_id")` — akrun, Oct 27 '22 at 17:55
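
On the question raised in the comments about why `mean()` and `n()` work inside summarize without `cur_data()`: functions like `mean()` take column vectors, and summarise() evaluates column names per group, so they only ever see the current group's values. One possible way to get the same behaviour is to have the custom function accept a vector instead of a data frame. The sketch below is an illustration, not part of the accepted answer; the helper name mean_c_score is hypothetical, and it assumes each acronym appears at most once in the score table.

```r
library(dplyr)

# Same example data as in the question
transect <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE", "ABEESC", "ABIBAL", "AMMBRE"),
  quad_id = c(1, 1, 1, 1, 2, 2, 2)
)
c_scores <- data.frame(
  acronym = c("ABEESC", "ABIBAL", "AMMBRE", "ANTELE"),
  c = c(5, 6, 6, 10)
)

# Hypothetical helper: takes a vector of acronyms rather than a data frame
mean_c_score <- function(acronym, scores) {
  # match() finds the row of each acronym in the score table (NA if absent)
  idx <- match(acronym, scores$acronym)
  mean(scores$c[idx])
}

# Grouped: `acronym` is sliced per group by summarise()
by_quad <- transect %>%
  group_by(quad_id) %>%
  summarise(mean_c = mean_c_score(acronym, c_scores))

# Ungrouped: the same call gives the overall mean
overall <- transect %>%
  summarise(mean_c = mean_c_score(acronym, c_scores))
```

Because the function receives the acronym column rather than `.`, the same call works with or without group_by(), and there is no grouped data frame to check for.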
