Accessing grouped subset in dplyr

Question

I have the feeling this has already been asked several times, but I cannot get it to work in my case, and I don't know why.

I group_by my data frame and calculate a mean per group. Additionally, I have marked one specific row in each group, and I want to calculate the ratio of that highlighted row's value to the freshly calculated mean of its group.

library(dplyr)
df <- data.frame(int=c(5:1,4:1),
                 highlight=c(T,F,F,F,F,F,T,F,F),
                 exp=c('a','a','a','a','a','b','b','b','b'))

df %>%
  group_by(exp) %>%
  summarise(mean=mean(int),
            l1=nrow(.),
            ratio_mean=.[.$highlight, 'int']/mean)

But for some reason, . is not the subset of group_by but the complete input. Am I missing something here?

My expected output would be

  exp    mean ratio_mean
  <fct> <dbl>      <dbl>
1 a       3         1.67
2 b       2.5       1.2

Use `n()` to count the number of rows in a subgroup. `.` refers to the piped input, i.e. the whole dataset — kath, Aug 17 '18 at 08:53
You can use `do()`: within it, `.` will refer to the subset data frame. See e.g. https://stackoverflow.com/questions/48182815/when-to-use-do-function-in-dplyr — Mikko Marttila, Aug 17 '18 at 08:55
So how can I access the subset but not the input? Or do I need to group by `highlight` and calculate the mean with `. %>% group_by(exp) %>% summarise(mean=mean(int))`? — drmariod, Aug 17 '18 at 08:55
@kath just solved it... I haven't seen this for some reason. Stupid question if I think about it now :-) Feel free to post it as an answer! — drmariod, Aug 17 '18 at 09:03

score 4 · Accepted Answer · answered Aug 17 '18 at 09:11

This works:

df %>%
  group_by(exp) %>%
  summarise(mean = mean(int),
            l1 = n(),
            ratio_mean = int[highlight] / mean)

But what's going wrong with your solution?

nrow(.) counts the number of rows of your whole input data frame, whereas n() counts only the rows per group.
.[.$highlight, 'int']/mean here again uses the whole input data frame and subsets it by the highlight column, but the result gets divided by the correct group mean. You are actually returning two values per group here, because two rows of your original df have highlight = TRUE. This causes the nasty NA column name.
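
To make the difference concrete, here is a minimal sketch that puts both counts side by side. It assumes the magrittr pipe (%>%) from the question; the column names rows_in_group, rows_in_input and highlighted are made up for illustration.

```r
library(dplyr)

# Same data as in the question, with TRUE/FALSE spelled out
df <- data.frame(int = c(5:1, 4:1),
                 highlight = c(TRUE, FALSE, FALSE, FALSE, FALSE,
                               FALSE, TRUE, FALSE, FALSE),
                 exp = c("a", "a", "a", "a", "a", "b", "b", "b", "b"))

res <- df %>%
  group_by(exp) %>%
  summarise(rows_in_group = n(),          # evaluated once per group: 5 and 4
            rows_in_input = nrow(.),      # `.` is the whole piped input: 9 and 9
            highlighted = int[highlight]) # bare column names see only the group

print(res)
```

Bare column names such as int and highlight are looked up in the current group, while `.` is filled in by the pipe before summarise() ever runs, so it cannot know about groups.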

To keep the approach with ., we could use do() as suggested by @MikkoMarttila, because inside do() the . refers to the subset of the current group. This gets a little bit clunky, though:

df %>% 
  group_by(exp) %>% 
  do(summarise(., mean = mean(.$int),
               l1 = nrow(.),
               ratio_mean = .$int[.$highlight] / mean))
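
As a possible alternative to do(), the same idea can be sketched with group_modify(). This assumes a dplyr version that provides group_modify(), which is newer than the one this answer was written against; the argument names group_df and key are arbitrary.

```r
library(dplyr)

df <- data.frame(int = c(5:1, 4:1),
                 highlight = c(TRUE, FALSE, FALSE, FALSE, FALSE,
                               FALSE, TRUE, FALSE, FALSE),
                 exp = c("a", "a", "a", "a", "a", "b", "b", "b", "b"))

out <- df %>%
  group_by(exp) %>%
  # group_df is the data frame of the current group, key holds its exp value
  group_modify(function(group_df, key) {
    summarise(group_df,
              mean = mean(int),
              l1 = nrow(group_df),             # rows of this group only
              ratio_mean = int[highlight] / mean)
  })

print(out)
```

Here the group's data frame is passed in explicitly, so nrow(group_df) counts only the rows of that group.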

Original output

df %>%
  group_by(exp) %>%
  summarise(mean=mean(int),
            l1=nrow(.),
            ratio_mean=.[.$highlight, 'int']/mean)

# A tibble: 2 x 4
#   exp    mean    l1 ratio_mean$    NA
#   <fct> <dbl> <int>       <dbl> <dbl>
# 1 a       3       9        1.67   2  
# 2 b       2.5     9        1      1.2

Thanks, I somehow didn't realise that within summarise I can directly access the columns... :-) Maybe it was too early today ;-) — drmariod, Aug 17 '18 at 09:34
