I noticed that when column indices are supplied to dplyr::summarize_at, the columns to be summarized are determined after excluding the grouping column(s). I wonder whether this is intended, because under this design the correct column index depends on whether the columns to summarise are positioned before or after the grouping columns.

Here's an example:

library(dplyr)
data("mtcars")

# grouping column after summarise columns
mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
#   gear     disp       hp
#  <dbl>    <dbl>    <dbl>
#1     3 326.3000 176.1333
#2     4 123.0167  89.5000
#3     5 202.4800 195.6000

# grouping columns before summarise columns
mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
## A tibble: 3 x 3
#    cyl        hp     drat
#  <dbl>     <dbl>    <dbl>
#1     4  82.63636 4.070909
#2     6 122.28571 3.585714
#3     8 209.21429 3.229286

# no grouping columns
mtcars %>% summarise_at(3:4, mean)
#      disp       hp
#1 230.7219 146.6875

# actual third & fourth columns
names(mtcars)[3:4]
#[1] "disp" "hp"  

packageVersion("dplyr")
#[1] ‘0.7.2’

Notice how the summarised columns change depending on grouping and position of the grouping column.

Is this the same on other platforms? Is it a bug or a feature?

talat
  • Seems to be intended, as `summarise_at` calls `tbl_nongroup_vars`, which gets the tibble without grouping variables. `mean` is then applied on that set. – lukeA Aug 25 '17 at 14:28
  • @lukeA, thanks for checking that out! I have to say it feels quite counter-intuitive to me that I have to determine the index and then subtract the number of grouping columns that come before it. – talat Aug 25 '17 at 14:30
  • Instead of using `3:4`, it would be safer to do something like `vars(disp:hp)`. For example: `mtcars %>% group_by(cyl) %>% summarise_at(vars(disp:hp), mean)` – MrFlick Aug 25 '17 at 14:31
  • @MrFlick, yes, I agree, and I have almost never used the indexing option. But since it's there, I was surprised when I discovered this. – talat Aug 25 '17 at 14:32
  • It is still possible to use an index: `mtcars %>% group_by(cyl) %>% summarise_at(.vars = colnames(.)[3:4] , mean)`. Anyway, @docendodiscimus, thanks for pointing this out, because even if this feature is intentional, the documentation doesn't explicitly explain it, and in my case it could be a source of errors. – GoGonzo Sep 20 '17 at 06:46
  • Perhaps this was a bug that has since been fixed (or created). I tried duplicating the problem by running the given code, but the columns are not different. But my package version is 0.5.0. – James Theobald Sep 27 '17 at 20:04
  • I am not able to reproduce different columns after grouping when selecting by column index. I am using dplyr version 0.5.0. – Sowmya S. Manian Oct 19 '17 at 19:05
  • @SowmyaS.Manian that version is outdated – talat Oct 19 '17 at 19:07
  • I will check with the updated one. Although if it does exist, it should be a bug. – Sowmya S. Manian Oct 19 '17 at 19:10
  • @Gonzo I think your comment would make a nice answer for this post, as it's one of the top-scored unanswered R questions. – moodymudskipper Nov 24 '17 at 17:57
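
The index shift discussed in the comments can be reproduced in base R without dplyr. This is only a sketch of the observed behavior, assuming (as lukeA's comment suggests) that positions are resolved against the non-grouping columns; `resolve_excluding_groups` is a hypothetical helper written for illustration, not a dplyr function.

```r
# Column names of the built-in mtcars data set
all_cols <- names(mtcars)

# Hypothetical helper (not part of dplyr): resolve positions against
# the columns that remain once the grouping columns are dropped
resolve_excluding_groups <- function(cols, groups, pos) {
  setdiff(cols, groups)[pos]
}

resolve_excluding_groups(all_cols, "gear", 3:4)  # "disp" "hp"
resolve_excluding_groups(all_cols, "cyl", 3:4)   # "hp" "drat"
all_cols[3:4]                                    # "disp" "hp"
```

Because `cyl` sits before columns 3 and 4, dropping it shifts every later column one position to the left, which matches the `hp`/`drat` result in the question. `gear` comes after them, so nothing shifts.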

2 Answers

With dplyr version 0.7.5, this behavior can no longer be reproduced:

  library(dplyr)
  mtcars %>% group_by(gear) %>% summarise_at(3:4, mean)
  # # A tibble: 3 x 3
  #    gear  disp    hp
  #   <dbl> <dbl> <dbl>
  # 1     3  326. 176. 
  # 2     4  123.  89.5
  # 3     5  202. 196. 

  mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)
  # # A tibble: 3 x 3
  #     cyl  disp    hp
  #   <dbl> <dbl> <dbl>
  # 1     4  105.  82.6
  # 2     6  183. 122. 
  # 3     8  353. 209. 
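
Since the behavior has been reported differently for dplyr 0.5.0, 0.7.2 and 0.7.5, it may be worth checking the installed version directly. The following is a minimal sketch, not an official test; `counts_grouping_cols` is a name chosen here for illustration.

```r
library(dplyr)

# Summarise by position with a grouping column that precedes columns 3:4
res <- mtcars %>% group_by(cyl) %>% summarise_at(3:4, mean)

# TRUE if positions were resolved against the full column order
# (grouping columns included), FALSE if cyl was excluded first
counts_grouping_cols <- identical(names(res), c("cyl", names(mtcars)[3:4]))

packageVersion("dplyr")
counts_grouping_cols
```

No expected output is shown on purpose, because the result depends on the installed dplyr version.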
moodymudskipper

@docendodiscimus, thanks for pointing this out: even if this behavior is intentional, the documentation doesn't explain it explicitly, and in my case it could have been a source of errors. I had actually worked around this problem before answering another question, and my comment above handles it properly using the same logic.

For now, one solution is to supply column names instead of indices. Indices can still be used by adding a few characters, `.vars = names(.)[3:4]` (or `colnames(.)[3:4]`), as below:

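# look up the names at positions 3:4 in the full column order
mtcars %>% 
  group_by(cyl) %>% 
  summarise_at(.vars = colnames(.)[3:4], mean)

# equivalent, using names() instead of colnames()
mtcars %>% 
  group_by(cyl) %>% 
  summarise_at(.vars = names(.)[3:4], mean)
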
## A tibble: 3 x 3
#    cyl     disp        hp
#  <dbl>    <dbl>     <dbl>
#1     4 105.1364  82.63636
#2     6 183.3143 122.28571
#3     8 353.1000 209.21429
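
If this pattern is needed repeatedly, the same idea could be wrapped in a small function. This is only a sketch: `summarise_at_pos` is a hypothetical helper, not part of dplyr. It assumes that the positions should refer to the full column order and that none of them points at a grouping column.

```r
library(dplyr)

# Hypothetical helper (not part of dplyr): translate positions into
# column names using the full column order, grouping columns included,
# then pass those names to summarise_at()
summarise_at_pos <- function(.data, pos, .funs) {
  cols <- names(.data)[pos]
  summarise_at(.data, cols, .funs)
}

res <- mtcars %>%
  group_by(cyl) %>%
  summarise_at_pos(3:4, mean)

names(res)  # "cyl" "disp" "hp"
```

Because the helper passes names rather than positions to `summarise_at`, the result should not depend on how a given dplyr version counts grouping columns.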
GoGonzo