67

How do I tell group_by to group the data by all columns except a given one?

With aggregate, it would be aggregate(x ~ ., ...).

I tried group_by(data, -x), but that groups by the negative-of-x (i.e. the same as grouping by x).

Roman Cheplyaka
  • 37,738
  • 7
  • 72
  • 121
  • 1
    http://stackoverflow.com/a/28182288/4604054 is one way to do this, its quite long winded in comparison to aggregate. – zacdav Aug 27 '16 at 12:41

3 Answers3

100

dplyr version 1.0+

In dplyr 1.0.0 coming up, the _at functions are falling into the superseded lifecycle (i.e. while they remain in dplyr for the foreseeable future, there are now better alternatives that are more actively developed). The new way to accomplish this is via the across function:

df %>%
  group_by(across(c(-hp)))

dplyr v 0.7+

A small update on this question because I stumbled across this myself and found an elegant solution with current version of dplyr (0.7.4): Inside group_by_at(), you can supply the names of columns the same way as in the select() function using vars(). This enables us to group by everything but one column (hp in this example) by writing:

library(dplyr)
df <- as_tibble(mtcars, rownames = "car")
df %>% group_by_at(vars(-hp))
Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
Jannik Buhr
  • 1,788
  • 2
  • 11
  • 11
  • 4
    You may even supply several columns to ignore: `df %>% group_by_at(vars(-hp, -cyl))` without needing to use the c() construct. Super nice! – Lionel Trebuchon Jun 03 '19 at 13:43
  • 5
    it even works `group_by(across(!hp)`. Would be great if this could be marked as the answer now. – s_pike Oct 09 '20 at 06:18
40

Building on the @eipi10's dplyr 0.7.0 edit, group_by_at appears to be the right function for this job. However, if you are simply looking to omit column "x", then you can use:

new2.0 <- dat %>%
  group_by_at(vars(-x)) %>%
  summarize(mean_value = mean(value))

Using @eipi10's example data:

# Fake data
set.seed(492)
dat <- data.frame(value = rnorm(1000),
             g1 = sample(LETTERS, 1000, replace = TRUE),
             g2 = sample(letters, 1000, replace = TRUE),
             g3 = sample(1:10, replace = TRUE),
             other = sample(c("red", "green", "black"), 1000, replace = TRUE))

new <- dat %>% 
  group_by_at(names(dat)[-grep("value", names(dat))]) %>%
  summarise(meanValue = mean(value))


new2.0 <- dat %>% 
  group_by_at(vars(-value)) %>% 
  summarize(meanValue = mean(value))

identical(new, new2.0)
# [1] TRUE
Alex M
  • 2,756
  • 7
  • 29
  • 35
ZS27
  • 451
  • 4
  • 7
35

You can do this using standard evaluation (group_by_ instead of group_by):

# Fake data
set.seed(492)
dat = data.frame(value=rnorm(1000), g1=sample(LETTERS,1000,replace=TRUE),
                 g2=sample(letters,1000,replace=TRUE), g3=sample(1:10, replace=TRUE),
                 other=sample(c("red","green","black"),1000,replace=TRUE))

dat %>% group_by_(.dots=names(dat)[-grep("value", names(dat))]) %>%
  summarise(meanValue=mean(value))
       g1     g2    g3  other   meanValue
   <fctr> <fctr> <int> <fctr>       <dbl>
1       A      a     2  green  0.89281475
2       A      b     2    red -0.03558775
3       A      b     5  black -1.79184218
4       A      c    10  black  0.17518610
5       A      e     5  black  0.25830392
...

See this vignette for more on standard vs. non-standard evaluation in dplyr.

UPDATE for dplyr 0.7.0

To address @ÖmerAn's comment: It looks like group_by_at is the way to go in dplyr 0.7.0 (someone please correct me if I'm wrong about this). For example:

dat %>% 
  group_by_at(setdiff(names(dat), "value")) %>%
  summarise(meanValue=mean(value))
# Groups:   g1, g2, g3 [?]
       g1     g2    g3  other   meanValue
   <fctr> <fctr> <int> <fctr>       <dbl>
 1      A      a     2  green  0.89281475
 2      A      b     2    red -0.03558775
 3      A      b     5  black -1.79184218
 4      A      c    10  black  0.17518610
 5      A      e     5  black  0.25830392
 6      A      e     5    red -0.81879788
 7      A      e     7  green  0.30836054
 8      A      f     2  green  0.05537047
 9      A      g     1  black  1.00156405
10      A      g    10  black  1.26884303
# ... with 949 more rows

Let's confirm both methods give the same output (in dplyr 0.7.0):

new = dat %>% 
  group_by_at(setdiff(names(dat), "value")) %>%
  summarise(meanValue=mean(value))

old = dat %>% 
  group_by_(.dots=names(dat)[-grep("value", names(dat))]) %>%
  summarise(meanValue=mean(value))

identical(old, new)
# [1] TRUE
Ömer An
  • 600
  • 5
  • 16
eipi10
  • 91,525
  • 24
  • 209
  • 285
  • 13
    Instead of `names(dat)[-grep("value", names(dat))]`, you can also use `setdiff(names(dat), "value")` – Jaap Aug 28 '16 at 17:16
  • 2
    Although it's risker, I guess you could even select by position: `names(dat)[-1]`. We're golfing, right? :) – eipi10 Aug 28 '16 at 17:30
  • Certainly also a valid option :-) – Jaap Aug 28 '16 at 20:23
  • Also, if you want to summarise all columns at once use `dplyr::summarise_each(funs(mean))` – Boern Apr 24 '17 at 08:28
  • How to do this in the newest version of `dplyr 0.7.0` which claims the SE versions of the verbs are now deprecated? https://github.com/tidyverse/dplyr/releases – Ömer An Jun 22 '17 at 06:54
  • As @eipi10 pointed out, is there an advantage to `group_by_at` over chaining `select(-value) %>% group_by()`? I thought group_by_at might keep the excluded column, but it drops the excluded column. – CPak Jun 22 '17 at 20:23