dplyr group_by a large number of variables

Question

Sometimes one has a number of variables that all share the same grouping, in particular after doing a gather on some variables, for example:

  x0     x1    x2 variable     value
1  1   Male Green        1 0.1803306
2  1   Male Green        2 0.5619410
3  1   Male Green        3 0.9905186
4  2 Female  Blue        1 0.1549419
5  2 Female  Blue        2 0.6917326
6  2 Female  Blue        3 0.6509738

In such a case, I'd like to compute a grouped summary statistic (say, group_by(x0) %>% summarize(sum(value))) while preserving all the ID variables given by the first columns. One way is to do group_by(x0, x1, x2) but this becomes a little messy if there are a large number of ID variables, and group_by doesn't seem to work with the functions from select, so I can't do group_by(starts_with("x")). How can I cleanly preserve all my ID variables post-summarize without typing out each variable name individually?

Oh, I think I see what you're saying, in which case your real problem is that your data is stored poorly. Try reading what the dplyr author suggests regarding "tidy data": https://www.jstatsoft.org/article/view/v059i10 - Frank, Jul 28 '16 at 16:09

score 2 · Accepted Answer · edited Jul 29 '16 at 04:48

This is not as clean as a built-in dplyr solution, but a workaround is to combine grep with the group_by_ function, whose .dots parameter accepts a character vector of column names:

# Group by every column whose name starts with "x", then sum value per group
df %>% 
     group_by_(.dots = grep("^x", names(df), value = TRUE)) %>% 
     summarize(s_value = sum(value))

# Source: local data frame [2 x 4]
# Groups: x0, x1 [?]

#     x0     x1     x2  s_value
#  <int> <fctr> <fctr>    <dbl>
#1     1   Male  Green 1.732790
#2     2 Female   Blue 1.497648

grep("^x", ...) does the same job as starts_with("x"), except that we have to pass the names of the data frame explicitly and set the value parameter to TRUE, so that it returns the character vector c("x0", "x1", "x2") rather than column indices. That vector is what group_by_ receives through .dots.
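
For readers on a newer dplyr, here is a minimal sketch of the same grouping without group_by_, which was deprecated along with the other underscored verbs in dplyr 0.7.0. It assumes dplyr 1.0.0 or later, where across() lets the select helpers be used inside group_by. The data frame is reconstructed from the table in the question, and .groups = "drop" is a choice made here, not part of the original answer.

```r
library(dplyr)

# Example data reconstructed from the table in the question
df <- data.frame(
  x0 = c(1L, 1L, 1L, 2L, 2L, 2L),
  x1 = c("Male", "Male", "Male", "Female", "Female", "Female"),
  x2 = c("Green", "Green", "Green", "Blue", "Blue", "Blue"),
  variable = rep(1:3, times = 2),
  value = c(0.1803306, 0.5619410, 0.9905186,
            0.1549419, 0.6917326, 0.6509738)
)

# across() evaluates the tidyselect helper, so every column whose name
# starts with "x" becomes a grouping variable (assumes dplyr >= 1.0.0)
result <- df %>%
  group_by(across(starts_with("x"))) %>%
  summarize(s_value = sum(value), .groups = "drop")

print(result)
```

As with the grep version, all of x0, x1 and x2 are kept in the summarized result, with one row per combination of the ID variables.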
