
Here's a simple example to illustrate the issue:

library(data.table)
dt = data.table(a = c(1,1,2,2), b = 1:2)

dt[, c := cumsum(a), by = b][, d := cumsum(a), by = c]
#   a b c d
#1: 1 1 1 1
#2: 1 2 1 2
#3: 2 1 3 2
#4: 2 2 3 4

When I attempt the same in dplyr, it fails because the first group_by is persistent, so the second mutate is grouped by both b and c:

df = data.frame(a = c(1,1,2,2), b = 1:2)

df %.% group_by(b) %.% mutate(c = cumsum(a)) %.%
       group_by(c) %.% mutate(d = cumsum(a))
#  a b c d
#1 1 1 1 1
#2 1 2 1 1
#3 2 1 3 2
#4 2 2 3 2

Is this a bug or a feature? If it's a feature, then how would one replicate the data.table solution in a single statement?

eddi
  • Actually, with newer `dplyr` versions your code will work correctly because `dplyr` drops aggregation level per each `mutate`/`summarise` operation. – David Arenburg Apr 20 '15 at 12:10

1 Answer


Try this:

> df %>% group_by(b) %>% mutate(c = cumsum(a)) %>%
+        group_by(c) %>% mutate(d = cumsum(a))
Source: local data frame [4 x 4]
Groups: c

  a b c d
1 1 1 1 1
2 1 2 1 2
3 2 1 3 2
4 2 2 3 4

Update

With newer versions of dplyr, use %>% rather than %.%; ungroup is no longer needed (as per David Arenburg's comment).
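For readers on a current dplyr release, the same chain can also be written with an explicit `ungroup()` between the two steps. This is only a sketch: the `ungroup()` calls are optional here and simply make the regrouping visible, and `res` is a name introduced for illustration.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 2, 2), b = 1:2)

res <- df %>%
  group_by(b) %>%
  mutate(c = cumsum(a)) %>%   # running sum of a within each b
  ungroup() %>%               # drop the grouping by b explicitly
  group_by(c) %>%
  mutate(d = cumsum(a)) %>%   # running sum of a within each c
  ungroup()                   # return an ungrouped result

res$c  # 1 1 3 3
res$d  # 1 2 2 4, matching the data.table result in the question
```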

G. Grothendieck
  • thanks, a sideways related question - any idea why `df %.% group_by(b) %.% summarise(cumsum(a))` doesn't work (and how to make that work)? – eddi Feb 12 '14 at 19:03
  • Use `mutate` like this: `df %.% group_by(b) %.% mutate(cumsum = cumsum(a))` – G. Grothendieck Feb 12 '14 at 19:06
  • thanks, I guess a better example of the issue is how would one replicate `dt[, rep(a, 3), by = b]`? – eddi Feb 12 '14 at 19:08
  • @eddi Not sure if this is what you are asking but ... mutate produces a variable as long as the data. summarise produces a variable as long as the number of groups. When I tried your code with a data.frame or a data.table as input I got a different result in both 0.1.1 and the development version. See this [issue on github](https://github.com/hadley/dplyr/issues/258) – Vincent Feb 12 '14 at 19:28
  • @Vincent I added a new question about it - http://stackoverflow.com/q/21737815/817778 – eddi Feb 12 '14 at 19:34
  • You can also use `group_by(c, add = F)` – hadley Feb 12 '14 at 22:14
  • And it's a deliberate feature - I think it makes the most common case easier with the known tradeoff that it makes some things harder to express. – hadley Feb 12 '14 at 22:15
  • Actually you don't need `ungroup` here at all. This question/answer is outdated. At least in its current format. – David Arenburg Apr 20 '15 at 12:11
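
The difference between replacing the grouping and adding to it, as discussed in the comments above, can be sketched as follows. This assumes dplyr 1.0.0 or later, where the argument is spelled `.add` (earlier releases used `add`) and defaults to `FALSE`; the names `step1`, `replaced`, and `added` are illustrative only.

```r
library(dplyr)

df <- data.frame(a = c(1, 1, 2, 2), b = 1:2)

# First step: c is the running sum of a within each b
step1 <- df %>% group_by(b) %>% mutate(c = cumsum(a))

# Default (.add = FALSE): the grouping by b is replaced by a grouping by c
replaced <- step1 %>% group_by(c) %>% mutate(d = cumsum(a))

# .add = TRUE: c is added to the existing grouping, so the groups are (b, c)
added <- step1 %>% group_by(c, .add = TRUE) %>% mutate(d = cumsum(a))

group_vars(replaced)  # "c"
group_vars(added)     # "b" "c"
replaced$d            # 1 2 2 4 (the data.table result)
added$d               # 1 1 2 2 (the result shown in the question)
```

With `.add = TRUE` every (b, c) combination holds a single row, so each cumulative sum is just that row's value of `a`, which reproduces the output the question reports.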