grouped operations that result in length not equal to 1 or length of group in dplyr

Question

I'm not sure which function to use to do the following:

library(data.table)
dt = data.table(a = 1:4, b = 1:2)

dt[, rep(a[1], 3), by = b]
#   b V1
#1: 1  1
#2: 1  1
#3: 1  1
#4: 2  2
#5: 2  2
#6: 2  2

Both summarise and mutate are unhappy with this length:

library(dplyr)
df = data.frame(a = 1:4, b = 1:2)

df %.% group_by(b) %.% summarise(rep(a[1], 3))
#Error: expecting a single value

df %.% group_by(b) %.% mutate(rep(a[1], 3))
#Error: incompatible size (3), expecting 2 (the group size) or 1

Don't know if it helps but using your `dplyr` code with a `data.table` works and with `plyr` you can do that too with a `data.frame`. — dickoa, Feb 12 '14 at 21:51
@dickoa thanks, that's interesting (fwiw this is mostly just an exercise for me to understand how to use `dplyr` - I don't really see the point of using it with a `data.table`); sounds like a bug in `summarise` then — eddi, Feb 12 '14 at 21:56
+1 This is an interesting difference; hopefully the final solution allows arbitrary return lengths for any groups. — BrodieG, Feb 13 '14 at 01:33
In this case `df %>% group_by(b) %>% slice(rep(1, 3))` works fine. For rowwise operations, where each row returns an arbitrary number of values, you can use the `df %>% mutate(new = map(old, f)) %>% unnest()` idiom. — Axeman, Mar 29 '17 at 07:38

score 13 · Accepted Answer · answered Aug 14 '14 at 15:12

In dplyr version 0.2 you can do this using the `do()` function:

df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
#Source: local data frame [6 x 2]
#Groups: b
#
#  b a
#1 1 1
#2 1 1
#3 1 1
#4 2 2
#5 2 2
#6 2 2
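
For readers on a much later dplyr, the same result can probably be had without `do()`. The following is only a sketch and assumes dplyr 1.1.0 or newer, where `reframe()` was introduced; that is not the version this answer was written against. The name `res` is just an illustrative variable.

```r
library(dplyr)  # assumption: dplyr >= 1.1.0, which provides reframe()

df <- data.frame(a = 1:4, b = 1:2)

# reframe() allows each group to return any number of rows,
# here three copies of the first value of `a` in the group
res <- df %>%
  group_by(b) %>%
  reframe(a = rep(a[1], 3))

res
```

Unlike `summarise()`, `reframe()` always returns an ungrouped result, so no trailing `ungroup()` is needed.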

talat

+1 for showing me what `do` can do (though note comments in my "answer") – BrodieG Nov 06 '14 at 14:14

BrodieG · Answer 2 · 2014-11-06T14:50:33.277

While @beginneR's answer does work, it doesn't seem to be a real substitute for the data.table behavior. Consider:

library(data.table)
library(dplyr)
library(microbenchmark)

df <- data.frame(a = 1, b = rep(1:1e4, 2))
dt <- data.table(df)
# time both approaches, 5 runs each
microbenchmark(times=5,
  dt[, rep(a[1], 3), by = b],
  df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
)

This shows the dplyr implementation to be more than 200x slower (compare the medians):

Unit: milliseconds
                                                      expr        min         lq     median         uq
                                dt[, rep(a[1], 3), by = b]   13.14318   13.70248   14.60524   15.26676
 df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3))) 3269.40731 3359.11614 3583.19430 3736.67162

Maybe there is a better way to do this with `do` that doesn't require calling `data.frame` for each group? Also, the syntax is a bit involved for something that is very simple in data.table.

Otherwise, as per Hadley's issue link, it seems this is expected to be implemented in dplyr in 3.1, which looks to be the next release.
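
One possible way to avoid building a `data.frame` per group, following Axeman's comment on the question, is to repeat row indices with `slice()`. This is a minimal sketch on the small data frame from the question, not a benchmarked replacement; `res` is an illustrative name, and the approach only matches this particular case because the desired output is whole rows repeated.

```r
library(dplyr)

df <- data.frame(a = 1:4, b = 1:2)

# Select row 1 of each group three times instead of
# constructing a data.frame inside do() for every group
res <- df %>%
  group_by(b) %>%
  slice(rep(1, 3)) %>%
  ungroup()

res
```

Note that `slice()` keeps every column of the repeated row, so it does not generalize to arbitrary per-group computations the way `do()` does.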