I'm not sure which dplyr function to use to reproduce the following data.table operation, where each group returns more rows than it contains:

library(data.table)
dt = data.table(a = 1:4, b = 1:2)

dt[, rep(a[1], 3), by = b]
#   b V1
#1: 1  1
#2: 1  1
#3: 1  1
#4: 2  2
#5: 2  2
#6: 2  2

Both summarise and mutate are unhappy with this length:

library(dplyr)
df = data.frame(a = 1:4, b = 1:2)

df %.% group_by(b) %.% summarise(rep(a[1], 3))
#Error: expecting a single value

df %.% group_by(b) %.% mutate(rep(a[1], 3))
#Error: incompatible size (3), expecting 2 (the group size) or 1
eddi
  • Don't know if it helps but using your `dplyr` code with a `data.table` works and with `plyr` you can do that too with a `data.frame`. – dickoa Feb 12 '14 at 21:51
  • @dickoa thanks, that's interesting (fwiw this is mostly just an exercise for me to understand how to use `dplyr` - I don't really see the point of using it with a `data.table`); sounds like a bug in `summarise` then – eddi Feb 12 '14 at 21:56
  • See https://github.com/hadley/dplyr/issues/154 – hadley Feb 12 '14 at 22:13
  • +1 This is an interesting difference; hopefully the final solution allows arbitrary return lengths for any groups. – BrodieG Feb 13 '14 at 01:33
  • In this case `df %>% group_by(b) %>% slice(rep(1, 3))` works fine. For rowwise operations, where each row returns an arbitrary number of values, you can use the `df %>% mutate(new = map(old, f)) %>% unnest()` idiom. – Axeman Mar 29 '17 at 07:38
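The `slice()` suggestion in the last comment can be sketched as follows. This is a minimal sketch that assumes a dplyr version with the `%>%` pipe, and it only covers this particular case of repeating the first row of each group; the name `res` is just for illustration.

```r
library(dplyr)

df <- data.frame(a = 1:4, b = 1:2)

# Within each group of b, pick row 1 three times.
# slice() accepts repeated positions, so each group yields 3 rows.
res <- df %>%
  group_by(b) %>%
  slice(rep(1, 3)) %>%
  ungroup()

print(res)
```

Unlike the data.table call, this returns the full rows (both `a` and `b`), because `slice()` selects rows rather than computing a new column.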

2 Answers

In dplyr version 0.2 you could do this using the do operator:

df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
#Source: local data frame [6 x 2]
#Groups: b
#
#  b a
#1 1 1
#2 1 1
#3 1 1
#4 2 2
#5 2 2
#6 2 2
talat

While @beginneR's answer does work, it doesn't seem to be a real substitute for the data.table behavior. Consider:

library(data.table)
library(dplyr)
library(microbenchmark)

# 10,000 groups of two rows each
df <- data.frame(a = 1, b = rep(1:1e4, 2))
dt <- data.table(df)
microbenchmark(times=5,
  dt[, rep(a[1], 3), by = b],
  df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3)))
)

In this benchmark the dplyr implementation is more than 200x slower (median of about 3583 ms versus about 14.6 ms):

Unit: milliseconds
                                                      expr        min         lq     median         uq
                                dt[, rep(a[1], 3), by = b]   13.14318   13.70248   14.60524   15.26676
 df %>% group_by(b) %>% do(data.frame(a = rep(.$a[1], 3))) 3269.40731 3359.11614 3583.19430 3736.67162

Maybe there is a better way to do this with do that doesn't require calling data.frame once per group? Also, the syntax is a bit involved for something that is very simple in data.table.

Otherwise, per the issue Hadley linked in the comments, it seems this is expected to be implemented in dplyr in 3.1, which looks to be the next release.
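For readers on a much newer dplyr, arbitrary-length results per group can be written with `reframe()`. This is a sketch under the assumption that dplyr 1.1.0 or later is installed; it is not something that was available when this thread was written, and `res` is just an illustrative name.

```r
library(dplyr)  # assumes dplyr >= 1.1.0, where reframe() exists

df <- data.frame(a = 1:4, b = 1:2)

# reframe() lets each group return any number of rows,
# here 3 copies of the first value of a per group.
res <- df %>%
  group_by(b) %>%
  reframe(a = rep(a[1], 3))

print(res)
```

The result is ungrouped and has the grouping column `b` first, followed by `a`, which matches the shape of the data.table output in the question.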

BrodieG