
Using data.table I can do the following:

library(data.table)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))  # a (length 2) is recycled against b (length 4)
#   a  b
#1: 1  1
#2: 2  2
#3: 1 NA
#4: 2 NA

dt[, b := b[1], by = a]  # := fills b by reference within each group, keeping row order
#   a b
#1: 1 1
#2: 2 2
#3: 1 1
#4: 2 2

Attempting the same operation in dplyr, however, reorders the data (it comes back sorted by a):

library(dplyr)
dt = data.table(a = 1:2, b = c(1,2,NA,NA))  # recreate dt, since := above modified it in place
dt %.% group_by(a) %.% mutate(b = b[1])
#  a b
#1 1 1
#2 1 1
#3 2 2
#4 2 2

(As an aside, the above also sorts the original dt, which is somewhat confusing to me given dplyr's philosophy of not modifying in place; I'm guessing that's a bug in how dplyr interfaces with data.table.)

What's the dplyr way of achieving the above?
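The best I've come up with so far is to tag each row with its original position and sort back at the end, which feels clunky. (This is only a sketch: it assumes ungroup(), arrange() and select() work on data tables the same way they do on data frames, and idx is a throwaway helper column.)

dt = data.table(a = 1:2, b = c(1,2,NA,NA))
dt[, idx := .I]  # .I is data.table's row counter; remember the original order

dt %.% group_by(a) %.% mutate(b = b[1]) %.%
    ungroup() %.% arrange(idx) %.% select(a, b)
# should restore the original row order: 1 1 / 2 2 / 1 1 / 2 2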

eddi

1 Answer


In the current development version of dplyr (which will eventually become dplyr 0.2) the behaviour differs between data frames and data tables:

library(dplyr)
library(data.table)

df <- data.frame(a = 1:2, b = c(1,2,NA,NA))
dt <- data.table(df)

df %.% group_by(a) %.% mutate(b = b[1])

## Source: local data frame [4 x 2]
## Groups: a
## 
##   a b
## 1 1 1
## 2 2 2
## 3 1 1
## 4 2 2

dt %.% group_by(a) %.% mutate(b = b[1])

## Source: local data table [4 x 2]
## Groups: a
## 
##   a b
## 1 1 1
## 2 1 1
## 3 2 2
## 4 2 2

This happens because group_by() applied to a data.table automatically calls setkey(), on the assumption that the key will make future operations faster.
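You can see this side effect directly; here's a minimal sketch that just inspects the key with data.table's key() before and after grouping:

dt <- data.table(a = 1:2, b = c(1,2,NA,NA))
key(dt)
## NULL

grouped <- dt %.% group_by(a)
key(dt)
## [1] "a"

Since setkey() reorders by reference, this is also why the original dt ends up sorted.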

If there's a strong feeling that this is a bad default, I'm happy to change it.
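In the meantime, a sketch of a workaround if you need the original object untouched: group a copy instead, using data.table's copy() so that the setkey() happens on the duplicate.

dt <- data.table(a = 1:2, b = c(1,2,NA,NA))
res <- copy(dt) %.% group_by(a) %.% mutate(b = b[1])
key(dt)  # the original is untouched
## NULL

Note that res is still sorted by a; this only protects dt itself.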

hadley
  • thanks, I think it's pretty obvious that you want the same behavior between `data.frame` and `data.table`, and that the `data.frame` behavior is the correct one – eddi Mar 19 '14 at 20:51
  • @eddi counter argument: dplyr should take advantage of data.table's performance by default, where possible. – hadley Mar 19 '14 at 20:53
  • @hadley, the ideal version would be to allow for both `adhoc-by` and `setkey` during `group_by`. In case that design isn't possible, I'd go with `adhoc-by` personally, as with the recent internal optimisations in 1.9.0+, `adhoc-by` is incredibly fast. Plus, it has no overhead due to the `setkey` operation which'll have to reorder the entire data (think biggggg data). Also, you add a `copy` on top of it, which'll make both memory and time requirements higher. Considering all this, if you've to go with only one of the two, I'd go with `adhoc-by`. – Arun Mar 19 '14 at 20:57
  • I doubt you get speed gains by doing this - I'm fairly certain `dt[, b := b[1], by = a]` is faster than `setkey(dt, a); dt[, b := b[1], by = a]` always. But even if you did get performance gains, you should never compromise correctness for performance. – eddi Mar 19 '14 at 20:57
  • Maybe having both somehow would be nicer. For example, [Here's a more complex case](http://stackoverflow.com/questions/16878905/data-table-outer-join-by-group) where both `adhoc-by` and `keys` are used. – Arun Mar 19 '14 at 21:03
  • @Arun do you know what the `dplyr` analogue of that complex case is? – eddi Mar 19 '14 at 21:11
  • @eddi, not to my knowledge. – Arun Mar 19 '14 at 21:19
  • Unfortunately the design of `group_by()` makes it very difficult to add additional arguments that are passed on to individual methods. But I'll definitely change the default and think about how to make it an option. – hadley Mar 19 '14 at 21:37
  • @eddi I don't think dplyr can handle that well until we add non-materialised cartesian joins – hadley Mar 19 '14 at 21:40
  • @hadley Yes if `group_by()` just uses ad-hoc by (i.e. just `by=` and no need for `setkey`) that'll be better. – Matt Dowle Mar 20 '14 at 15:20