Assume I have a data.table containing some baseball players:

library(plyr)
library(data.table)

bdt <- as.data.table(baseball)

For each group (given by player 'id'), I want to select rows corresponding to the maximum number of games 'g'. This is straightforward in plyr:

ddply(baseball, "id", subset, g == max(g))

What's the equivalent code for data.table?

I tried:

setkey(bdt, "id") 
bdt[g == max(g)]  # only one row
bdt[g == max(g), by = id]  # Error: 'by' or 'keyby' is supplied but not j
bdt[, .SD[g == max(g)]] # only one row

This works:

bdt[, .SD[g == max(g)], by = id] 

But it's only 30% faster than plyr, suggesting it's probably not idiomatic.

Henrik
hadley
  • Wow, that is slow, but if you use "year" in place of ".SD"... I'm getting .01, 1.58, 2.39 user time for year, .SD, plyr, respectively. – Frank May 15 '13 at 20:11
  • @Frank but I want the whole data frame, not just the year. I'll clarify the question. – hadley May 15 '13 at 20:13

1 Answer

Here's the fast data.table way:

bdt[bdt[, .I[g == max(g)], by = id]$V1]

This avoids constructing .SD, which is the bottleneck in your expressions.

Edit: Actually, the main reason the OP's code is slow is not just that it uses .SD, but how it uses it: it calls `[.data.table` on .SD, which at the moment has a huge overhead, so running it in a loop (which is what happens when one does a by) accumulates a very large penalty.
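
To make the two steps of the `.I` approach visible, here is a minimal sketch on a made-up table (the `toy` data, its values, and the names `idx` and `res` are assumptions for illustration, not part of the `baseball` data). Within each group, `.I` holds that group's row numbers in the original table, so `.I[g == max(g)]` returns the positions of the maximal rows, collected under the default column name `V1`.

```r
library(data.table)

# Hypothetical toy data; ids and values are made up for illustration.
toy <- data.table(
  id   = c("a", "a", "b", "b", "b", "c"),
  year = c(2001L, 2002L, 2001L, 2002L, 2003L, 2001L),
  g    = c(10L, 20L, 5L, 7L, 7L, 3L)
)

# Inner step: per-group row numbers (into toy) where g equals the group max.
# An unnamed j result gets the default column name V1.
idx <- toy[, .I[g == max(g)], by = id]

# Outer step: subset the original table once, using those row numbers.
res <- toy[idx$V1]

print(idx)
print(res)
```

Ties are kept: player "b" reaches the maximum in two rows, so both appear in `res`, which matches what `subset(g == max(g))` does per group.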

eddi
  • +1 I'm betting that Hadley wants to do this somewhat programmatically, in which case he'd want to use this syntax, `bdt[bdt[, .I[g == max(g)], by = id][,V1]]` right? – joran May 15 '13 at 20:23
  • @joran I'm constructing the call manually, so it doesn't really matter – hadley May 15 '13 at 20:24
  • Eventually the original approach will be optimized. See [FR 2330](https://r-forge.r-project.org/tracker/index.php?func=detail&aid=2330&group_id=240&atid=978), "Optimize `.SD[i]` query to keep the elegance but make it faster unchanged". – mnel May 15 '13 at 23:05
  • That issue has since moved from R-Forge to GitHub: [#613](https://github.com/Rdatatable/data.table/issues/613) – Matt Dowle Feb 23 '16 at 19:03
  • If I add `verbose = TRUE` to the inner frame, I see `GForce FALSE`, yet it's still faster than something like `bdt[bdt[, .(g=max(g)), by=id], on=c("id","g")]`, though I don't know if that would always be the case. – Alexis Jul 06 '19 at 15:40
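
The join mentioned in the last comment can be checked against the `.I` approach on a small table. This is only a sketch: the `toy` data and the names `via_index` and `via_join` are made up for illustration, and it compares results, not speed, so it says nothing about which form is faster on real data.

```r
library(data.table)

# Hypothetical toy data; ids and values are made up for illustration.
toy <- data.table(
  id   = c("a", "a", "b", "b", "b", "c"),
  year = c(2001L, 2002L, 2001L, 2002L, 2003L, 2001L),
  g    = c(10L, 20L, 5L, 7L, 7L, 3L)
)

# Approach from the answer: collect row numbers per group, then subset once.
via_index <- toy[toy[, .I[g == max(g)], by = id]$V1]

# Approach from the comment: compute each group's max, then join back on (id, g).
via_join <- toy[toy[, .(g = max(g)), by = id], on = c("id", "g")]

# Align column order before comparing, in case the join reorders columns.
setcolorder(via_join, names(toy))

# TRUE when both tables contain the same rows, ignoring row order.
same_rows <- fsetequal(via_index, via_join)
print(same_rows)
```

To compare timings on the real data, both expressions could be wrapped in `system.time()` with `bdt` in place of `toy`.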