1

I have a data.table whose rows consist of id, month, date and value. I would like to select the last observation for each (id,month) pair. Currently I do the following which is slow. Ideas for a better implementation?

require(data.table)
d = data.table(id = rep(1:3, each = 12*3), mo = rep(1:12,each=3), day = 1:3, key=c("id", "mo", "day")) #this is a data table with 3 ids, 12 months and 3 days per month
system.time(rslt <- d[,tail(.SD,1),by=list(id, mo)])
#    user  system elapsed 
#   0.019   0.000   0.020 

There should be a better way to do this. NOTE: the date at the end of the month could vary by id so i can't simply do setkey(d, day); d[J(3)]. One potential solution I was thinking of is extracting all the (id, mo) pairs and doing:

setkey(d, id, mo)
p = unique(d)[,day:=NULL]
setkey(p, id, mo)
system.time(rslt2 <- d[p, mult="last"])
#    user  system elapsed 
#   0.001   0.000   0.001 
identical(rslt2, rslt)
#  [1] TRUE
Alex
  • 19,533
  • 37
  • 126
  • 195
  • @eddi: i dont' understand the syntax in the question you mention. In particular, evaulating any integer (such as `.I[.N]`) in `d[,some_integer]` simply would return `some_integer` no? – Alex Jan 15 '14 at 20:20
  • yep, that's correct, it returns the integer indices of the rows that you want, which you then extract from your `data.table` (notice that there are two `[`'s) – eddi Jan 15 '14 at 20:29
  • oh hah, that was silly of me.. missed the double `[`. truth. feel free to delete my q (or i can do this) – Alex Jan 15 '14 at 20:29
  • 2
    I think it's good to leave these duplicates as signposts – eddi Jan 15 '14 at 20:34

0 Answers0