This is a use case where we have timestamped data with an id column (e.g. multiple observations over time for each subject), and we want to use all the previous measurements to predict the last one in our dataset.

This is related to the question: How to select the first and last row within a grouping variable in a data frame?

Currently I'm working with the data.table package, which is very efficient at selecting the first or last row per group using the solution in the linked question.

When I try to select the first N_g-1 rows (where N_g is the number of rows in the current group), the query takes a very long time. Does anybody know of an efficient way to do something like that? Here's what I'm using currently:

firstn_elements <- dt[, .SD[1:(.N-1)], by=subject_id]
Bar
  • Try with `.I`, i.e. `dt[dt[, .I[1:(.N-1)], by = subject_id]$V1]` – akrun Jun 29 '16 at 17:31
  • [Related Q&A](http://stackoverflow.com/questions/16325641/in-r-is-it-possible-to-extact-the-first-2-rows-for-each-date-from-a-data-table) (with benchmarks of `.SD` vs. `.I` extractions). – Henrik Jun 29 '16 at 17:42
  • Very related Q&A, arguably a dupe: http://stackoverflow.com/q/16573995/ – Frank Jun 29 '16 at 17:55

1 Answer

We can do this much faster by using .I to extract the row indices for each group and then subsetting dt once with them.

dt[dt[, .I[1:(.N-1)], by = subject_id]$V1]
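To see what the two steps do, here is a minimal sketch on a toy table. The data values and the intermediate name `idx` are made up for illustration; only the `.I` expression comes from the line above.

```r
library(data.table)

# Toy data (hypothetical): three subjects with 3, 2 and 4 observations
dt <- data.table(
  subject_id = c(1, 1, 1, 2, 2, 3, 3, 3, 3),
  value      = c(10, 11, 12, 20, 21, 30, 31, 32, 33)
)

# Step 1: per group, collect the row numbers (positions in dt)
# of every row except the last one
idx <- dt[, .I[1:(.N - 1)], by = subject_id]$V1

# Step 2: subset the original table once with those row numbers
firstn_elements <- dt[idx]
print(firstn_elements)
```

The grouped step only returns an integer vector, so the expensive part (materialising a `.SD` subset for every group) is avoided, and the actual row extraction happens in a single vectorised subset.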

Benchmarks

set.seed(24)
dt <- data.table(subject_id = sample(1:100000, 1e7, replace=TRUE), value = rnorm(1e7))
system.time(dt[, .SD[1:(.N-1)], by=subject_id])
#  user  system elapsed 
# 45.89   17.92   65.00 
system.time(dt[dt[, .I[1:(.N-1)], by = subject_id]$V1])
#   user  system elapsed 
#   1.53    0.19    1.13 

Including @JoshO'Brien's method, which uses a negative index to drop the last row of each group

system.time(dt[dt[, -.I[.N], by = subject_id]$V1])
#  user  system elapsed 
#  0.69    0.04    0.55 
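One caveat worth checking on your own data: the variants above do not agree for groups that have only a single row. With `.N == 1`, `1:(.N-1)` evaluates to `c(1, 0)`, so that row is kept, whereas the negative-index version drops it. The sketch below uses a made-up toy table to show the difference; the `head(.I, -1L)` variant is my own addition, not something benchmarked above.

```r
library(data.table)

# Toy data (hypothetical): subject "b" has only one observation
dt <- data.table(subject_id = c("a", "a", "a", "b"),
                 value      = c(1, 2, 3, 4))

# 1:(.N - 1) becomes c(1, 0) for the single-row group, so its row is kept
keep_seq  <- dt[dt[, .I[1:(.N - 1)], by = subject_id]$V1]

# Negative index: always drops the last row, including a group's only row
keep_neg  <- dt[dt[, -.I[.N], by = subject_id]$V1]

# head(x, -1L) returns an empty vector for a single-row group,
# so it behaves like the negative-index version here
keep_head <- dt[dt[, head(.I, -1L), by = subject_id]$V1]

print(nrow(keep_seq))   # row count when the singleton row is kept
print(nrow(keep_neg))   # row count when the singleton row is dropped
```

For the prediction use case in the question, dropping the only observation of a subject is probably what you want, since there are no earlier measurements to predict from, but that is an assumption about the intent.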
akrun