R - Getting first N-1 rows per group

Question

This is a use case where we have timestamped data with an id (e.g. multiple observations over time for each subject), and we want to use all the previous measurements to predict the last one in our dataset.

This is related to the question: How to select the first and last row within a grouping variable in a data frame?

Currently I'm working with the data.table package, which is very efficient at selecting the first or last row per group using the solution in the linked question.

When I try to select the first N_g-1 rows (where N_g is the number of rows in the current group), the query takes a very long time. Does anybody know of an efficient way to do this? Here's what I'm using currently:

firstn_elements <- dt[, .SD[1:(.N-1)], by=subject_id]

Try with `.I`, i.e. `dt[dt[, .I[1:(.N-1)], by = subject_id]$V1]` — akrun, Jun 29 '16 at 17:31
[Related Q&A](http://stackoverflow.com/questions/16325641/in-r-is-it-possible-to-extact-the-first-2-rows-for-each-date-from-a-data-table) (with benchmarks of `.SD` vs. `.I` extractions). — Henrik, Jun 29 '16 at 17:42
Very related Q&A, arguably a dupe: http://stackoverflow.com/q/16573995/ — Frank, Jun 29 '16 at 17:55

akrun · Accepted Answer · 2016-06-29T17:41:54.993


We can do this faster by using .I to extract the row indices and then subsetting dt with them.

dt[dt[, .I[1:(.N-1)], by = subject_id]$V1]
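To make the two steps visible, here is a minimal sketch on a small toy table. The toy data and the names `idx` and `firstn` are made up for illustration and are not part of the original answer; it assumes the data.table package is installed.

```r
library(data.table)

# Toy data (hypothetical): subject 1 has 3 rows, subject 2 has 2 rows
dt <- data.table(subject_id = c(1, 1, 1, 2, 2),
                 value      = c(10, 20, 30, 40, 50))

# Inner call: within each group, .I holds the row numbers of dt.
# The j expression is unnamed, so the result column is called V1.
idx <- dt[, .I[1:(.N - 1)], by = subject_id]
print(idx)      # V1 is 1, 2 for subject 1 and 4 for subject 2

# Outer call: subset dt once with the collected row numbers
firstn <- dt[idx$V1]
print(firstn)   # rows 1, 2 and 4 of dt
```

The idea is that the grouped step only builds an integer vector of row numbers, and dt is then subset a single time instead of once per group.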

Benchmarks

set.seed(24)
dt <- data.table(subject_id = sample(1:100000, 1e7, replace=TRUE), value = rnorm(1e7))
system.time(dt[, .SD[1:(.N-1)], by=subject_id])
#  user  system elapsed 
# 45.89   17.92   65.00 
system.time(dt[dt[, .I[1:(.N-1)], by = subject_id]$V1])
#   user  system elapsed 
#   1.53    0.19    1.13

Including @JoshO'Brien's method

system.time(dt[dt[, -.I[.N], by = subject_id]$V1])
#  user  system elapsed 
#  0.69    0.04    0.55

edited Jun 29 '16 at 17:41

answered Jun 29 '16 at 17:32


Thank you @akrun, what does the $V1 do in this case? – Bar Jun 29 '16 at 17:35
@Bar The `dt[, .I[1:(.N-1)], by = subject_id]` call creates a 'V1' column because we didn't name the `.I[1:(.N-1)]` expression. Extract that column with `$V1`. – akrun Jun 29 '16 at 17:37
FWIW, this is a bit faster: `dt[dt[, -.I[.N], by = subject_id]$V1]` – Josh O'Brien Jun 29 '16 at 17:39
Thanks! Any idea why using `.SD` is so slow in the code that I wrote? – Bar Jun 29 '16 at 17:41
@Bar It would have the overhead of calling `[.data.table` for every group. – akrun Jun 29 '16 at 17:45
Probably better to use `head(.I, .N-1L)` or Josh's idea, since the `.N==1L` case would yield weird stuff with the version headlining this answer. – Frank Jun 29 '16 at 17:53
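
The `.N==1L` case mentioned in the last comment can be checked on a toy table. This is a sketch under the assumption that single-row groups should contribute no rows; the data and the names `risky`, `safe` and `dropped` are hypothetical and only serve the illustration.

```r
library(data.table)

# Toy data (hypothetical): subject 3 has only one row
dt <- data.table(subject_id = c(1, 1, 1, 2, 2, 3),
                 value      = c(10, 20, 30, 40, 50, 60))

# When .N == 1, 1:(.N - 1) is 1:0, i.e. c(1, 0). Index 0 is ignored,
# so the single row of subject 3 is kept instead of dropped.
risky <- dt[dt[, .I[1:(.N - 1)], by = subject_id]$V1]

# head(.I, .N - 1L) gives an empty vector when .N == 1,
# so subject 3 contributes no rows.
safe <- dt[dt[, head(.I, .N - 1L), by = subject_id]$V1]

# -.I[.N] collects the last row number of each group as a negative
# index, which removes exactly those rows from dt.
dropped <- dt[dt[, -.I[.N], by = subject_id]$V1]

print(risky$subject_id)    # 1 1 2 3
print(safe$subject_id)     # 1 1 2
print(dropped$subject_id)  # 1 1 2
```

On this toy data, `safe` and `dropped` select the same rows, while `risky` also returns the lone row of subject 3.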
