Subset a data.table given the row index for each group

Question

This seems like a trivial question that I can't seem to find a solution for:

Consider the two data.tables

library(data.table)
dt <- data.table(id = c(1,1,1,2,2,2),
                 val = c(10,20,30,10,20,30))

dt1 <- data.table(id = c(1,2),
                  V1 = c(2,1))

How do I subset dt, where dt1 tells me the row number (V1) of the grouped id I need to subset?

For example, here the result will be

#    id val
# 1:  1  20
# 2:  2  10

Update

A quick bit of benchmarking on the proposed solutions

library(data.table)
s <- 100000
set.seed(123)
dt <- data.table(id = rep(seq(1:s), each=10),
                 val = rnorm(n = s*10, 0, 1))

dt1 <- data.table(id = seq(1:s),
                  V1 = sample(1:10, s, replace=T))


library(microbenchmark)

microbenchmark(

  akrun = { dt[dt1, on='id'][, .SD[1:.N==V1] ,id] },

  david = { dt[dt1, val[i.V1], on = 'id', by = .EACHI] },

  symbolix = { dt[, id_seq := seq(1:.N), by=id][dt1, on=c(id_seq = "V1", "id") , nomatch=0] },

   times = 5

 )
#Unit: milliseconds
#     expr         min          lq        mean      median          uq         max neval
#    akrun 17809.51370 17887.89037 18005.32357 18043.80279 18130.78978 18154.62118     5
#    david    48.17367    53.76436    53.79004    54.69096    55.59657    56.72467     5
 #symbolix   507.67312   511.23492   562.59743   571.31160   579.61228   643.15525     5

David Arenburg · Accepted Answer · 2016-03-17T10:15:06.020

6

Another option is to use by = .EACHI in order to subset val while joing

dt[dt1, val[i.V1], on = 'id', by = .EACHI]
#    id V1
# 1:  1 20
# 2:  2 10

If you have more columns there, you could use .SD[i.V1] instead.

As a side note, in data.table v >= 1.9.8 the .SD[val] operation is scheduled to be fully optimized to use GForce- so hold tight.

edited Mar 17 '16 at 10:15

answered Mar 17 '16 at 10:00

David Arenburg

91,361
17
137
196

So is `.SD` already optimized in 1.9.7 or not yet? I'm confused. – talat Mar 17 '16 at 10:26
@docendodiscimus it is optimized for cases when you explicitly use a positive integer such as `.SD[2L]` but not optimized if you pass it as a variable as in `val <- 2L ; .SD[val]`. This should be fixed in v 1.9.8 (hopefully). – David Arenburg Mar 17 '16 at 10:28
1

very nice. I was struggling with a `.EACHI` solution so I'm glad to see this! – SymbolixAU Mar 17 '16 at 10:52

score 3 · Answer 2 · answered Mar 17 '16 at 08:09

3

One option would be to join on 'id' and then do the subset

dt[dt1, on='id'][, .SD[1:.N==V1] ,id][,V1:=NULL][]
#   id val
#1:  1  20
#2:  2  10

answered Mar 17 '16 at 08:09

akrun

874,273
37
540
662

1

On a similar theme, then, perhaps `dt[, id_seq := seq(1:.N), by=id][dt1, on=c(id_seq = "V1", "id") , nomatch=0]` (to avoid the `.SD` allocation) ? – SymbolixAU Mar 17 '16 at 08:48

Subset a data.table given the row index for each group

2 Answers2

Linked