How to subset data.table using keys

Question

I have a data.table and I'm trying to subset it so I get the lowest TEST_RESULT by ID and LAB_DT:

DT <- data.table::data.table(ID=c("1","1","1","2","2","3","3","3"),
                             LAB_DT=lubridate::as_date(c("1992-11-01",
                                                         "1992-11-01",
                                                         "1992-11-02",
                                                         "1992-11-04",
                                                         "1992-11-04",
                                                         "1992-11-06",
                                                         "1992-11-06",
                                                         "1992-11-08")),
                             TEST_RESULT=c(5.4,5.8,5.2,5.6,6,6,7,8))
setkeyv(DT,c("ID","LAB_DT","TEST_RESULT"))

   ID     LAB_DT TEST_RESULT
1:  1 1992-11-01         5.4
2:  1 1992-11-01         5.8
3:  1 1992-11-02         5.2
4:  2 1992-11-04         5.6
5:  2 1992-11-04         6.0
6:  3 1992-11-06         6.0
7:  3 1992-11-06         7.0
8:  3 1992-11-08         8.0

I have already successfully done this via my own method:

DT[,FIRST.LAB_DT:=0]
DT[, FIRST.LAB_DT := c(1L, FIRST.LAB_DT[-1]), by = .(ID,LAB_DT)]

   ID     LAB_DT TEST_RESULT FIRST.LAB_DT
1:  1 1992-11-01         5.4            1
2:  1 1992-11-01         5.8            0
3:  1 1992-11-02         5.2            1
4:  2 1992-11-04         5.6            1
5:  2 1992-11-04         6.0            0
6:  3 1992-11-06         6.0            1
7:  3 1992-11-06         7.0            0
8:  3 1992-11-08         8.0            1

DT[FIRST.LAB_DT==1,]

   ID     LAB_DT TEST_RESULT FIRST.LAB_DT
1:  1 1992-11-01         5.4            1
2:  1 1992-11-02         5.2            1
3:  2 1992-11-04         5.6            1
4:  3 1992-11-06         6.0            1
5:  3 1992-11-08         8.0            1

However, the actual data.table I'm working with has ~8e6 rows and I recently found that you can subset data.table using the keys in a much faster manner.

Would anyone know how to produce the same final output using keys?

Nothing to do with keys, but I think it will be the fastest way to get the result: `DT[DT[, .I[which.min(TEST_RESULT)], .(ID, LAB_DT)]$V1]` — IceCreamToucan, Sep 16 '19 at 20:41
@IceCreamToucan I second on this to be the the fastest. I think there was this question from *hadley* that asked about performance of subsetting in `data.table` and there were good discussions/answer on that thread. I'll see if I can find it. p.s. found it already. — M--, Sep 16 '19 at 21:33

score 1 · Answer 1 · answered Sep 16 '19 at 22:27

Since you report that your current approach works, your data is apparently sorted by increasing TEST_RESULT already. In this case, you can just drop duplicates to keep the first row per group:

unique(DT, by=c("ID", "LAB_DT"))

In general, if it is not sorted, you could roll from -Inf. Like your code, this selects only one row even if there are ties for the min value:

mDT = unique(DT[, .(ID, LAB_DT)])[, TEST_RESULT := -Inf]
DT[DT[mDT, on=names(mDT), roll=-Inf, which=TRUE]]

Finally, if you only have these three columns, you can just use the min function which is optimized for speed here (see ?GForce):

DT[, .(TEST_RESULT = min(TEST_RESULT)), by=.(ID, LAB_DT)]

TJ87 · Answer 2 · 2019-09-16T21:39:30.973

0

library(dplyr)
DT %>% 
group_by(ID,LAB_DT) %>% 
slice(which.min(TEST_RESULT))

edited Sep 16 '19 at 21:39

answered Sep 16 '19 at 21:30

TJ87

404
1
3
13

OP explicitly asked for a `data.table` solution; not sure how a `dplyr`-based solution helps here. – Maurits Evers Sep 16 '19 at 22:16

How to subset data.table using keys

2 Answers2