1

I have a data.table and I'm trying to subset it so I get the lowest TEST_RESULT by ID and LAB_DT:

DT <- data.table::data.table(ID=c("1","1","1","2","2","3","3","3"),
                             LAB_DT=lubridate::as_date(c("1992-11-01",
                                                         "1992-11-01",
                                                         "1992-11-02",
                                                         "1992-11-04",
                                                         "1992-11-04",
                                                         "1992-11-06",
                                                         "1992-11-06",
                                                         "1992-11-08")),
                             TEST_RESULT=c(5.4,5.8,5.2,5.6,6,6,7,8))
setkeyv(DT,c("ID","LAB_DT","TEST_RESULT"))

   ID     LAB_DT TEST_RESULT
1:  1 1992-11-01         5.4
2:  1 1992-11-01         5.8
3:  1 1992-11-02         5.2
4:  2 1992-11-04         5.6
5:  2 1992-11-04         6.0
6:  3 1992-11-06         6.0
7:  3 1992-11-06         7.0
8:  3 1992-11-08         8.0

I have already successfully done this via my own method:

DT[,FIRST.LAB_DT:=0]
DT[, FIRST.LAB_DT := c(1L, FIRST.LAB_DT[-1]), by = .(ID,LAB_DT)]

   ID     LAB_DT TEST_RESULT FIRST.LAB_DT
1:  1 1992-11-01         5.4            1
2:  1 1992-11-01         5.8            0
3:  1 1992-11-02         5.2            1
4:  2 1992-11-04         5.6            1
5:  2 1992-11-04         6.0            0
6:  3 1992-11-06         6.0            1
7:  3 1992-11-06         7.0            0
8:  3 1992-11-08         8.0            1

DT[FIRST.LAB_DT==1,]

   ID     LAB_DT TEST_RESULT FIRST.LAB_DT
1:  1 1992-11-01         5.4            1
2:  1 1992-11-02         5.2            1
3:  2 1992-11-04         5.6            1
4:  3 1992-11-06         6.0            1
5:  3 1992-11-08         8.0            1

However, the actual data.table I'm working with has ~8e6 rows and I recently found that you can subset data.table using the keys in a much faster manner.

Would anyone know how to produce the same final output using keys?

theneil
  • 488
  • 1
  • 4
  • 14
  • 3
    Nothing to do with keys, but I think it will be the fastest way to get the result: `DT[DT[, .I[which.min(TEST_RESULT)], .(ID, LAB_DT)]$V1]` – IceCreamToucan Sep 16 '19 at 20:41
  • 2
    @IceCreamToucan I second on this to be the the fastest. I think there was this question from *hadley* that asked about performance of subsetting in `data.table` and there were good discussions/answer on that thread. I'll see if I can find it. p.s. found it already. – M-- Sep 16 '19 at 21:33

2 Answers2

1

Since you report that your current approach works, your data is apparently sorted by increasing TEST_RESULT already. In this case, you can just drop duplicates to keep the first row per group:

unique(DT, by=c("ID", "LAB_DT"))

In general, if it is not sorted, you could roll from -Inf. Like your code, this selects only one row even if there are ties for the min value:

mDT = unique(DT[, .(ID, LAB_DT)])[, TEST_RESULT := -Inf]
DT[DT[mDT, on=names(mDT), roll=-Inf, which=TRUE]]

Finally, if you only have these three columns, you can just use the min function which is optimized for speed here (see ?GForce):

DT[, .(TEST_RESULT = min(TEST_RESULT)), by=.(ID, LAB_DT)]
Frank 2
  • 581
  • 2
  • 8
0
library(dplyr)
DT %>% 
group_by(ID,LAB_DT) %>% 
slice(which.min(TEST_RESULT))
TJ87
  • 404
  • 1
  • 3
  • 13