0

I have a dataset with ~ 100 mln rows, some kind of that DT

DT <- data.table(a = c(3,2,1,7,6,5), 
                 b = c("1","1","1","2","2","2"), 
                 c = c("2","2","2","3","3","3"), 
                 d = c(5,6,7,8,9,0))

For select only rows with max value over group (b,c), I use

DT[DT[, .I[which.max(a)], by = list(b,c)]$V1]

which gives

   a b c d
1: 3 1 2 5
2: 7 2 3 8

It works fine, but my question is maybe it's not a faster/optimal solution. Any advices are welcome!

Shin
  • 251
  • 1
  • 3
  • 8

1 Answers1

0

Here is another option with order. We group by 'b', 'c' columns, order the rows based on the 'a' values in increasing order and get the last row using tail

DT[order(a), tail(.SD, 1) , .(b, c)]

or with setorder

setorder(DT, a)[, tail(.SD, 1), .(b, c)]
akrun
  • 874,273
  • 37
  • 540
  • 662