select only rows with max value over group of columns

Question

I have a dataset with about 100 million rows, structured like this DT:

DT <- data.table(a = c(3,2,1,7,6,5), 
                 b = c("1","1","1","2","2","2"), 
                 c = c("2","2","2","3","3","3"), 
                 d = c(5,6,7,8,9,0))

To select only the rows with the max value of a within each (b, c) group, I use

DT[DT[, .I[which.max(a)], by = list(b,c)]$V1]

which gives

   a b c d
1: 3 1 2 5
2: 7 2 3 8
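To make the idiom above easier to follow, here is a minimal sketch that splits it into two steps. It assumes the data.table package is loaded; the names `idx` and `res` are introduced here only for illustration.

```r
library(data.table)

DT <- data.table(a = c(3, 2, 1, 7, 6, 5),
                 b = c("1", "1", "1", "2", "2", "2"),
                 c = c("2", "2", "2", "3", "3", "3"),
                 d = c(5, 6, 7, 8, 9, 0))

# Step 1: inside each (b, c) group, .I holds the row numbers of that group
# in the full table; which.max(a) picks the position of the first maximum.
# (idx is a hypothetical helper name, not part of the original post.)
idx <- DT[, .I[which.max(a)], by = list(b, c)]$V1

# Step 2: subset the original table by the collected row numbers.
res <- DT[idx]
print(res)
```

Only an integer vector of row numbers is built per group, and the full rows are materialised once at the end.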

It works fine, but I wonder whether there is a faster or more optimal solution. Any advice is welcome!

See [this](http://stackoverflow.com/questions/31852294/how-to-speed-up-subset-by-groups) – David Arenburg, Feb 06 '17 at 10:34
@DavidArenburg Should I mark my question as a duplicate? I missed that post, and it helps me a lot! Thanks – Shin, Feb 06 '17 at 10:54

akrun · Answer 1 · 2017-02-06T10:13:34.887

Here is another option with order. We order the rows by 'a' in increasing order, group by the 'b' and 'c' columns, and take the last row of each group with tail:

DT[order(a), tail(.SD, 1) , .(b, c)]

or with setorder, which sorts DT in place (by reference):

setorder(DT, a)[, tail(.SD, 1), .(b, c)]
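As a quick sanity check, the sketch below compares the order-based approach with the `.I[which.max(a)]` approach from the question on the sample data. The names `res_idx` and `res_ord` are hypothetical and used only here. Note that the grouped result puts the grouping columns first, so the columns are realigned before comparing.

```r
library(data.table)

DT <- data.table(a = c(3, 2, 1, 7, 6, 5),
                 b = c("1", "1", "1", "2", "2", "2"),
                 c = c("2", "2", "2", "3", "3", "3"),
                 d = c(5, 6, 7, 8, 9, 0))

# Approach from the question: row index of the first maximum per group.
res_idx <- DT[DT[, .I[which.max(a)], by = list(b, c)]$V1]

# Approach from this answer: sort by a, keep the last row of each group.
res_ord <- DT[order(a), tail(.SD, 1), by = .(b, c)]

# Grouping columns (b, c) come first in res_ord; restore the original order.
setcolorder(res_ord, names(res_idx))

print(all.equal(res_idx, res_ord, check.attributes = FALSE))
```

One caveat: the two approaches can disagree when a group has several rows sharing the maximum, because which.max returns the first such row while tail after an ascending sort returns the last one.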

edited Feb 06 '17 at 10:13

answered Feb 06 '17 at 10:09

The documentation says you should avoid .SD for a faster solution, but I'll try yours – Shin Feb 06 '17 at 10:13

@Shin That is true, but in your function, you are also using `which.max` – akrun Feb 06 '17 at 10:14
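For readers who want to avoid both `.SD` and `which.max`, one possible alternative is to compute the per-group maximum with plain `max(a)` and join it back to the table. This is only a sketch and has not been benchmarked here; it is not from the original posts, and the names `mx` and `res` are hypothetical. Unlike `which.max`, it returns every row that ties for the maximum.

```r
library(data.table)

DT <- data.table(a = c(3, 2, 1, 7, 6, 5),
                 b = c("1", "1", "1", "2", "2", "2"),
                 c = c("2", "2", "2", "3", "3", "3"),
                 d = c(5, 6, 7, 8, 9, 0))

# Per-group maximum of a; a simple grouped max() is a form that
# data.table can optimise internally.
mx <- DT[, .(a = max(a)), by = .(b, c)]

# Join back on the group columns plus a to recover the full rows.
# Assumption: returning all tied rows is acceptable for the use case.
res <- DT[mx, on = c("b", "c", "a")]
print(res)
```

Whether this beats the `.I` approach on 100 million rows depends on the data and the number of groups, so it is worth timing both on a realistic sample.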
