3

In the following data table I'd like to select rows with unique id and the smallest value from the column pos2:

dt = data.table(id = c(1,2,2,3,3,3),
            pos1 = c(0.1, 0.2, 0.2, 0.3, 0.3, 0.3),
            pos2 = c(0.1, 0.25, 0.21, 0.34, 0.31, 0.32))

   id pos1 pos2
1:  1  0.1 0.10
2:  2  0.2 0.25
3:  2  0.2 0.21
4:  3  0.3 0.34
5:  3  0.3 0.31
6:  3  0.3 0.32

The way I'm doing it now is by creating an intermediate table:

dt.red = dt[, .(pos2 = first(sort(pos2))), by = id]

   id pos2
1:  1 0.10
2:  2 0.21
3:  3 0.31

Then I merge to obtain the desired end result:

merge(dt, dt.red)

   id pos2 pos1
1:  1 0.10  0.1
2:  2 0.21  0.2
3:  3 0.31  0.3

Is there a cleaner way of achieving that with data.table?

mattek
  • 903
  • 1
  • 6
  • 18

2 Answers2

6

It's also possible to do this without .I, but it'll be slower*

dt[order(pos2), head(.SD, 1), id]
#    id pos1 pos2
# 1:  1  0.1 0.10
# 2:  2  0.2 0.21
# 3:  3  0.3 0.31

*Or maybe not, see comments below

IceCreamToucan
  • 28,083
  • 2
  • 22
  • 38
  • 1
    It might not be much slower. There's optimization built in for it anyway, visible with `dt[order(pos2), head(.SD, 1), by=id, verbose=TRUE]` – Frank Mar 21 '19 at 14:31
  • Yeah, maybe my description here was too strong. I was basing that statement off of the benchmark in [this question](https://stackoverflow.com/questions/50093919/dplyrslice-in-data-table). Make of that benchmark what you will, it's a little different since there's no `order()`, and in any case the test data may be too small to make any conclusions. – IceCreamToucan Mar 21 '19 at 15:13
  • 1
    Ok. `head(.SD, 1)` has been GForce-optimized for a while, but only for n=1. That might've been extended by https://github.com/Rdatatable/data.table/pull/3463 recently, but I'm not sure. – Frank Mar 21 '19 at 15:16
  • Ok, I see. That benchmark doesn't reflect the optimization which would be used in this case, since it only tests `n > 1` – IceCreamToucan Mar 21 '19 at 15:24
  • 1
    @Frank no it wasn't yet. This PR handles missing `n` to fill with `6L`, which previously errors, but that still disables gforce – jangorecki Mar 22 '19 at 04:04
5

We can use .I to get the row index and use that in i for subsetting the rows

dt[dt[order(pos2), .I[1], by = id]$V1]
#   id pos1 pos2
#1:  1  0.1 0.10
#2:  2  0.2 0.21
#3:  3  0.3 0.31

Or with setorder

setorder(dt, id, pos2)[, .SD[1L], id]
akrun
  • 874,273
  • 37
  • 540
  • 662