Select one row from each group in a large data.table based on a condition

Question

I have a table where the key is repeated a number of times, and one to select just one row for each key, using the largest value of another column.

This example demonstrates the solution I have at the moment:

N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
     X           Y
 1:  1 -1.37925206
 2:  1 -0.53837461
 3:  2  0.26516340
 4:  2 -0.04643483
 5:  3  0.40331424
 6:  3  0.28667275
 7:  4 -0.30342327
 8:  4 -2.13143267
 9:  5  2.11178673
10:  5 -0.98047230
11:  6 -0.27230783
12:  6 -0.79540934
13:  7  1.54264549
14:  7  0.40079650
15:  8 -0.98474297
16:  8  0.73179201
17:  9 -0.34590491
18:  9 -0.55897393
19: 10  0.97523187
20: 10  1.16924293
> DT[, .SD[Y == max(Y)], by = X]
     X          Y
 1:  1 -0.5383746
 2:  2  0.2651634
 3:  3  0.4033142
 4:  4 -0.3034233
 5:  5  2.1117867
 6:  6 -0.2723078
 7:  7  1.5426455
 8:  8  0.7317920
 9:  9 -0.3459049
10: 10  1.1692429

The problem is that for larger data.tables this take a very long time:

N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
   user  system elapsed 
   9.69    0.00    9.69

My actual table about 100 million rows...

Can anyone suggest a more efficient solution?

Edit - importance of set key

The solution proposed works well, but you must use setkey, or have the DT ordered for it to work:

See Example without "each" in rep:

N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
     X           Y
 1:  1  1.26925708
 2:  4 -0.66625732
 3:  5  0.41498548
 4:  8  0.03531185
 5:  9  0.30608380
 6:  1  0.50308578
 7:  4  0.19848227
 8:  6  0.86458423
 9:  8  0.69825500
10: 10 -0.38160503

This is because `.SD[..]` calls `[.data.table` for each group. We are aware of this, will most likely be taken optimised in 1.9.8. There are solutions on SO which get around this (usually using `.I`) you can find by searching. I'll update this post once optimised. — Arun, Jan 09 '15 at 14:49
@Arun The `cran` version is giving a different result/error compared to `devel` version for the `.SD` solution (tested by Khashaa and AnandaMahto). I am using the devel version. — akrun, Jan 09 '15 at 15:27
@akrun, corone, if you're on 1.9.4, do: `options(datatable.auto.index=FALSE)` and things should work fine. It was a bug in the new feature that was fixed in 1.9.5. — Arun, Jan 09 '15 at 15:44
@Corone, not sure if you are aware but your code would return multiple rows per unique X if max(Y) was duplicated for a given X. — talat, Jan 09 '15 at 16:19
@docendodiscimus yes, good point. In my actual case the data won't have ties (or it is as unlikely as getting a tie in the example), but yes the "condition" needs to be uniquely satisfied per group I suppose. — Corvus, Jan 09 '15 at 16:41
You could possibly give this a try: `unique(setorder(DT, -Y), by = X)`. For your larger sample data, the system.time on my laptop was all 0 for that approach. And it will guarantee exactly one row per X to be returnd, also in case of ties. — talat, Jan 09 '15 at 17:05

akrun · Accepted Answer · 2015-01-09T16:47:57.913

7

This would be faster compared to .SD

 system.time({setkey(DT, X)
    DT[DT[,Y==max(Y), by=X]$V1,]})
  # user  system elapsed 
  #0.016   0.000   0.016

Or

system.time(DT[DT[, .I[Y==max(Y)], by=X]$V1])
#  user  system elapsed 
# 0.023   0.000   0.023

If there are only two columns,

system.time(DT[,list(Y=max(Y)), by=X])
#   user  system elapsed 
#  0.006   0.000   0.007

Compared to,

system.time(DT[, .SD[Y == max(Y)], by = X] )
#  user  system elapsed 
# 2.946   0.006   2.962

Based on comments from @Khashaa, @AnandaMahto, the CRAN version (1.9.4) gives a different result for the .SD method compared to devel version (1.9.5) (which I used). You could get the same result for "CRAN" version (from @Arun's comments) by setting the options

 options(datatable.auto.index=FALSE)

NOTE: In case of "ties", the solutions described here will return multiple rows for each group (as mentioned by @docendo discimus). My solutions are based on the "code" posted by the OP.

If there are "ties", then you could use unique with by option (in case the number of columns are > 2)

 setkey(DT,X)
 unique(DT[DT[,Y==max(Y), by=X]$V1,], by=c("X", "Y"))

microbenchmarks

library(microbenchmark)
f1 <- function(){setkey(DT,X)[DT[, Y==max(Y), by=X]$V1,]}
f2 <- function(){DT[DT[, .I[Y==max(Y)], by=X]$V1]}
f3 <- function(){DT[, list(Y=max(Y)), by=X]}
f4 <- function(){DT[, .SD[Y==max(Y)], by=X]}
microbenchmark(f1(), f2(), f3(), f4(), unit='relative', times=20L)
#Unit: relative
# expr        min         lq       mean     median         uq        max neval
# f1()   2.794435   2.733706   3.024097   2.756398   2.832654   6.697893    20
# f2()   4.302534   4.291715   4.535051   4.271834   4.342437   8.114811    20
# f3()   1.000000   1.000000   1.000000   1.000000   1.000000   1.000000    20
# f4() 533.119480 522.069189 504.739719 507.494095 493.641512 466.862691    20
# cld
#  a 
#  a 
#  a 
#  b

data

N = 10000
k = 25
set.seed(25)
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))

edited Jan 09 '15 at 16:47

answered Jan 09 '15 at 14:46

akrun

874,273
37
540
662

Can I ask why `DT[, .SD[Y == max(Y)], by = X]` returns 2010 observations, whereas `DT[,list(Y=max(Y)), by=X]` returns 10000 (as it should)? – Khashaa Jan 09 '15 at 14:56
@Khashaa I get `DT1 <- DT[, .SD[Y == max(Y)], by = X];dim(DT1) #[1] 10000 2`. I am using `data.table_1.9.5` – akrun Jan 09 '15 at 14:59
One thing I just noticed - this relies on the `data.table` being keyed by X (which I had said it was, then realised it wasn't!). Turns out adding the key takes very little time though, so this is still a good solution. – Corvus Jan 09 '15 at 14:59
@Corone You could use `setkey`, which may improve the timings. I think a `microbenchmark` would give more info. – akrun Jan 09 '15 at 15:01
What I meant was that if you don't `setkey` you get the wrong answer. When you group by X that changes the order, if DT is not already keyed by X. – Corvus Jan 09 '15 at 15:03
@Corone Not sure how it will change the results. `V1` gives the `TRUE/FALSE` logical vector based on the original order (in the first solution) and it is used for subsetting the `DT`. In the second, `.I` gives the `row index` which was used for subsetting. – akrun Jan 09 '15 at 15:09
Strangely, the error persists when I tried on a different computer with `data.table_1.9.4`. DT1 has `X` values 1, 7, 16 ...etc – Khashaa Jan 09 '15 at 15:11
@Khashaa Can you try it on `devel` version? – akrun Jan 09 '15 at 15:15
1

And the error gone, with `devel` version:) – Khashaa Jan 09 '15 at 15:15
1

But I don't think we should rely on a devel version as a general resort. Are we clear on what defines the expected behavior? – A5C1D2H2I1M1N2O1R2T1 Jan 09 '15 at 15:18
@AnandaMahto I am using the devel version. Could you confirm if the error persists in the cran version? – akrun Jan 09 '15 at 15:20
Any insight on why CRAN version gives different result? – Khashaa Jan 09 '15 at 15:20
Yup. Same results as @Khashaa with the CRAN version. – A5C1D2H2I1M1N2O1R2T1 Jan 09 '15 at 15:22
@Khashaa Could you try `setkey(DT, X)[, .SD[Y == max(Y)], by = X]` – akrun Jan 09 '15 at 15:23
@akrun, Still 2010 observations.... – A5C1D2H2I1M1N2O1R2T1 Jan 09 '15 at 15:24
@AnandaMahto No idea why it is so. Perhaps Arun will give some insights.. – akrun Jan 09 '15 at 15:25
@akrun `V1` doesn't give `TRUE/FALSE` in the original order - it gives it in the order of the sorted key, because the by part has to sort it. See example I added to question where key is not ordered. – Corvus Jan 09 '15 at 15:35
@Corone Thanks for that. Sorry, I didn't test it on a unordered dataset. – akrun Jan 09 '15 at 15:38
1

@Corone If you use `.I`, it won't be a problem. – akrun Jan 09 '15 at 15:42
Oddly I only get the reduced set error on unkeyed DT. If I do setkey(DT) then I get 10000 results. Although I'm getting a slightly random number of results, around 2020-2040. – Corvus Jan 09 '15 at 15:43
I'm not a data.table expert, but wouldn't the first two solutions in your answer return multiple rows per group in case of ties? – talat Jan 09 '15 at 16:12
@docendodiscimus Yes, it would return that. My solution was based on the OP's code and the `ties` issue was not discussed in the question.. – akrun Jan 09 '15 at 16:13
Sure, no problem.. I just wondered about that because the title is asking for one row explicitly. I commented on the question – talat Jan 09 '15 at 16:25
@docendodiscimus I did edited as a "NOTE". But, I think he wanted to get a faster solution for the same code he wrote. – akrun Jan 09 '15 at 16:27

Select one row from each group in a large data.table based on a condition

1 Answers1

microbenchmarks

data