I have a table where the key is repeated a number of times, and one to select just one row for each key, using the largest value of another column.
This example demonstrates the solution I have at the moment:
N = 10
k = 2
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
X Y
1: 1 -1.37925206
2: 1 -0.53837461
3: 2 0.26516340
4: 2 -0.04643483
5: 3 0.40331424
6: 3 0.28667275
7: 4 -0.30342327
8: 4 -2.13143267
9: 5 2.11178673
10: 5 -0.98047230
11: 6 -0.27230783
12: 6 -0.79540934
13: 7 1.54264549
14: 7 0.40079650
15: 8 -0.98474297
16: 8 0.73179201
17: 9 -0.34590491
18: 9 -0.55897393
19: 10 0.97523187
20: 10 1.16924293
> DT[, .SD[Y == max(Y)], by = X]
X Y
1: 1 -0.5383746
2: 2 0.2651634
3: 3 0.4033142
4: 4 -0.3034233
5: 5 2.1117867
6: 6 -0.2723078
7: 7 1.5426455
8: 8 0.7317920
9: 9 -0.3459049
10: 10 1.1692429
The problem is that for larger data.tables this take a very long time:
N = 10000
k = 25
DT = data.table(X = rep(1:N, each = k), Y = rnorm(k*N))
system.time(DT[, .SD[Y == max(Y)], by = X])
user system elapsed
9.69 0.00 9.69
My actual table about 100 million rows...
Can anyone suggest a more efficient solution?
Edit - importance of set key
The solution proposed works well, but you must use setkey, or have the DT ordered for it to work:
See Example without "each" in rep:
N = 10
k = 2
DT = data.table(X = rep(1:N, k), Y = rnorm(k*N))
DT[DT[, Y == max(Y), by = X]$V1,]
X Y
1: 1 1.26925708
2: 4 -0.66625732
3: 5 0.41498548
4: 8 0.03531185
5: 9 0.30608380
6: 1 0.50308578
7: 4 0.19848227
8: 6 0.86458423
9: 8 0.69825500
10: 10 -0.38160503