
Here is the code for this question:

set.seed(1337)
myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], 500, TRUE),
                   Data = sample(1:26, 500, replace = TRUE))
setkey(myDT, Key1, Key2)
# showing what myDT looks like
> myDT
     Key1 Key2 Data
  1:    a    A    6
  2:    a    A    3
  3:    a    B    2
  4:    a    B   20
  5:    a    B   13
 ---               
496:    z    D   23
497:    z    E    3
498:    z    E   18
499:    z    E   11
500:    z    E    2

I would like to pare down myDT to keep only the largest Data value for each (Key1, Key2) pair. E.g. for (a,A) I would like to get rid of the row where Data is 3 and keep the row where Data is 6. For (z,E) I would like to keep only the row where Data is 18.

While typing out this question, a solution came to me (which I'll post below), but I'd still like to know how you would approach this problem.

jks612

3 Answers


My answer

myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)]
# if you are on data.table 1.9.6 or lower, use this one instead
myDT[order(-Data), .SD[1], by = .(Key1, Key2)]

Or, from the comments:

unique(myDT[order(-Data)], by = c("Key1", "Key2"))
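As a quick sanity check (a sketch on the question's 500-row data, not part of the original answer), the `head(.SD, 1)` and `unique` approaches should select the same rows once both results are sorted by the key:

```r
library(data.table)
set.seed(1337)
myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], 500, TRUE),
                   Data = sample(1:26, 500, replace = TRUE))
setkey(myDT, Key1, Key2)

# one row per group via head(.SD, 1); keyby sorts the result by the key
a <- myDT[order(-Data), head(.SD, 1), keyby = .(Key1, Key2)]

# one row per group via unique(): keeps the first occurrence per key,
# which after order(-Data) is the group maximum
b <- unique(myDT[order(-Data)], by = c("Key1", "Key2"))
setkey(b, Key1, Key2)

isTRUE(all.equal(as.data.frame(a), as.data.frame(b)))  # TRUE
```

Both rely on the same property: data.table's ordering is stable, so after sorting by `-Data` the first row seen in each group is its maximum.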

Benchmark on 50M rows.

library(dplyr)
library(data.table)
library(microbenchmark)
set.seed(1337)
n = 5e7
myDT <- data.table(Key1 = sample(letters, n, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], n, TRUE),
                   Data = sample(1:26, n, replace = TRUE))
setkey(myDT, Key1, Key2)

microbenchmark(times = 10L,
               CathG = myDT[, .SD[which.max(Data)], by = .(Key1, Key2)],
               jangorecki = myDT[order(-Data), head(.SD, 1), by = .(Key1, Key2)],
               jangorecki.keeporder = myDT[order(-Data), head(.SD, 1), keyby = .(Key1, Key2)],
               nist = myDT %>% group_by(Key1,Key2) %>% summarise(Data = max(Data)),
               David = unique(myDT[order(-Data)], by = c("Key1", "Key2")))

#Unit: milliseconds
#                 expr       min        lq      mean   median        uq       max neval
#                CathG  659.6150  689.3035  733.9177  739.795  780.0075  811.1456    10
#           jangorecki 2844.7565 3026.3385 3089.6764 3097.332 3219.1951 3343.9919    10
# jangorecki.keeporder 2935.3733 3194.1606 3232.9297 3214.581 3308.0735 3411.4319    10
#                 nist  803.1921  844.5002 1011.7878 1007.755 1188.6127 1228.3869    10
#                David 3410.4853 3501.5918 3590.2382 3590.190 3652.8091 3803.9038    10

A previously posted benchmark on small data showed very different results, so I would say performance depends heavily on the data: not just volume but also cardinality (the count of unique values), perhaps even more so in some cases.
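The cardinality effect can be probed with a sketch like the following: same row count, but `Key1` drawn from 26 versus 26,000 distinct values (the 26,000 figure and the 1M row count are arbitrary choices for illustration; timings will vary by machine):

```r
library(data.table)
set.seed(1337)
n <- 1e6
for (k in c(26L, 26000L)) {   # low vs high cardinality of Key1
  DT <- data.table(Key1 = sample(k, n, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], n, TRUE),
                   Data = sample(1:26, n, replace = TRUE))
  # more distinct Key1 values means more (smaller) groups,
  # which shifts where the per-group overhead dominates
  print(system.time(DT[, .SD[which.max(Data)], by = .(Key1, Key2)]))
}
```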

jangorecki

Another way to do it, based on this related question, is:

myDT[, .SD[which.max(Data)], by = .(Key1, Key2)]
#      Key1 Key2 Data
#   1:    a    A    6
#   2:    a    B   20
#   3:    a    C   25
#   4:    a    E    7
#   5:    b    A   25
#  ---
# 119:    z    A   23
# 120:    z    B   26
# 121:    z    C   24
# 122:    z    D   25
# 123:    z    E   18
Cath
  • This looks right, but I can't get it to terminate. My data has 120k rows and 64 columns. Would it be wise to pull out the three columns, do this operation, and then merge them back in? – jks612 Dec 09 '15 at 16:24
  • I followed the link you gave to use Matt Dowle's efficient implementation. Why is this version inefficient? – jks612 Dec 09 '15 at 16:33
  • @jks612 `[` on `.SD` adds some overhead, see my answer and benchmarks – jangorecki Dec 10 '15 at 01:22
  • @jks612 did you try `myDT[myDT[,.I[which.max(Data)],by=.(Key1,Key2)][['V1']]]` ? You should go with jangorecki's answer, it's both nice and efficient! :-) – Cath Dec 10 '15 at 07:48
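Expanding the `.I` one-liner from the comment above (a sketch on the question's data): `.I` collects the global row index of each group's maximum, so the table is subset only once at the end, avoiding the per-group `[` on `.SD` overhead mentioned earlier:

```r
library(data.table)
set.seed(1337)
myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE),
                   Key2 = sample(LETTERS[1:5], 500, TRUE),
                   Data = sample(1:26, 500, replace = TRUE))
setkey(myDT, Key1, Key2)

# .I is the row number in the full table; which.max picks the row
# within each group, so V1 holds one global index per group
idx <- myDT[, .I[which.max(Data)], by = .(Key1, Key2)][["V1"]]
res <- myDT[idx]   # single subset, all columns kept
```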

A faster and nicer way to solve it, using dplyr:

myDT %>% group_by(Key1,Key2) %>% summarise(Data = max(Data))

To keep all existing columns in the data, you can use slice instead of summarise:

myDT %>% group_by(Key1,Key2) %>% slice(which.max(Data))

Note that this will return exactly one row per group; in case of ties, it will be the first maximal row of the Data column.
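A minimal sketch of that tie behaviour, using a toy tibble with a deliberate tie (the `id` column is added here just to tell the tied rows apart):

```r
library(dplyr)

# Group (a, A) has two rows tied at Data == 5
df <- tibble(Key1 = c("a", "a", "b"),
             Key2 = c("A", "A", "B"),
             id   = 1:3,
             Data = c(5, 5, 2))

# slice(which.max(...)) keeps exactly one row per group: the first maximum
df %>% group_by(Key1, Key2) %>% slice(which.max(Data))
# keeps id 1 (not id 2) for group (a, A), plus id 3 for (b, B)

# by contrast, filter() keeps all tied maxima
df %>% group_by(Key1, Key2) %>% filter(Data == max(Data))
# keeps ids 1, 2 and 3
```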

talat
nist
  • How would you merge in any missing columns? `myDT <- data.table(Key1 = sample(letters, 500, replace = TRUE), Key2 = sample(LETTERS[1:5], 500, TRUE), superfluous = 1:500, Data = sample(1:26, 500, replace = TRUE))` – jks612 Dec 09 '15 at 16:28
  • @jks612, see the updated answer on how to keep other columns – talat Dec 10 '15 at 12:42