How do I eliminate duplicates in data.table using mult

Question

The solution to this simple problem has eluded me for several hours. I have a data table in which a value is identified by several classification variables (A, B, L). Where there are observations characterized by duplicate classification variables A & B, I want to retain the one that has the highest 'L'. So, if I have a table generated with this code

set.seed(17)
DT <- data.table(A=rep(c("a","b"),each=5), 
    B=c("a","b","c","d","d","a","b","b","c","d"), 
    L=c(1,1,1,2,1,1,1,2,1,1),
    val=rnbinom(10, size=2, mu=3))

Making the following:

A B L val
 1: a a 1   1
 2: a b 1  10
 3: a c 1   3
 4: a d 1   5
 5: a d 2   2
 6: b a 1   8
 7: b b 1   7
 8: b b 2   1
 9: b c 1   2
10: b d 1   2

I have tried commands such as

setkey(DT,A,B,L)
DT[ , .(A,B,L,val)  , mult="last"]

but I'm just not getting something. I want a resulting table that looks like this

 A B L val
 1: a a 1   1
 2: a b 1  10
 3: a c 1   3
 5: a d 2   2
 6: b a 1   8
 8: b b 2   1
 9: b c 1   2
10: b d 1   2

SethW.Bigelow, it sounds like you need `unique` or `!duplicated`. The `mult=` is typically used on merges; more precisely, `mult=` is used `when 'i' is a 'list' (or 'data.frame' or 'data.table') and _multiple_ rows in 'x' match to the row in 'i', 'mult' controls which are returned: '"all"' (default), '"first"' or '"last"'` (from `?data.table`). If I'm interpreting that correctly, since your `i` is empty, `mult` is not referenced. — r2evans, Mar 24 '21 at 17:55
Is this what you want: [Select the row with the maximum value in each group](https://stackoverflow.com/questions/24558328/select-the-row-with-the-maximum-value-in-each-group) (select row with max L, grouped by A and B) — Henrik, Mar 24 '21 at 18:08

score 1 · Accepted Answer · edited Mar 24 '21 at 18:05

1

DT[, lapply(.SD, last), .(A,B)])

should also work and seems to be a bit faster than the merge solution

edited Mar 24 '21 at 18:05

Yuriy Saraykin

8,390
1
7
14

answered Mar 24 '21 at 18:02

user12728748

8,106
2
9
14

Many fine solutions were offered but this one is the simplest-Seth – Seth W. Bigelow Mar 24 '21 at 18:44
It's only correct if the highest `L` comes last for each subgroup. The question doesn't state that to be necessarily true. – pseudospin Mar 24 '21 at 19:23
@pseudospin: L is the last key of DT in the code of the OP, so the highest key should come last by definition. One should include that line of his code, which I may not have made clear. Independent of that, how/if one would like to deal with possible ties of the highest L in a group is another question, which was not asked or addressed here. – user12728748 Mar 24 '21 at 20:17
Yep, that would work - you should add it to your answer. – pseudospin Mar 24 '21 at 21:25

score 0 · Answer 2 · answered Mar 24 '21 at 17:56

solution option

library(data.table)

dt <- structure(list(A = c("a", "a", "a", "a", "a", "b", "b", "b", 
                           "b", "b"), B = c("a", "b", "c", "d", "d", "a", "b", "b", "c", 
                                            "d"), L = c(1L, 1L, 1L, 1L, 2L, 1L, 1L, 2L, 1L, 1L), val = c(1L, 
                                                                                                         10L, 3L, 5L, 2L, 8L, 7L, 1L, 2L, 2L)), class = "data.frame", row.names = c(NA, 
                                                                                                                                                                                    -10L))

setDT(dt)                                                                                                                 
merge(dt[, list(L = last(L)), by =list(A, B)], dt)
#>    A B L val
#> 1: a a 1   1
#> 2: a b 1  10
#> 3: a c 1   3
#> 4: a d 2   2
#> 5: b a 1   8
#> 6: b b 2   1
#> 7: b c 1   2
#> 8: b d 1   2

^{Created on 2021-03-24 by the reprex package (v1.0.0)}

score 0 · Answer 3 · answered Mar 24 '21 at 18:00

set.seed(17)
library(data.table)
DT <- data.table(A=rep(c("a","b"),each=5), 
                 B=c("a","b","c","d","d","a","b","b","c","d"), 
                 L=c(1,1,1,2,1,1,1,2,1,1),
                 val=rnbinom(10, size=2, mu=3))

result <- DT[DT[, .I[L == max(L)], by = list(A, B)]$V1]

> result
   A B L val
1: a a 1   1
2: a b 1   1
3: a c 1   3
4: a d 2  12
5: b a 1   6
6: b b 2   2
7: b c 1   3
8: b d 1   5

score 0 · Answer 4 · answered Mar 24 '21 at 18:01

Here's how I'd do it (without mult)

DT[order(-L), .SD[1], .(A,B)]

With mult something like this would do it - note that Im doing an actual join here

DT[order(L)][unique(DT[, .(A, B)]), on = c('A', 'B'), mult = 'last']
#>    A B L val
#> 1: a a 1   1
#> 2: a b 1   1
#> 3: a c 1   3
#> 4: a d 2  12
#> 5: b a 1   6
#> 6: b b 2   2
#> 7: b c 1   3
#> 8: b d 1   5

How do I eliminate duplicates in data.table using mult

4 Answers4