
If I understand correctly, the `duplicated()` function for data.table returns a logical vector which does not mark the first occurrence of a duplicated record. What is the best way to mark this first occurrence as well? With `base::duplicated()`, I solved this simply by a disjunction with the reversed-order call: `myDups <- (duplicated(x) | duplicated(x, fromLast=TRUE))` - but in `data.table::duplicated()`, `fromLast=TRUE` is not available (I don't know why)...
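To illustrate, the base `duplicated()` idiom on a toy vector marks every member of each duplicated group, not just the later occurrences:

```r
x <- c("a", "b", "a", "c", "a")
duplicated(x)                                   # FALSE FALSE TRUE FALSE TRUE
duplicated(x, fromLast = TRUE)                  # TRUE FALSE TRUE FALSE FALSE
duplicated(x) | duplicated(x, fromLast = TRUE)  # TRUE FALSE TRUE FALSE TRUE
```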

P.S. OK, here's a primitive example:

myDT <- fread(
"id,fB,fC
 1, b1,c1
 2, b2,c2
 3, b1,c1
 4, b3,c3
 5, b1,c1
")
setkeyv(myDT, c('fB', 'fC'))
myDT[, fD:=duplicated(myDT)]

Rows 1, 3 and 5 are all duplicates, but `duplicated()` will mark only rows 3 and 5, while I need to mark all of them.

UPD. Important notice: the answer I've accepted below works only for a keyed table. If you want to find duplicate records considering all columns, you have to `setkey` on all these columns explicitly. So far I use the following workaround specifically for this case:

dups1 <- duplicated(myDT)
dups2 <- duplicated(myDT, fromLast=TRUE)
dups <- dups1 | dups2
Vasily A
  • Can there be >1 *type* as well as >1 *token*? – gung - Reinstate Monica Oct 15 '13 at 22:42
  • excuse me gung, I don't get the question - what are _type_ and _token_ here? – Vasily A Oct 15 '13 at 22:45
  • [Type token distinction](http://en.wikipedia.org/wiki/Type%E2%80%93token_distinction); sorry, what I mean is that in your example, there is only 1 kind of row that's duplicated (ie, "b1,c1"). But could there be, eg, "6, b2,c2" & if so, would you want to get that also? – gung - Reinstate Monica Oct 15 '13 at 22:54
  • thanks for pointing! Yes, sure, multiple "b2,c2" should be included as well - as any record which has duplicates. – Vasily A Oct 15 '13 at 23:02
  • @VasilyA The accepted answer does not rely on the key being set. You can check for duplicates against any set of columns you like, irrespective of the key, like `duplicated(x, by=my_cols)` or `duplicated(x, by=NULL)`. The behavior in the latter case is covered in the docs and seems to be what you're after regarding the key. For your `nrow(myDT):1` idea, use the `fromLast` argument. Oh wait, I see you're right about the accepted answer, but I think my point about fromLast may be helpful. – Frank Nov 13 '15 at 03:42
  • first, thanks for pointing to `fromLast`! I felt that there must be better way than `nrow(...):1`, – Vasily A Nov 13 '15 at 03:55
  • but for my specific case, I still don't see how this will work (I mean, to detect all records having duplicates) without explicitly indicating all columns. For example, take the sample table in my question and change the last row to be "4, b1, c1" - so it will be a duplicate of the first row, i.e. the answer I need will be `TRUE FALSE FALSE FALSE TRUE`, but `dups1 <- duplicated(myDT)` will give `FALSE FALSE FALSE FALSE TRUE` and using the code from the answer - `dups1 | c(tail(dups1, -1), FALSE)` - will give `FALSE FALSE FALSE TRUE TRUE`. – Vasily A Nov 13 '15 at 03:57
  • Yeah, props to the answer below for somehow calling `duplicated` only once, but I would do as you have done above `duplicated(x) | duplicated(x, fromLast=TRUE)`, which is the R idiom. mnel used `duplicated.data.frame` in his answer below because the option wasn't yet available in `duplicated.data.table` (came in Oct 2014: https://github.com/Rdatatable/data.table ) – Frank Nov 13 '15 at 04:09
  • aha, now I see. Thanks for the explanation! – Vasily A Nov 13 '15 at 04:30
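Summing up Frank's comments above: in data.table versions that support `fromLast` and `by` in `duplicated.data.table` (1.9.4 and later, per the comment thread), no key is needed at all. A minimal sketch:

```r
library(data.table)

myDT <- data.table(id = 1:5,
                   fB = c("b1","b2","b1","b3","b1"),
                   fC = c("c1","c2","c1","c3","c1"))

# Compare on chosen columns via 'by' - no setkey required
cols <- c("fB", "fC")
myDT[, fD := duplicated(myDT, by = cols) |
             duplicated(myDT, by = cols, fromLast = TRUE)]
myDT$fD  # TRUE FALSE TRUE FALSE TRUE

# by = NULL compares on all columns (here 'id' is unique, so all FALSE)
duplicated(myDT, by = NULL)
```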

4 Answers


This appears to work:

> myDT[unique(myDT),fD:=.N>1]
> myDT
   id  fB fC    fD
1:  1  b1 c1  TRUE
2:  3  b1 c1  TRUE
3:  5  b1 c1  TRUE
4:  2  b2 c2 FALSE
5:  4  b3 c3 FALSE

Thanks to @flodel, the better way to do it is this:

> myDT[, fD := .N > 1, by = key(myDT)]
> myDT
   id  fB fC    fD
1:  1  b1 c1  TRUE
2:  3  b1 c1  TRUE
3:  5  b1 c1  TRUE
4:  2  b2 c2 FALSE
5:  4  b3 c3 FALSE

The difference in efficiency is substantial:

> microbenchmark(
    key=myDT[, fD := .N > 1, by = key(myDT)],
    unique=myDT[unique(myDT),fD:=.N>1])
Unit: microseconds
   expr      min       lq    median        uq       max neval
    key  679.874  715.700  735.0575  773.7595  1825.437   100
 unique 1417.845 1485.913 1522.7475 1567.9065 24053.645   100

Especially for the max. What's going on there?

mrip
  • thanks, it works indeed. I just wanted to avoid `unique` because it returns the whole table instead of just logical vector, so I was afraid that it can be not the best way when working with big tables. – Vasily A Oct 15 '13 at 23:06

Many years ago this was the fastest answer by a large margin (see revision history if interested):

dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]

There have been a lot of internal changes since then, however, which have made many of the options perform about the same:

myDT <- data.table(id = sample(1e6),
                   fB = sample(seq_len(1e3), size = 1e6, replace = TRUE),
                   fC = sample(seq_len(1e3), size = 1e6, replace = TRUE))
setkey(myDT, fB, fC)

microbenchmark(
   key=myDT[, fD := .N > 1, by = key(myDT)],
   unique=myDT[unique(myDT, by = key(myDT)),fD:=.N>1], 
   dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
              .SDcols = key(myDT)],
   dup2 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(tail(dups, -1L), FALSE)]},
   dup3 = {dups = duplicated(myDT, by = key(myDT)); myDT[, fD := dups | c(dups[-1L], FALSE)]},
   times=10)

#   expr       min        lq      mean    median        uq       max neval
#    key  523.3568  567.5372  632.2379  578.1474  678.4399  886.8199    10
# unique  189.7692  196.0417  215.4985  210.5258  224.4306  290.2597    10
#    dup 4440.8395 4685.1862 4786.6176 4752.8271 4900.4952 5148.3648    10
#   dup2  143.2756  153.3738  236.4034  161.2133  318.1504  419.4082    10
#   dup3  144.1497  150.9244  193.3058  166.9541  178.0061  460.5448    10
eddi
  • wow, amazing! If I understand correctly, here we use the fact that after `setkey` our duplicates are grouped, so we can simply "shift" our `duplicated` vector by one position - is it right? – Vasily A Oct 16 '13 at 21:05
  • And you make this even faster by avoiding the call to `tail`, and the associated method dispatch time. (I've added this to your answer) – mnel Oct 16 '13 at 22:23
  • (not sure if someone can notice comments on such old posts) - I have just realized that all those approaches work *only* for keyed tables, i.e. if I want to include all columns to be compared, I must explicitly key all of them. Please correct me if I'm wrong here. – Vasily A Nov 09 '15 at 04:30
  • @VasilyA that's correct, or use the new 'on' argument – eddi Nov 09 '15 at 14:18

A third approach, which appears more efficient for this small example:

You can explicitly call `duplicated.data.frame`...

myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
  .SDcols = key(myDT)]


 microbenchmark(
   key=myDT[, fD := .N > 1, by = key(myDT)],
   unique=myDT[unique(myDT),fD:=.N>1], 
  dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE), 
    .SDcols = key(myDT)])
## Unit: microseconds
##    expr      min        lq   median        uq       max neval
##     key  556.608  575.9265  588.906  600.9795 27713.242   100
##  unique 1112.913 1164.8310 1183.244 1216.9000  2263.557   100
##     dup  420.173  436.3220  448.396  461.3750   699.986   100

If we expand the size of the sample data.table, then the key approach is the clear winner:

myDT <- data.table(id = sample(1e6),
                   fB = sample(seq_len(1e3), size = 1e6, replace = TRUE),
                   fC = sample(seq_len(1e3), size = 1e6, replace = TRUE))
setkeyv(myDT, c('fB', 'fC'))

microbenchmark(
  key=myDT[, fD := .N > 1, by = key(myDT)],
  unique=myDT[unique(myDT),fD:=.N>1], 
  dup = myDT[,fD := duplicated.data.frame(.SD)|duplicated.data.frame(.SD, fromLast=TRUE),
   .SDcols = key(myDT)],times=10)
## Unit: milliseconds
##    expr       min        lq    median        uq       max neval
##     key  355.9258  358.1764  360.7628  450.9218  500.8360    10
##  unique  451.3794  458.0258  483.3655  519.3341  553.2515    10
##     dup 1690.1579 1721.5784 1775.5948 1826.0298 1845.4012    10
mnel

As of data.table version 1.9.8, the solution by eddi needs to be modified to be:

dups = duplicated(myDT, by = key(myDT));
myDT[, fD := dups | c(tail(dups, -1), FALSE)]

since:

> Changes in v1.9.8 (on CRAN 25 Nov 2016)
>
> POTENTIALLY BREAKING CHANGES
>
> By default all columns are now used by `unique()`, `duplicated()` and `uniqueN()` data.table methods, #1284 and #1841. To restore old behaviour: `options(datatable.old.unique.by.key=TRUE)`. In 1 year this option to restore the old default will be deprecated with warning. In 2 years the option will be removed. Please explicitly pass `by=key(DT)` for clarity. Only code that relies on the default is affected. 266 CRAN and Bioconductor packages using data.table were checked before release. 9 needed to change and were notified. Any lines of code without test coverage will have been missed by these checks. Any packages not on CRAN or Bioconductor were not checked.
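To illustrate the change (a sketch, assuming data.table >= 1.9.8): by default `duplicated()` now considers *all* columns, so passing `by = key(myDT)` explicitly restores the pre-1.9.8 keyed behaviour:

```r
library(data.table)

myDT <- data.table(id = 1:5,
                   fB = c("b1","b2","b1","b3","b1"),
                   fC = c("c1","c2","c1","c3","c1"))
setkey(myDT, fB, fC)

duplicated(myDT)                  # all FALSE: every row is unique once 'id' is included
duplicated(myDT, by = key(myDT))  # compares on fB,fC only: TRUE for the later b1/c1 rows
```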

Alex
  • Does this require to use setkey on the columns we want to use to select unique rows? Is there a way to directly indicate the columns in the command? – Herman Toothrot May 05 '23 at 12:51