Find duplicated rows with original

Question

I can get duplicated rows in R on a data.table dt using

dt[duplicated(dt, by=someColumns)]

However, I would like to get pairs of duplicated rows and the "non-duplicates", for example consider dt:

col1, col2, col3 
   A     B    C1
   A     B    C2
   A    B1    C1

Now, dt[duplicated(dt, by=c('col1', "col2")) would give me something along the lines of

col1, col2, col3
   A     B    C2

I would like to get this together with the row that it did not chose to be duplicated, that is

col1, col2, col3 
   A     B    C1
   A     B    C2

Speed comparison of answers:

> system.time(dt[duplicated(dt2, by = t) | duplicated(dt, by = t, fromLast = TRUE)])
   user  system elapsed 
  0.008   0.000   0.009 
> system.time(dt[, .SD[.N > 1], by = t])
   user  system elapsed 
 77.555   0.100  77.703

Where does this `by` argument come from? I don't see any `by` parameter in the help page for `?duplicated`. Are you using `duplicated()` from a package and not `base`? — Chase, Nov 02 '14 at 20:16
same logic applies, just updated my answer to reflect a `data.table` implementation. — Chase, Nov 02 '14 at 20:29

score 2 · Accepted Answer · edited May 23 '17 at 10:24

2

I believe this is essentially a duplicate of this question, though i can see how you may not have found it...

...here's an answer building off the logic outlined in the referenced question:

dt <- read.table(text = "col1 col2 col3 
   A     B    C1
   A     B    C2
   A    B1    C1", header = TRUE, stringsAsFactors = FALSE)


idx <- duplicated(dt[, 1:2]) | duplicated(dt[, 1:2], fromLast = TRUE)

dt[idx, ]
#---
  col1 col2 col3
1    A    B   C1
2    A    B   C2

Since you are using data.table, this is probably what you want:

library(data.table)
dt <- data.table(dt)
dt[duplicated(dt, by = c("col1", "col2")) | duplicated(dt, by = c("col1", "col2"), fromLast = TRUE)]
#---
   col1 col2 col3
1:    A    B   C1
2:    A    B   C2

edited May 23 '17 at 10:24

Community

1
1

answered Nov 02 '14 at 20:14

Chase

67,710
18
144
161

Using your code for `data.table`, I only get as output: `col1 col2 col3 1:A B C2`. I think this is because the `duplicate` function for `data.table` does not know `fromLast`. – FooBar Nov 02 '14 at 20:35
@FooBar - I'm guessing your data.table is out of date, version 1.9.4 clearly supports the `fromLast` parameter: http://www.rdocumentation.org/packages/data.table/functions/duplicated – Chase Nov 02 '14 at 20:54
Thats true. I'm freaking out about this, but my version is '1.9.2' and 'update.packages()' will not update it, apparently thinking its up to date. – FooBar Nov 02 '14 at 21:31

David Arenburg · Answer 2 · 2014-11-03T08:43:26.050

2

You can easily achieve this just by using .N:

dt[, .SD[.N > 1], by = list(col1, col2)]
##    col1 col2 col3
## 1:    A    B   C1
## 2:    A    B   C2

Edit:

You can also try to use binary search which is very efficient, though it seems like duplicated is still more efficient

setkey(dt[, indx := .N, by = list(col1, col2)], indx)[!J(1)]
##    col1 col2 col3
## 1:    A    B   C1
## 2:    A    B   C2

edited Nov 03 '14 at 08:43

answered Nov 02 '14 at 20:39

David Arenburg

91,361
17
137
196

This answer is "nicer" in terms of formulation. However, the nested logic makes it `8500` times slower than the other solution for my `data.table`. – FooBar Nov 02 '14 at 22:35
I've added a binary search solution which is much faster, but I agree that `duplicated` is very efficient. Though I really don't understand why couldn't just use Google. Chase (as he mentioed himself) just took his answer from [here](http://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) – David Arenburg Nov 03 '14 at 08:44

Find duplicated rows with original

2 Answers2

Linked