3

I can get duplicated rows in R on a data.table dt using

dt[duplicated(dt, by=someColumns)] 

However, I would like to get pairs of duplicated rows and the "non-duplicates", for example consider dt:

col1, col2, col3 
   A     B    C1
   A     B    C2
   A    B1    C1

Now, dt[duplicated(dt, by=c('col1', "col2")) would give me something along the lines of

col1, col2, col3
   A     B    C2

I would like to get this together with the row that it did not chose to be duplicated, that is

col1, col2, col3 
   A     B    C1
   A     B    C2

Speed comparison of answers:

> system.time(dt[duplicated(dt2, by = t) | duplicated(dt, by = t, fromLast = TRUE)])
   user  system elapsed 
  0.008   0.000   0.009 
> system.time(dt[, .SD[.N > 1], by = t])
   user  system elapsed 
 77.555   0.100  77.703 
FooBar
  • 15,724
  • 19
  • 82
  • 171
  • Where does this `by` argument come from? I don't see any `by` parameter in the help page for `?duplicated`. Are you using `duplicated()` from a package and not `base`? – Chase Nov 02 '14 at 20:16
  • @Chase `dt` is a `data.table`, I forgot to mention that. – FooBar Nov 02 '14 at 20:22
  • same logic applies, just updated my answer to reflect a `data.table` implementation. – Chase Nov 02 '14 at 20:29

2 Answers2

2

I believe this is essentially a duplicate of this question, though i can see how you may not have found it...

...here's an answer building off the logic outlined in the referenced question:

dt <- read.table(text = "col1 col2 col3 
   A     B    C1
   A     B    C2
   A    B1    C1", header = TRUE, stringsAsFactors = FALSE)


idx <- duplicated(dt[, 1:2]) | duplicated(dt[, 1:2], fromLast = TRUE)

dt[idx, ]
#---
  col1 col2 col3
1    A    B   C1
2    A    B   C2

Since you are using data.table, this is probably what you want:

library(data.table)
dt <- data.table(dt)
dt[duplicated(dt, by = c("col1", "col2")) | duplicated(dt, by = c("col1", "col2"), fromLast = TRUE)]
#---
   col1 col2 col3
1:    A    B   C1
2:    A    B   C2
Community
  • 1
  • 1
Chase
  • 67,710
  • 18
  • 144
  • 161
  • Using your code for `data.table`, I only get as output: `col1 col2 col3 1:A B C2`. I think this is because the `duplicate` function for `data.table` does not know `fromLast`. – FooBar Nov 02 '14 at 20:35
  • @FooBar - I'm guessing your data.table is out of date, version 1.9.4 clearly supports the `fromLast` parameter: http://www.rdocumentation.org/packages/data.table/functions/duplicated – Chase Nov 02 '14 at 20:54
  • Thats true. I'm freaking out about this, but my version is '1.9.2' and 'update.packages()' will not update it, apparently thinking its up to date. – FooBar Nov 02 '14 at 21:31
2

You can easily achieve this just by using .N:

dt[, .SD[.N > 1], by = list(col1, col2)]
##    col1 col2 col3
## 1:    A    B   C1
## 2:    A    B   C2

Edit:

You can also try to use binary search which is very efficient, though it seems like duplicated is still more efficient

setkey(dt[, indx := .N, by = list(col1, col2)], indx)[!J(1)]
##    col1 col2 col3
## 1:    A    B   C1
## 2:    A    B   C2
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
  • This answer is "nicer" in terms of formulation. However, the nested logic makes it `8500` times slower than the other solution for my `data.table`. – FooBar Nov 02 '14 at 22:35
  • I've added a binary search solution which is much faster, but I agree that `duplicated` is very efficient. Though I really don't understand why couldn't just use Google. Chase (as he mentioed himself) just took his answer from [here](http://stackoverflow.com/questions/7854433/finding-all-duplicate-rows-including-elements-with-smaller-subscripts) – David Arenburg Nov 03 '14 at 08:44