Remove all duplicate rows including the "reference" row

Question

I am looking for a way to remove all duplicate elements from a vector, including the reference element. By reference element I mean the element which is currently used in comparisons, to search for its duplicates. For instance, if we consider this vector:

a = c(1,2,3,3,4,5,6,7,7,8)

I would like to obtain:

b = c(1,2,4,5,6,8)

I am aware of duplicated() and unique() but they do not provide the result I am looking for.

Duplicate: [How can I remove all duplicates so that none are left in a data frame?](http://stackoverflow.com/q/13763216/903061) – Gregor Thomas, Oct 13 '15 at 16:52

score 5 · Accepted Answer · answered Feb 17 '14 at 14:40

Here's one way:

a[!(duplicated(a) | rev(duplicated(rev(a))))]
# [1] 1 2 4 5 6 8
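
To see why this works, it may help to look at the two masks separately. This is only an illustrative breakdown of the one-liner above; the names `later`, `earlier` and `keep` are made up for the example.

```r
a <- c(1, 2, 3, 3, 4, 5, 6, 7, 7, 8)

# TRUE for every occurrence of a value after its first one
later <- duplicated(a)

# TRUE for every occurrence of a value before its last one:
# reverse, flag duplicates, then reverse the flags back
earlier <- rev(duplicated(rev(a)))

# An element survives only if neither mask flags it,
# i.e. its value occurs exactly once in a
keep <- !(later | earlier)
result <- a[keep]
print(result)
```

Neither mask alone is enough: `later` misses the first 3 and the first 7, and `earlier` misses the last ones. Their union covers every copy of a repeated value.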

Josh O'Brien


Nice. Looks like it is doing exactly what I need. Thank you. I hope it also works on data frames, specifying which column, of course. – Marius Feb 17 '14 at 14:44

Should work if you specify the column, perhaps like this: `df[!(duplicated(df$a) | rev(duplicated(rev(df$a)))), ]`. Just be careful not to use `rev(df)` on the entire data.frame, as that will just reverse the order of the columns (not the rows). – Josh O'Brien Feb 17 '14 at 14:47
OK. Will be careful :] Thank you once again. – Marius Feb 17 '14 at 14:49
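
For the data frame case discussed in these comments, a minimal sketch might look like the following. The data frame `df` and its columns `a` and `label` are hypothetical, and the sketch uses the `fromLast = TRUE` argument of `duplicated()` in place of the double `rev()`, so there is no temptation to reverse the data frame itself.

```r
# Hypothetical example data; the column names are made up
df <- data.frame(a = c(1, 2, 3, 3, 4, 7, 7),
                 label = c("p", "q", "r", "s", "t", "u", "v"),
                 stringsAsFactors = FALSE)

# Flag every row whose value in column a occurs more than once,
# scanning once from the front and once from the back
dup_a <- duplicated(df$a) | duplicated(df$a, fromLast = TRUE)

# Keep only the rows whose value in column a is unique
singles <- df[!dup_a, ]
print(singles)
```

Only the rows with a equal to 1, 2 and 4 remain; both rows with 3 and both rows with 7 are dropped.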

score 4 · Answer 2 · answered Feb 17 '14 at 15:58

I asked myself the same question (and I needed to do it quickly), so I came up with these solutions:

library(rbenchmark)

# 1e6 draws from 1:1e7 with replacement, so some values repeat
u <- sample(x = 1:1e7, size = 1e6, replace = TRUE)

# Each candidate drops every value that occurs more than once in u
s1 <- function() setdiff(u, u[duplicated(u)])
s2 <- function() u[!duplicated(u) & !duplicated(u, fromLast = TRUE)]
s3 <- function() u[!(duplicated(u) | rev(duplicated(rev(u))))]
s4 <- function() u[!u %in% u[duplicated(u)]]
s5 <- function() u[!match(u, u[duplicated(u)], nomatch = 0)]
s6 <- function() u[!is.element(u, u[duplicated(u)])]
# duplicated2() is the helper from Julien Navarre's answer; define it first
s7 <- function() u[!duplicated2(u)]

benchmark(s1(), s2(), s3(), s4(), s5(), s6(), s7(),
          replications = 10,
          columns = c("test", "elapsed", "relative"),
          order = "elapsed")
     test elapsed relative
5 s5()    1.95    1.000
4 s4()    1.98    1.015
6 s6()    1.98    1.015
2 s2()    2.49    1.277
3 s3()    2.92    1.497
7 s7()    3.04    1.559
1 s1()    3.06    1.569
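
Before comparing timings it is worth checking that the candidates really return the same thing. Here is a small sanity check on the vector from the question; the functions are rewritten to take an argument `x` purely for illustration, and `s7` is left out because it depends on a helper defined in another answer.

```r
a <- c(1, 2, 3, 3, 4, 5, 6, 7, 7, 8)

# Same expressions as in the benchmark, parameterised on x
candidates <- list(
  s1 = function(x) setdiff(x, x[duplicated(x)]),
  s2 = function(x) x[!duplicated(x) & !duplicated(x, fromLast = TRUE)],
  s3 = function(x) x[!(duplicated(x) | rev(duplicated(rev(x))))],
  s4 = function(x) x[!x %in% x[duplicated(x)]],
  s5 = function(x) x[!match(x, x[duplicated(x)], nomatch = 0)],
  s6 = function(x) x[!is.element(x, x[duplicated(x)])]
)

# Apply every candidate to a and compare each result with the first one
results <- lapply(candidates, function(f) f(a))
all_same <- all(vapply(results, identical, logical(1), results[[1]]))
print(all_same)
```

The absolute timings above come from one run on one machine, so treat the ranking as indicative and re-run the benchmark on your own data.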

The choice is yours.

Thank you for writing down several ways of achieving this. I find it very useful to see the actual time it takes to complete the task. I am dealing with pretty big data files. — Marius, Feb 20 '14 at 08:47

Julien Navarre · Answer 3 · 2014-02-17T15:00:50.487

0

Here is a solution that flags the duplicated occurrences together with their "original" occurrences (and not only the later occurrences, as duplicated() does).

duplicated2 <- function(x){
  dup <- duplicated(x)
  if (sum(dup) == 0)
    return(dup)
  duplicated(c(x[dup], x))[-(1:sum(dup))]
}


a <- c(1,2,3,3,4,5,6,7,7,8)

> a[!duplicated2(a)]
[1] 1 2 4 5 6 8
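
As a rough trace of what `duplicated2()` does on this vector: the repeated values are prepended to the input, so that when `duplicated()` scans the longer vector even the first "real" occurrence of a repeated value has already been seen. The intermediate names (`repeated`, `padded`, `flags`, `mask`) are only for illustration.

```r
duplicated2 <- function(x){
  dup <- duplicated(x)
  if (sum(dup) == 0)
    return(dup)
  duplicated(c(x[dup], x))[-(1:sum(dup))]
}

a <- c(1, 2, 3, 3, 4, 5, 6, 7, 7, 8)

# Values that occur more than once (one entry per extra copy): 3 and 7
repeated <- a[duplicated(a)]

# Prepend them, so every copy of 3 and 7 in a is preceded by the same value
padded <- c(repeated, a)
flags <- duplicated(padded)

# Drop the flags that belong to the prepended values
mask <- flags[-seq_along(repeated)]
print(mask)
print(identical(mask, duplicated2(a)))
```

The resulting mask is TRUE at positions 3, 4, 8 and 9, which are exactly the two 3s and the two 7s.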

edited Feb 17 '14 at 15:00

answered Feb 17 '14 at 14:45


Thank you, but the solution above is much cleaner. – Marius Feb 17 '14 at 14:50
