
I'd like to remove all items that appear more than once in a vector. The vectors may be character, numeric, or integer. Currently, I'm using duplicated() both forwards and backwards (using the fromLast argument).

Is there a more computationally efficient (faster) way to execute this in R? The solution below is simple enough to write/read, but it seems inefficient to execute the duplicate search twice. Perhaps a counting-based method using an additional data structure would be better?

Example:

d <- c(1,2,3,4,1,5,6,4,2,1)
d[!(duplicated(d) | duplicated(d, fromLast=TRUE))]
#[1] 3 5 6
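
For reference, a rough sketch of the counting-based idea mentioned above, using base R's `table()`; note that `%in%` compares the values after character coercion here, which can be lossy for arbitrary doubles:

cnt <- table(d)                     # frequency of each distinct value
d[d %in% names(cnt)[cnt == 1]]      # keep elements whose value occurs exactly once
#[1] 3 5 6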

Related SO posts here and here.

– Megatron
  • How about the `unique()` function? – Psidom May 10 '16 at 20:45
  • Doesn't work: `unique(d)` yields `1 2 3 4 5 6` instead of the correct `3 5 6` – Megatron May 10 '16 at 20:47
  • 1
    on top of my head: `as.integer(names(table(d)[table(d)==1]))`, not sure if that will actually be more efficient – SeaSprite May 10 '16 at 20:55
  • 1
    This also works, though I'm not sure it is more computationally efficient: `d[!(d %in% d[duplicated(d)])]`. – lmo May 10 '16 at 21:02
  • `setdiff(d, d[duplicated(d)])` , idea borrowed from @lmo – SeaSprite May 10 '16 at 21:06
  • What structure do your actual data have? Small integers like your example? See, e.g., `which(tabulate(d) == 1L)` – alexis_laz May 10 '16 at 21:11
  • Possible duplicate of [Counting the number of elements with the values of x in a vector](http://stackoverflow.com/questions/1923273/counting-the-number-of-elements-with-the-values-of-x-in-a-vector) – Bulat May 10 '16 at 21:12
  • I second the request for a more complete specification of the type of data in the vector. This will impact the optimal strategy if performance really is of interest. – joran May 10 '16 at 21:12
  • @Bulat definitely no – Señor O May 10 '16 at 21:13
  • @Bulat also disagree – Megatron May 10 '16 at 21:14
  • As foggy as the question seems... if you only want the numbers that appear once, using dplyr and a chain: ``data.frame(nth = d, stringsAsFactors = F) %>% count(nth) %>% filter(n == 1) %>% `$`(nth)`` – Carl Boneri May 10 '16 at 21:15
  • 2
    `duplicated` is an internal function and supports long vectors, I dont think calling it twice is as inefficient as you might think. it is faster than using the set operations solutions suggested – rawr May 10 '16 at 21:16
  • 2
    using `duplicated.default` cuts down on time by about 50% if that's the concern – Señor O May 10 '16 at 21:17
  • 1
    @Bulat two answers having something in common doesn't mean the questions are duplicates – Señor O May 10 '16 at 21:18
  • 2
    @alexis_laz works only when d are strictly positive integers, otherwise `{ ud = unique(d); ud[tabulate(match(d, ud)) == 1L] }`. – Martin Morgan May 10 '16 at 21:21
  • 1
    no problems, that is why you need 5 votes – Bulat May 10 '16 at 21:21
  • With integers you can use `tabulate`'s counting; otherwise, I guess, you can't avoid some sort of hashing/sorting – alexis_laz May 10 '16 at 21:27
  • If you have a big dataset, have a look at `data.table` – Bulat May 11 '16 at 08:19
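
Putting the `tabulate` suggestions from the comments above together, a small sketch: the first form assumes `d` holds strictly positive, reasonably small integers; the second, from Martin Morgan's comment, works for general atomic vectors.

# Positive-integer-only shortcut: tabulate() bins by the values themselves
which(tabulate(d) == 1L)
#[1] 3 5 6

# General version: count occurrences of each unique value via match()
ud <- unique(d)
ud[tabulate(match(d, ud)) == 1L]
#[1] 3 5 6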

3 Answers


Some timings:

set.seed(1001)
d <- sample(1:100000, 100000, replace=T)
d <- c(d, sample(d, 20000, replace=T))  # ensure many duplicates
mb <- microbenchmark::microbenchmark(
  d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
  setdiff(d, d[duplicated(d)]),
  {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
  as.integer(names(table(d)[table(d)==1])),
  d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
  d[!(d %in% d[duplicated(d)])],
  { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
  d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
summary(mb)[, c(1, 4)]  # in milliseconds
#                                                                                expr      mean
#1                               d[!(duplicated(d) | duplicated(d, fromLast = TRUE))]  18.34692
#2                                                       setdiff(d, d[duplicated(d)])  24.84984
#3                       {     tmp <- rle(sort(d))     tmp$values[tmp$lengths == 1] }   9.53831
#4                                         as.integer(names(table(d)[table(d) == 1])) 255.76300
#5               d[!(duplicated.default(d) | duplicated.default(d, fromLast = TRUE))]  18.35360
#6                                                      d[!(d %in% d[duplicated(d)])]  24.01009
#7                        {     ud = unique(d)     ud[tabulate(match(d, ud)) == 1L] }  32.10166
#8 d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d,      F, T, NA)))]  18.33475

Given the comments, let's check that they all return the correct result (switching back to the small example vector from the question):

d <- c(1,2,3,4,1,5,6,4,2,1)  # back to the small example vector
results <- list(
  d[!(duplicated(d) | duplicated(d, fromLast=TRUE))],
  setdiff(d, d[duplicated(d)]),
  {tmp <- rle(sort(d)); tmp$values[tmp$lengths == 1]},
  as.integer(names(table(d)[table(d)==1])),
  d[!(duplicated.default(d) | duplicated.default(d, fromLast=TRUE))],
  d[!(d %in% d[duplicated(d)])],
  { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
  d[!(.Internal(duplicated(d, F, F, NA)) | .Internal(duplicated(d, F, T, NA)))]
)
all(sapply(results, all.equal, c(3, 5, 6)))  # compare each result against the expected 3 5 6
# TRUE
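
As suggested in the comments below, it is also worth a quick robustness check on non-numeric input; a small sketch with a character vector (the positive-integer `tabulate(d)` shortcut does not apply here):

dc <- c("a", "b", "c", "a")
dc[!(duplicated(dc) | duplicated(dc, fromLast = TRUE))]   # "b" "c"
setdiff(dc, dc[duplicated(dc)])                           # "b" "c"
ud <- unique(dc); ud[tabulate(match(dc, ud)) == 1L]       # "b" "c"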
– Raad
  • 4
    might as well go all out `d[!(.Internal(duplicated(d, FALSE, FALSE, NA)) | .Internal(duplicated(d, FALSE, TRUE, NA)))]` – rawr May 10 '16 at 21:26
  • 2
    Best to check first for correctness (identity with test case), then robustness (character, integer, numeric vectors, from the updated question; the names(tables(...)) solution fails) and finally timing. – Martin Morgan May 10 '16 at 21:30
  • Yessssss second place – Señor O May 10 '16 at 21:35
  • 2
    Wouldn't it be more clean to benchmark on larger vectors? And, also, `do.call(all.equal, results)` compares only the first two elements of "results"; the other elements are passed as other arguments – alexis_laz May 10 '16 at 21:45
  • So it seems that my solution came first in the end ) – Bulat May 11 '16 at 06:54
  • It is actually misleading; use at least 10^6 as the sample size, and as this is a single vector I would even go for 10^8. The results are too different! – Bulat May 11 '16 at 07:21

You can do this with the `rle` function:

tmp <- rle(sort(d))
res <- tmp$values[tmp$lengths == 1]

The idea is to sort the vector so equal values become adjacent runs, and then keep the values whose run length is 1.
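
For illustration, on the question's example vector sorting groups the equal values into runs:

d <- c(1,2,3,4,1,5,6,4,2,1)
rle(sort(d))
# Run Length Encoding
#   lengths: int [1:6] 3 2 1 2 1 1
#   values : num [1:6] 1 2 3 4 5 6
# values with run length 1 are 3, 5 and 6, matching the expected result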

There are plenty of alternatives here: Counting the number of elements with the values of x in a vector

Edit

After looking at @NBATrends' benchmarks, I got suspicious. In theory, counting items in a single pass should be ~2x faster than the original `duplicated` logic.

I tried doing this with data.table:

library(data.table)
dt <- data.table(d)
res <- dt[, count := .N, by = d][count == 1]$d
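
For example, on the question's small vector this gives the expected result (note that `:=` adds the `count` column to `dt` by reference, so `dt` itself is modified):

library(data.table)
d  <- c(1,2,3,4,1,5,6,4,2,1)
dt <- data.table(d)
dt[, count := .N, by = d][count == 1]$d   # .N is the per-group row count
#[1] 3 5 6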

And here are the benchmarks at different sample sizes for three solutions (I have reduced the comparison to the fast approaches):

[benchmark plot: mean runtime versus sample size for the duplicated, tabulate and data.table approaches, log-log axes]

You can see that as the sample grows, data.table begins to outperform the other methods (by about 2x).

Here is the code to reproduce:

set.seed(1001)
N <- c(3, 4, 5, 6, 7)
n <- 10^N
res <- lapply(n, function(x) {
  d <- sample(1:x / 10, 5 * x, replace = T)   # non-integer values, many distinct
  d <- c(d, sample(d, x, replace = T))        # ensure many duplicates
  dt <- data.table(d)
  mb <- microbenchmark::microbenchmark(
    "duplicated(original)" = d[!(duplicated(d) | duplicated(d, fromLast = TRUE))],
    "tabulate"             = { ud = unique(d); ud[tabulate(match(d, ud)) == 1L] },
    "data.table"           = dt[, count := .N, by = d][count == 1]$d,
    times = 1, unit = "ms")
  sm <- summary(mb)[, c(1, 4, 8)]   # expression, mean, neval
  sm$size <- x
  return(sm)
})

res <- do.call("rbind", res)

require(ggplot2)
ggplot(res, aes(x = size, y = mean, colour = expr)) +
  geom_line() + scale_x_log10() + scale_y_log10() +
  geom_point()
– Bulat

You could use a set operation:

d <- c(1,2,3,4,1,5,6,4,2,1)
duplicates = d[duplicated(d)]
setdiff(d, duplicates)
#[1] 3 5 6

(I'm not certain whether this is more efficient than the code above, but it seems conceptually cleaner.)

– djq