
While comparing the efficiency of two functions in an answer to Check if list contains another list in R, I stumbled upon an interesting result. Sorting greatly increases the efficiency of duplicated when the vector is large. This came as a surprise as I had never noticed a considerable difference in my own work using duplicated. Indeed, for sizes that I work with everyday, there isn't a difference. Observe:

set.seed(1007)
s1 <- sample(10^2, 10^3, replace = TRUE)
s1_sort <- sort(s1)
library(microbenchmark)
microbenchmark(dp=duplicated(s1), dp_sort=duplicated(s1_sort), times=1000)
Unit: microseconds
   expr    min      lq     mean  median      uq      max neval cld
     dp 16.459 16.9425 22.06371 17.2965 22.5050 1541.137  1000   a
dp_sort 17.007 17.5005 25.54953 17.8200 23.3655 1549.198  1000   a

As you can see, there is no noticeable difference in timings when the vector is sorted. However, on very large vectors, the results are much different. Observe:

s2 <- sample(10^6, 10^7, replace = TRUE)
s2_sort <- sort(s2)
microbenchmark(dp=duplicated(s2), dp_sort=duplicated(s2_sort), times=100)
Unit: milliseconds
   expr      min       lq     mean   median       uq       max neval cld
     dp 816.6883 847.9231 869.6829 861.8210 882.3978 1019.6339   100   b
dp_sort 287.6779 305.4779 322.8830 315.1198 324.9249  449.1734   100  a 

Almost 3x faster!!! This led me down the rabbit hole, which began here: r-source.../duplicated.R. From there we see that duplicated makes a call to .Internal(duplicated(x,...)). Then, using pryr::show_c_source(.Internal(duplicated(x))) together with the workaround suggested by @joran (show_c_source is currently giving issues; see Is 'show_c_source()' borken?), we see that duplicated calls do_duplicated. Finally, the heart of duplicated is revealed (it starts at line 667 and ends at line 988). It appears that the entire vector is looped over and then some hashing occurs:

724     /* count unique entries */
725     k = 0;
726     for (i = 0; i < n; i++)
727         if (LOGICAL(dup)[i] == 0)
728             k++;

776     /* Build a hash table, ignoring information on duplication */
777     static void DoHashing(SEXP table, HashData *d)
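For reference, here is a short sketch (my own, using only the steps described above) of how to follow that dispatch chain from the R prompt; the output shown in comments is abridged, and pryr::show_c_source may or may not work, per the issue linked above:

# a quick way to confirm the .Internal dispatch (output abridged)
body(duplicated.default)
# .Internal(duplicated(x, incomparables, fromLast, ...))

# with pryr installed, this attempts to open the matching C source
# (do_duplicated in src/main/unique.c); it may fail as noted above
# pryr::show_c_source(.Internal(duplicated(x)))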

I don't fully understand all of the code, but it seems like sorting shouldn't matter. We loop over the entire vector in either case (sorted vs. non-sorted) and ultimately call an assortment of hash functions, which shouldn't depend on whether a vector is sorted or not. My initial thought was that some sort of branch prediction was going on (see this question), but from the update to this answer, it seems that these things shouldn't matter any more.

What's going on??


EDIT

The gap seems to increase as both the size of the vector and the number of duplicates increase.

set.seed(496)
s3 <- sample(10^6, 10^8, replace = TRUE)
s3_sort <- sort(s3)
microbenchmark(dp=duplicated(s3), dp_sort=duplicated(s3_sort), times = 10)
Unit: seconds
   expr       min        lq      mean    median        uq       max neval cld
     dp 12.149932 12.175665 12.848843 12.495599 12.719861 15.589190    10   b
dp_sort  2.395636  2.401837  2.706674  2.551375  2.677556  4.373653    10  a 

As @alexis_laz pointed out, if there are no duplicates, the impact of sorting is greatly reduced.

s4 <- sample(10^8)
s4_sort <- sort(s4)
microbenchmark(dp=duplicated(s4), dp_sort=duplicated(s4_sort), times = 10)
Unit: seconds
   expr      min       lq     mean   median       uq       max neval cld
     dp 8.013995 8.130565 8.593626 8.197501 8.438703 10.639452    10   b
dp_sort 6.135788 6.158140 6.751101 6.256739 7.241381  8.913507    10  a 
  • I think you're missing the importance of line 717, `dup = Duplicated(x, fL, nmax);` in your ["heart of duplicated" link](https://github.com/wch/r-source/blob/6e7a2ed989027f3800d2e2d64e60e6d700034c6b/src/main/unique.c#L667). This seems to be the call that actually determines the duplicate status of each element. The "count unique entries" is just adding up the results `dup` of the `Duplicated` call. – Gregor Thomas Dec 27 '16 at 21:16
  • Also, the "build a hash table" is the definition of `DoHashing` - it's not necessarily "what happens next", it's just a definition of a function. If you count your curly braces you'll see it's not a part of `do_duplicated`. – Gregor Thomas Dec 27 '16 at 21:17
  • Not sure how relevant, but maybe the way the hash table is accessed internally could play a part? I tried (not sure if I've missed something) to copy/filter some code to return R's internal index when accessing its hash table -- `Rihash = inline::cfunction(sig = c(x = "integer"), body = ' int K = 1; size_t n = 2U * (size_t) LENGTH(x), M = 2; while(M < n) { M *= 2; K++; } SEXP ans = allocVector(INTSXP, LENGTH(x)); for(int i = 0; i < LENGTH(x); i++) INTEGER(ans)[i] = 3141592653U * (unsigned int) (INTEGER(x)[i]) >> (32 - K); return(ans); ')`. (cont..) – alexis_laz Dec 27 '16 at 22:09
  • (..cont) If all done right, the above is the first index accessed, not the access after resolving collisions. Computing `hash_s2 = Rihash(s2); hash_s2_sort = Rihash(s2_sort)` and plotting with something like `matplot(cbind(hash_s2[1:100], hash_s2_sort[1:100]), type = "l")` (for the first few values) it seems (?) indeed that the memory access is smoother for the sorted vector. – alexis_laz Dec 27 '16 at 22:11
  • @alexis_laz, that's great insight!!! It also seems to add to the mystery. – Joseph Wood Dec 27 '16 at 22:22
  • btw, eliminating the duplicates (which are near each other in the sorted vector and result in the identical indices of hash table to be somewhat clustered) `s3 <- sample(10^7); s3_sort = sort(s3)` seems to, indeed, close the gap a bit `microbenchmark::microbenchmark(duplicated(s2), duplicated(s2_sort), duplicated(s3), duplicated(s3_sort), times = 10)` – alexis_laz Dec 27 '16 at 23:37
  • Did you read this: http://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array? – Tonio Liebrand Feb 13 '17 at 18:24
  • @BigDataScientist, yeah that's a great post. I referenced that question along with the accepted answer in the last paragraph before the edit section. – Joseph Wood Feb 13 '17 at 18:39

1 Answer


The major factor is the rate of CPU cache misses and, as the size scales up, the more expensive page faults. Duplication is checked by reference to a simple hash table. If the portion of the hash table being queried is already in the high-speed memory cache, then these lookups are much faster. For small vectors, the corresponding hash table fits entirely into the high-speed memory cache, so the order of access is not significant, which is what you saw in your first benchmark.

For larger vectors, only some blocks of the hash table fit into the cache at any given time. If duplicates are consecutive, then the portion of the hash table needed for lookup will already be in the cache for the subsequent lookups. This is why performance improves as the number of duplicates increases for larger vectors. For extremely large vectors, the hash table may not even fit entirely into available physical memory and may be paged out to disk, making the difference even more noticeable.
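As a side note, the cache effect itself can be seen without duplicated at all. The following is a minimal sketch (not part of the benchmarks below, and timings will vary by machine): gathering values from a large vector with a randomly permuted index is typically noticeably slower than with a sequential index, purely because of cache misses.

# minimal illustration of cache effects: same amount of work, different access order
library(microbenchmark)
n <- 10^7
tab <- integer(n)
idx_seq  <- seq_len(n)          # sequential, cache-friendly access
idx_rand <- sample.int(n)       # random access, many cache misses
microbenchmark(sequential = tab[idx_seq], random = tab[idx_rand], times = 10)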

To test this, let's use the original post's s2 vector and its sorted version, but also a version that keeps the duplicates next to each other while otherwise preserving the unsorted order.

# samples as in original post
s2 <- sample(10^6, 10^7, replace = TRUE)
s2_sort <- sort(s2)

# in the same order as s2, but with duplicates brought together
u2 <- unique(s2)
t2 <- rle(s2_sort)
s2_chunked <- rep(u2, times = t2$lengths[match(u2, t2$values)])
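As a quick sanity check (not in the original answer), we can confirm that s2_chunked really is just a rearrangement of s2:

# s2_chunked holds exactly the same values as s2, only regrouped
stopifnot(identical(sort(s2_chunked), s2_sort))
stopifnot(identical(unique(s2_chunked), u2))  # first appearances keep s2's order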

Let's also consider simply sorting by hash value. I'll approximate the hash computation in R; since we are working with double-precision values here rather than the unsigned integer arithmetic used in the C code, we can't use bitwise operations and instead emulate them arithmetically.

# in the order of hash value
# K is the number of hash bits: the internal table has 2^K = M slots, where M
# is the smallest power of two at least twice the vector length
# (M is computed here only to mirror the C code; it isn't used below)
K <- ceiling(log2(length(s2) * 2))
M <- 2^K
# multiplicative hash mirroring the constant used internally (see the inline C
# code in the comments on the question): %% 2^32 emulates unsigned 32-bit
# overflow, and dividing by 2^(32 - K) emulates the right shift by (32 - K)
h <- ((3141592653 * s2) %% 2^32) / 2^(32 - K)
ho <- order(h)
s2_hashordered <- s2[ho]
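Echoing the matplot experiment from the comments on the question (a rough sketch using the approximation above, not part of the original answer), we can also eyeball the approximate hash indices that the unsorted and sorted vectors would touch:

# approximate first-probe hash indices for the first 100 elements; per the
# comment on the question, the access pattern for the sorted vector looks
# smoother, and its duplicates hit the same bucket back to back
h_unsorted <- floor(((3141592653 * s2)      %% 2^32) / 2^(32 - K))
h_sorted   <- floor(((3141592653 * s2_sort) %% 2^32) / 2^(32 - K))
matplot(cbind(h_unsorted[1:100], h_sorted[1:100]), type = "l",
        ylab = "approximate hash index")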

What we expect to see is that performance is similar for s2_sort and s2_chunked and even better for s2_hashordered. In each of these cases we've attempted to minimize cache misses.

microbenchmark(
 duplicated(s2), 
 duplicated(s2_sort), 
 duplicated(s2_chunked),
 duplicated(s2_hashordered),
 times=10)

Unit: milliseconds
                       expr      min       lq     mean   median       uq      max neval cld
             duplicated(s2) 664.5652 677.9340 690.0001 692.3104 703.8312 711.1538    10   c
        duplicated(s2_sort) 245.6511 251.3861 268.7433 276.2330 279.2518 284.6589    10  b 
     duplicated(s2_chunked) 240.0688 243.0151 255.3857 248.1327 276.3141 283.4298    10  b 
 duplicated(s2_hashordered) 166.8814 169.9423 185.9345 185.1822 202.7478 209.0383    10 a  
  • Thanks for this explanation. In the definition of `s2_hashordered`, what is the intended purpose of `M`, and what is the reason for multiplying `s2` by (a multiple of) `pi`? – egnha Apr 06 '18 at 07:32