
The code below takes about 15 seconds to generate a vector of 10k UUIDs. I will need to generate 1M or more, and I calculate that this will take 15 * 100 / 60 minutes, or about 25 minutes. Is there a faster way to achieve this?

library(uuid)
library(dplyr)
start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), UUIDgenerate )
end_time <- Sys.time()
end_time - start_time  

# Time difference of 15.072 secs

Essentially, I'm searching for an R method that achieves the performance boost described here for Java: Performance of Random UUID generation with Java 7 or Java 6

The UUIDs should be RFC 4122 compliant, but my other requirements are flexible.

Bobby
  • Why are you answering your own question? – Jean Rostan Apr 09 '18 at 16:50
  • @JeanRostan there's nothing wrong with answering your own question; in fact, it's encouraged. – Hong Ooi Apr 09 '18 at 16:53
  • This is common on this site as far as I know. In fact, you'll see the option as a checkbox on the same screen where you can ask a question. In my case, while writing the question, I discovered this answer by trying out something new in my code. I then decided to post the answer anyway in case it's useful for someone else. Feel free to comment here if I've misunderstood the site rules and best practices. – Bobby Apr 09 '18 at 16:54
  • @HongOoi Indeed, I didn't know that since I'm new. Thanks Bobby. Link for reference: https://stackoverflow.com/help/self-answer – Jean Rostan Apr 09 '18 at 16:55
  • `all(seq_along(1:10000) == 1:10000)` ... no need for `seq_along`; it's just slowing you down (it approximately doubles execution time). – r2evans Apr 09 '18 at 17:34

2 Answers


Bottom line up front: no, there is currently no way to speed up bulk generation of UUIDs with uuid without compromising its core premise of uniqueness.

In fact, your suggestion to use use.time=FALSE has serious ramifications on Windows. See below.

It is possible to get faster performance at scale, just not with uuid. See below.

uuid on Windows

Performance of uuid::UUIDgenerate depends on the OS, or more specifically, on its source of randomness. Performance is certainly worth measuring first:

library(microbenchmark)
# r*/s* = replicate/sapply; *f/*t = use.time=FALSE/use.time=TRUE
microbenchmark(
  rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
  rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
  sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
  st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
# Unit: milliseconds
#  expr       min        lq     mean   median       uq      max neval
#    rf  8.675561  9.330877 11.73299 10.14592 11.75467  66.2435   100
#    rt 89.446158 90.003196 91.53226 90.94095 91.13806 136.9411   100
#    sf  8.570900  9.270524 11.28199 10.22779 12.06993  24.3583   100
#    st 89.359366 90.189178 91.73793 90.95426 91.89822 137.4713   100

... so using use.time=FALSE is always faster. (I included the sapply examples for comparison with your answer's code, to show that replicate is never slower. Use replicate here unless you feel you need the numeric argument for some reason.)

However, there is a problem:

R.version[1:3]
#          _                 
# platform x86_64-w64-mingw32
# arch     x86_64            
# os       mingw32           
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 20

Given that a UUID is intended to be unique on each call, this is disturbing; it is a symptom of insufficient randomness on Windows. (Does WSL provide a way out for this? Another research opportunity ...)

uuid on Linux

For comparison, the same benchmark on a non-Windows platform:

microbenchmark(
  rf=replicate(1000, uuid::UUIDgenerate(FALSE)),
  rt=replicate(1000, uuid::UUIDgenerate(TRUE)),
  sf=sapply(1:1000, function(ign) uuid::UUIDgenerate(FALSE)),
  st=sapply(1:1000, function(ign) uuid::UUIDgenerate(TRUE))
)
#  Unit: milliseconds
#   expr       min       lq     mean   median       uq       max neval
#     rf 20.852227 21.48981 24.90932 22.30334 25.11449  74.20972   100
#     rt  9.782106 11.03714 14.15256 12.04848 15.41695 100.83724   100
#     sf 20.250873 21.39140 24.67585 22.44717 27.51227  44.43504   100
#     st  9.852275 11.15936 13.34731 12.11374 15.03694  27.79595   100

R.version[1:3]
# _
# platform x86_64-pc-linux-gnu
# arch     x86_64
# os       linux-gnu
length(unique(replicate(1000, uuid::UUIDgenerate(TRUE))))
# [1] 1000
length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))
# [1] 1000

(I'm slightly intrigued by the fact that use.time=FALSE on Linux takes twice as long as on Windows ...)

UUID generation with a SQL server

If you have access to a SQL server (you almost certainly do ... see SQLite ...), then you can deal with this scale problem by employing the server's implementation of UUID generation, recognizing that there are some slight differences.

(Side note: there are "V4" (completely random), "V1" (time-based), and "V1mc" (time-based, including the system's MAC address) UUIDs. uuid gives V4 if use.time=FALSE and V1 otherwise, encoding the system's MAC address.)
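As a quick sanity check on which version you are getting (using only the uuid calls already shown): RFC 4122 puts the version number in the first hex digit of a UUID's third group, i.e. character 15 of the string.

substr(uuid::UUIDgenerate(use.time=FALSE), 15, 15)   # version nibble
# [1] "4"
substr(uuid::UUIDgenerate(use.time=TRUE), 15, 15)
# [1] "1"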

Some performance comparisons on Windows (all times in seconds):

#         n  uuid postgres sqlite sqlserver
# 1     100     0     1.23   1.13      0.84
# 2    1000  0.05     1.13   1.21      1.08
# 3   10000  0.47     1.35   1.45      1.17
# 4  100000  5.39     3.10   3.50      2.68
# 5 1000000 63.48    16.61  17.47     16.31

The use of SQL has some overhead, but it does not take much scale to overcome it.
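For reference, here is a sketch of how one row of that table can be timed; it assumes `pgcon` is an open DBI connection to PostgreSQL with uuid-ossp installed (see the first bullet below), and exact numbers will of course vary by machine.

n <- 100000
system.time(replicate(n, uuid::UUIDgenerate(TRUE)))   # in-R generation
system.time(DBI::dbGetQuery(pgcon, sprintf(           # server-side, one round-trip
  "select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n)))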

  • PostgreSQL needs the uuid-ossp extension, installable with

    CREATE EXTENSION "uuid-ossp"
    

    Once installed/available, you can generate n UUIDs with:

    n <- 3
    pgcon <- DBI::dbConnect(...)
    DBI::dbGetQuery(pgcon, sprintf("select uuid_generate_v1mc() as uuid from generate_series(1,%d)", n))
    #                                   uuid
    # 1 53cd17c6-3c21-11e8-b2bf-7bab2a3c8486
    # 2 53cd187a-3c21-11e8-b2bf-dfe12d92673e
    # 3 53cd18f2-3c21-11e8-b2bf-d3c64c6ad73f
    

    Other UUID functions exist; see https://www.postgresql.org/docs/9.6/static/uuid-ossp.html

  • SQLite includes only limited ability to do this, but the following hack works well enough to produce n V4-style UUIDs:

    sqlitecon <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") # or your own
    DBI::dbGetQuery(sqlitecon, sprintf("
            WITH RECURSIVE cnt(x) as (
              select 1 union all select x+1 from cnt limit %d
            )
            select (hex(randomblob(4))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(2))||'-'||hex(randomblob(6))) as uuid
            from cnt", n))
    #                                   uuid
    # 1 EE6B08DA-2991-BF82-55DD-78FEA48ABF43
    # 2 C195AAA4-67FC-A1C0-6675-E4C5C74E99E2
    # 3 EAC159D6-7986-F42C-C5F5-35764544C105
    

    This takes a little pain to format identically to uuid's output, a nicety at best; you might find small performance improvements by not clinging to this format. (A formatting sketch follows this list.)

  • SQL Server requires temporarily creating a table (with a newsequentialid() default), generating a sequence into it, pulling the automatically-generated IDs, and discarding the table. A bit over-the-top, especially considering the ease of using SQLite for it, but YMMV. (A rough sketch follows.)
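For completeness, a rough sketch of that SQL Server sequence. It is untested against a live server; the temp-table name #uuids and the recursive-CTE counter are just one way to do it, and sqlcon is assumed to be an open DBI connection to SQL Server.

n <- 3
sqlcon <- DBI::dbConnect(...)   # your SQL Server connection
# temp table whose ids come from a newsequentialid() column default
DBI::dbExecute(sqlcon, "create table #uuids (uuid uniqueidentifier default newsequentialid(), i int)")
# insert n rows; each row triggers the default
DBI::dbExecute(sqlcon, sprintf("
  with cnt(x) as (select 1 union all select x+1 from cnt where x < %d)
  insert into #uuids (i) select x from cnt option (maxrecursion 0)", n))
DBI::dbGetQuery(sqlcon, "select uuid from #uuids")
DBI::dbExecute(sqlcon, "drop table #uuids")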
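And if you do want the SQLite output above formatted like uuid's, here is a hypothetical post-processing helper (my naming) that lowercases the hex and forces the V4 version/variant nibbles:

as_v4ish <- function(x) {
  x <- tolower(x)           # uuid emits lowercase hex
  substr(x, 15, 15) <- "4"  # version nibble
  substr(x, 20, 20) <- sample(c("8", "9", "a", "b"), length(x), replace = TRUE)  # variant nibble
  x
}
as_v4ish("EE6B08DA-2991-BF82-55DD-78FEA48ABF43")
# e.g. "ee6b08da-2991-4f82-95dd-78fea48abf43"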

Other considerations

In addition to execution time and sufficient randomness, there are various discussions around (uncited for now) suggesting that non-consecutive UUIDs have performance impacts on database tables. This has to do with index pages and such, and is outside the scope of this answer.

However, assuming this is true, and assuming that rows inserted at around the same time (temporally correlated) are often grouped together (directly or in sub-groups), it is a good thing to keep same-day data with UUID keys in the same DB index page. V4 (completely random) UUIDs may therefore decrease DB performance with large groups (and large tables). For this reason, I personally prefer V1 over V4.

Other (still uncited) discussions consider including a directly-traceable MAC address in the UUID to be a slight breach of internal information. For this reason, I personally lean towards V1mc over V1.

(But I don't yet have a way to do this well with RSQLite, so I'm reliant on having PostgreSQL nearby. Fortunately, I use PostgreSQL enough for other things that I keep an instance around with Docker on Windows.)

r2evans
  • The answer looks quite complete, thanks! Do you have any experience with the Byte Code Compiler and do you think it might help here? https://channel9.msdn.com/Events/useR-international-R-User-conferences/useR-International-R-User-2017-Conference/Taking-Advantage-of-the-Byte-Code-Compiler – Bobby Apr 10 '18 at 21:34
  • I've used it, and I'm not confident that it will give you any speedups using `uuid`: the byte-compiler works on R code, but C code is almost always faster. The reason that `UUIDgenerate` is so slow on Windows is low available entropy. Though you might come up with something fast in (say) `Rcpp`, the `libuuid` library goes to significant effort to ensure uniqueness, something your custom code would have to duplicate. I think the best place for a speedup would be if `libuuid` (not `uuid`) allowed a request for `n` UUIDs at once instead of looping. (And better entropy in Windows.) – r2evans Apr 11 '18 at 13:51

Providing the option use.time explicitly will significantly speed up the process. It can be set to either TRUE or FALSE to determine whether the UUIDs are time-based or not. In both cases, generation is significantly faster than when the option is left unspecified.

For 10k UUIDs,

library(uuid)
library(dplyr)

start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(FALSE) )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01399994 secs

start_time <- Sys.time()
temp <- sapply( seq_along(1:10000), function(ign) UUIDgenerate(TRUE)  )
end_time <- Sys.time()
end_time - start_time
# 10k: 0.01100016 secs

Even scaling up to 100M still gives a faster run-time than the original 15 seconds (which was for only 10k).

start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(FALSE)  )
end_time <- Sys.time()
end_time - start_time
# 100M: 1.154 secs

start_time <- Sys.time()
temp <- sapply( seq_along(1:100000000), function(ign) UUIDgenerate(TRUE)  )
end_time <- Sys.time()
end_time - start_time
# 100M: 3.7586 secs
Bobby
  • Is that supposed to be `use.time=FALSE` instead of `use.time, FALSE`? – r2evans Apr 09 '18 at 17:30
  • If this is on Windows, you're outta luck, unfortunately. Since Windows does not have (easy) access to `/dev/urandom`, it does not have a "sufficiently-random" source. (It is enlightening to see `length(unique(replicate(1000, uuid::UUIDgenerate(FALSE))))`, showing `uuid` on Windows will always be slow if uniqueness is required, which it should be. Discussion on github: https://github.com/s-u/uuid/issues/2.) – r2evans Apr 09 '18 at 17:40
  • Bobby, please see my answer. The recommendation to use `use.time=FALSE` has serious consequences for Windows users and should (in my opinion) be avoided there. – r2evans Apr 09 '18 at 18:29
  • @r2evans Thanks for your detailed answer. I'll study it tomorrow when I'm back at my Windows computer at work. I do have one question at the moment, in case you can help. I was able to execute the code above without problems and got those dramatic performance differences. Then, upon restarting my Windows computer, I ran the same code and got this error: `Error in lapply(X = X, FUN = FUN, ...) : object use.time not found`. I must have made another mistake, or perhaps this is related to your `use.time=FALSE` suggestion. I thought my way above was the correct syntax for `sapply`. – Bobby Apr 09 '18 at 19:28
  • @r2evans And when I switch to `use.time= FALSE`, I get a different error: `Error in FUN(X[[i]], ...) : unused argument (X[[i]])` – Bobby Apr 09 '18 at 19:33
  • Looking at `args(UUIDgenerate)`, sapply is not suitable. If you insist on using it, an anonymous function will make it workable, as in r2e's answer. Your code should also give an error (try it in a new session) but doesn't, because you apparently have an object called `use.time` in the global environment. – Frank Apr 09 '18 at 19:38
  • The `unused argument` error makes sense: you're doing it wrong. The problem is that you are implicitly providing the number (from `1:100..`) as the first argument. If you do `sapply(1:100, function(ign) UUIDgenerate(FALSE))`, it'll work. See the sample code in my answer. (This might be a misunderstanding of using `sapply` with anonymous and named functions.) – r2evans Apr 09 '18 at 19:46
  • Bobby, realize that `sapply(1:10, func)` is identical to `sapply(1:10, function(i) func(i))`. If you want to ignore the sequenced number, your three options are: (1) write the function to explicitly ignore it, not possible if using a pre-written function; (2) explicitly ignore it with `sapply(1:10, function(ign) func())`; or (3) use `replicate(10, func())`. The third is much clearer in its intention to ignore the number, and so is often preferred. (Note that `replicate` calls `sapply` internally.) – r2evans Apr 09 '18 at 19:49
  • Side notes: (1) you don't need `library(dplyr)`; (2) you can use `system.time(...)` instead of the `Sys.time(); ...; Sys.time()` triple-call. – r2evans Apr 09 '18 at 19:54