
I would like to generate unique IDs for rows in my database. I will be adding entries to this database on an ongoing basis so I'll need to generate new IDs in tandem. While my database is relatively small and the chance of duplicating random IDs is minuscule, I still want to build in a programmatic fail-safe to ensure that I never generate an ID that has already been used in the past.

For starters, here are some sample data that I can use to start an example database:

library(tidyverse)
library(ids)
library(babynames)
    
database <- data.frame(rid = random_id(5, 5), first_name = sample(babynames$name, 5))

print(database)
          rid first_name
1  07282b1da2      Sarit
2  3c2afbb0c3        Aly
3  f1414cd5bf    Maedean
4  9a311a145e    Teriana
5  688557399a    Dreyton

And here is some sample data that I can use to represent new data that will be appended to the existing database:

new_data <- data.frame(first_name = sample(babynames$name, 5))

print(new_data)

 first_name
1    Hamzeh
2   Mahmoud
3   Matelyn
4    Camila
5     Renae

Now, what I want is to bind a new column of randomly generated IDs using the random_id function while simultaneously checking to ensure that newly generated IDs don't match any existing IDs within the database object. If the generator created an identical ID, then ideally it would generate a new replacement until a truly unique ID is created.

Any help would be much appreciated!

UPDATE

I've thought of a possibility that helps but is still limited. I could generate new IDs and then use a for() loop to test whether any of the newly generated IDs are present in the existing database. If so, I would regenerate a new ID. For example...

new_data$rid <- random_id(nrow(new_data), 5)

for(i in 1:nrow(new_data)){
  if(new_data$rid[i] %in% unique(database$rid)){
    new_data$rid[i] <- random_id(1, 5)
  }
}

The problem with this approach is that a single check isn't enough: a regenerated ID could itself collide with the database, so I would need an endless stream of nested if statements to keep testing each newly generated value against the original database. I need a process that keeps testing until it produces a truly unique value that is not found in the original database.
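Sketching the idea (untested; a while in place of the if, so each ID is regenerated until it no longer collides with the database, though this still wouldn't catch two new IDs colliding with each other):

```r
new_data$rid <- random_id(nrow(new_data), 5)

for (i in seq_len(nrow(new_data))) {
  # regenerate until this ID is absent from the existing database
  while (new_data$rid[i] %in% database$rid) {
    new_data$rid[i] <- random_id(1, 5)
  }
}
```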

Trent
  • Maybe you're looking for the `ids::uuid()` function. https://cran.r-project.org/web/packages/ids/vignettes/ids.html – manotheshark Sep 29 '20 at 22:00
  • @manotheshark, I don't see how the `ids::uuid()` function would solve my problem. It's just a function for generating a different type of ID and it wouldn't necessarily check newly generated IDs against a preexisting vector of IDs. – Trent Sep 29 '20 at 22:11

2 Answers


Use of ids::uuid() would likely preclude having to check for duplicate id values at all. In fact, if you were to generate 10 trillion uuids, there would be roughly a 0.00000006 chance of two uuids being the same, per What is a UUID?
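For example (the values shown will differ on every run):

```r
library(ids)

# Four random version-4 UUIDs; the collision probability is negligible
uuid(4)
```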

Here is a base R function that will quickly check for duplicate values without needing to do any iteration:

anyDuplicated(1:4)
[1] 0

anyDuplicated(c(1:4,1))
[1] 5

The first result above shows there are no duplicate values. The second shows that element 5 is a duplicate, as 1 appears twice. Below is how to check without iterating; new_data$rid was seeded with a copy of database$rid so that all five start out as duplicates. The loop repeats until every rid is unique, but note that it presumes the existing database$rid values are already unique.

library(ids)
set.seed(7)
new_data$rid <- database$rid  # start with all-duplicate ids to demonstrate
repeat {
  # index of the first duplicate in the combined vector (0 if none)
  duplicates <- anyDuplicated(c(database$rid, new_data$rid))
  if (duplicates == 0L) {
    break
  }
  # offset by nrow(database) to address the duplicate within new_data
  new_data$rid[duplicates - nrow(database)] <- random_id(1, 5)
}

All new_data$rid have been replaced with unique values.

rbind(database, new_data)

          rid first_name
1  07282b1da2      Sarit
2  3c2afbb0c3        Aly
3  f1414cd5bf    Maedean
4  9a311a145e    Teriana
5  688557399a    Dreyton
6  52f494c714     Hamzeh
7  ac4f522860    Mahmoud
8  ffe74d535b    Matelyn
9  e3dccc4a8e     Camila
10 e0839a0d34      Renae
manotheshark
  • While this is interesting and poses a *workaround* to my solution, it doesn't necessarily solve the original intent of the problem I posed. You've provided an opportunity to lessen the chance of collisions and replacing duplicates. But, the point is to generate new IDs while excluding old ones. Your process still does not capture the case where a replaced duplicate also matches a previously generated ID. – Trent Sep 30 '20 at 15:15
  • The IDs I generate will be frequently typed by other people. So, I have to compromise a bit and decrease the length of the unique ID, increasing the chances of collision in ID generation. This means that I really need a process that is somewhat iterative in fashion to validate new IDs against an old vector of previously generated IDs and create new ones until a truly unique ID is created. – Trent Sep 30 '20 at 15:20
  • @Craig why let the user create an ID if you're going to programmatically replace it if there is a duplicate entry? Everything beyond the first paragraph is focused on your question. The final example was expanded to repeat until there are no duplicate entries in `new_data$rid` – manotheshark Sep 30 '20 at 16:35
  • I never said the user would create the IDs, only that they would be typed frequently. Thank you for expanding your answer; this is exactly what I need. – Trent Sep 30 '20 at 18:29

This answer is inspired by @manotheshark's answer, with 2 major changes:

  1. It's a function.
  2. I changed the mechanism of replacing the duplicates. Instead of looping and replacing one duplicate per iteration as in @manotheshark's answer, here I replace them in larger chunks.

library(ids)

generate_random_unique_ids <- function(n) {
  vec_ids <- ids::random_id(n = n, bytes = 4, use_openssl = FALSE)
  repeat {
    duplicates <- duplicated(vec_ids)
    if (!any(duplicates)) {
      break
    }
    vec_ids[duplicates] <- ids::random_id(n = sum(duplicates), bytes = 4, use_openssl = FALSE)
  }
  vec_ids
}
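A quick sanity check of the function above (the ids themselves differ per run, but length and uniqueness are guaranteed by construction):

```r
ids_vec <- generate_random_unique_ids(10)

length(ids_vec)        # 10
anyDuplicated(ids_vec) # 0, i.e. no duplicates
```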

Some example timings:

library(tictoc)

tic()
v_1e6 <- generate_random_unique_ids(1e6)
toc()
#> 7.14 sec elapsed

tic()
v_3e7 <- generate_random_unique_ids(3e7)
toc()
#> 296.42 sec elapsed

Would love to learn if there's a way to optimize this function to get speedier execution times.

Emman