0

I'm looking for a method to generate unique strings as primary keys for my database. I know the collision probability of GUID is very low, but I'm wondering if it is possible to use it in order to get keys that are 100% unique (99.99 uniqueness is not enough ;)

I'm using ASP.NET Core v5, Entity Framework Core v5.0.1, SQL Server.

Note that I want un-guessable ID-s (identity(1,1) is not a solution for me at the moment)

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
Dorin Baba
  • 1,578
  • 1
  • 11
  • 23
  • 7
    Unique in a single table? Add a unique constraint, and loop if a collision occurs, which it won't - job done. Unique in the world? No. – Marc Gravell Dec 28 '20 at 21:03
  • GUID is enough. do not worry about it. – urlreader Dec 28 '20 at 21:04
  • 3
    Guaranteed unique and un-guessable are *two completely separate things*. Unique means not equal to anything else. Un-guess-able is of course *impossible*, but the way to make something the hardest to guess is to make it cryptographically random, which, by definition, will never be 100% unique (but you can pretty easily make the likelihood lower than the likelihood of the planet exploding, which is probably actually good enough for you, even if you think otherwise). – Servy Dec 28 '20 at 21:05
  • @Servy I mean I could use for example identity(1,1) and all of the ID will be 100% unique, the downsize is that it will be easy to guess those IDs. I need something that works as identity (will never repeat themselves) but without a guessable pattern ;) – Dorin Baba Dec 28 '20 at 21:08
  • @DorinBaba You can DES your bigint autoincrement identity. DES will map any 8 bytes to another 8 bytes in a reversible manner. I named DES because it has 8 bytes block. DES is a bijiective function, so no collisions. – xanatos Dec 28 '20 at 21:11
  • 1
    @DorinBaba Yes, and I'm telling you that you have to choose one or the other. The way to make it 100% guaranteed always unique is to use a predictable pattern that you can be sure doesn't repeat. The way to make something as hard to guess as possible is to use a cryptographic strength random number generator, which is never *technically* impossible to repeat, just an arbitrarily low possibility. That is unless you have a single entity responsible for verifying uniqueness by comparing against every other existing value (i.e. a unique constraint on a column full of cryptographically random data. – Servy Dec 28 '20 at 21:11
  • @xanatos But if anyone knows that that's your strategy they can pretty easily guess the values by hashing values up from zero. If this is a security system, is should be safe even if malicious users know how it works. – Servy Dec 28 '20 at 21:12
  • @Servy If you want total security, you can use AES, but the block is 16 bytes. – xanatos Dec 28 '20 at 21:13
  • @xanatos But if you're just incrementing values then they don't need to reverse engineer the encoded value, they just start with 0 themselves and increment, hashing as they go, getting valid values every time. – Servy Dec 28 '20 at 21:14
  • @Servy You didn't comprehend... When you insert a new record you ask the SQL the next number in BIGINT sequence X (we will call it N), then you encrypt N with DES/AES, generating a new number N2 (technically a sequence of bytes, but still...). Then you use N2 as primary key. Yes, you are using CBC that will reduce the security of everything... – xanatos Dec 28 '20 at 21:16
  • 1
    @xanatos You didn't comprehend. If I want to guess the value of an item in your sequence, or the next value to be created, I just start counting from 0, generating the values of N, hash the values myself, generating N2, and know I know the value that I wasn't supposed to be able to guess and I didn't need to try all of the options. – Servy Dec 28 '20 at 21:19
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/226530/discussion-between-xanatos-and-servy). – xanatos Dec 28 '20 at 21:20
  • Whatever, I wouldn't mix unique key creation with unique "secret" creation. Use key values that can easily be indexed and sorted, store the "secret" in another column. – Gert Arnold Dec 28 '20 at 21:34

2 Answers2

4

The probability of a collision with a random GUID is 0, for all intents and purposes. Even if you invented a true 100% collision-free ID, the probability of a collision wouldn't be any lower in practice, because the probability of there being a bug in your ID generator or a glitch in your computer hardware caused by a cosmic ray that would produce a collision despite your generated ID would be just as significant as the chance of a GUID collision.

To estimate the probability of a GUID collison, take n to be the number of rows in your database. A random GUID has m = 122 random bits, so the probability of at least one collision in your database is

p(n) = 1 - (1-1/m)(1-2/m)...(1-(n-1)/m)
     ≈ n^2 / (2m)

Suppose n = 1,000,000,000. In that case

p(n) ≈ (10^9)^2 / (2 * 2^122)
     ≈ 9.4 x 10^-20

The probability of having a RAM error (even with ECC) within 72 hours is astronomically higher!

So the answer is: a GUID is as collision-safe as you can possibly get on a real physical computer in a real physical universe.

Mo B.
  • 5,307
  • 3
  • 25
  • 42
  • No. [Birthday Paradox. With 2^64 guid you have a 50% of having a collision](https://stackoverflow.com/a/185332/613130) – xanatos Dec 28 '20 at 21:26
  • @xanatos The formula I provided above *is* the approximation for the birthday problem. – Mo B. Dec 28 '20 at 21:27
  • @xanatos See here: https://en.wikipedia.org/wiki/Birthday_problem#Square_approximation – Mo B. Dec 28 '20 at 21:28
  • The birthday paradox is around 50% when you are around sqrt(total cases), sqrt(2^128) = 2^64, that is lower than the atoms in the universe (10^78 to 10^82 atoms) – xanatos Dec 28 '20 at 21:28
  • @xanatos That does depend on whether users can generate arbitrary numbers of values, or if they're simply trying to guess at a small (ish) number of values. If there are a few million or billion values that users are trying to guess, rather than users generating an arbitrarily large number of new values, all of which need to be unique and unguessable, then yeah, they could get up to 2^64 values and collisions become a real issue. It seems like the former is the use case though. – Servy Dec 28 '20 at 21:33
  • 2
    Of course it also depends on the quality of the generator. My guid generator has 100% chance of collisions :). – Gert Arnold Dec 28 '20 at 21:37
  • And in any case the probability small, and a duplicate can be absolutely prevented with a unique index. So in the _unlikely_ event a duplicate is generated, it's a simple runtime error. – David Browne - Microsoft Dec 29 '20 at 00:38
  • So I've been lucky enough (unlucky depending on your point of view) to see two separate GUID collisions in two seperate production systems at different companies. One of those systems made the assumption that a collision can never occur, so we had a downtime event and data corruption which had to be manually repaired. It's an exceptionally rare event but it can happen. – Sean Amos Feb 06 '21 at 19:15
1

The only way to GUARANTEE uniqueness without a sequential identity column is to look for the ID in the table first, any generate a new ID if it already exists. That's not a free operation, of course, so you need to weigh the performance hit of that check versus the (infinitesimally small) probability of a collision. You might be better off just trying an insert, and if a collision occurs, catching the error and trying again.

D Stanley
  • 149,601
  • 11
  • 178
  • 240
  • Well, the reason I've asked this question is to avoid this step. I guess I'm not the only one having this problem :) I think I will go with this approach. Thank you! – Dorin Baba Dec 28 '20 at 21:30
  • @DorinBaba As I explained below, this doesn't really lower the probability of there being a collision (unless your database is astronomically large). It certainly doesn't *guarantee* uniqueness since you may still have a bug in your collision checking code, or there may be a bug in the database engine or in the hardware. Since all these probabilities are higher than the chance of a GUID collision, the total probability of a collision doesn't actually decrease with this approach. – Mo B. Dec 28 '20 at 21:41