
There are many systems that depend on the uniqueness of some particular value. Anything that uses GUIDs comes to mind (e.g. the Windows registry or other databases), but also things that create a hash from an object to identify it and thus need this hash to be unique.

A hash table usually doesn't mind if two objects have the same hash, because the hash is only used to sort objects into categories (buckets): on lookup, not all objects in the table, but only those in the same bucket, have to be compared for identity with the searched object.
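For illustration, here is a minimal sketch of that bucket idea (a hypothetical `Bucketed` class, not any particular library's implementation); colliding hashes are harmless because lookup falls back to an equality check within the bucket:

```python
class Bucketed:
    """Toy hash table: hashes only pick the bucket, equality decides identity."""

    def __init__(self, num_buckets=64):
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket(self, key):
        return self.buckets[hash(key) % len(self.buckets)]

    def insert(self, key, value):
        bucket = self._bucket(key)
        for i, (k, _) in enumerate(bucket):
            if k == key:  # equality check, not just hash equality
                bucket[i] = (key, value)
                return
        bucket.append((key, value))

    def lookup(self, key):
        # Two keys may share a hash (and thus a bucket) yet still differ,
        # so every candidate in the bucket is compared for equality.
        for k, v in self._bucket(key):
            if k == key:
                return v
        raise KeyError(key)
```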

Other implementations, however, (seem to) depend on the uniqueness. My example (it's what led me to ask this) is Mercurial's revision IDs. An entry on the Mercurial mailing list correctly states:

> The odds of the changeset hash colliding by accident in your first billion commits is basically zero. But we will notice if it happens. And you'll get to be famous as the guy who broke SHA1 by accident.

But even the tiniest probability doesn't mean impossible. Now, I don't want an explanation of why it's totally okay to rely on the uniqueness (this has been discussed here, for example). This is very clear to me.

Rather, I'd like to know (maybe by means of examples from your own work):

  • Are there any best practices as to covering these improbable cases anyway?

  • Should they be ignored, because it's more likely that particularly strong solar winds lead to faulty hard disk reads?

  • Should they at least be tested for, if only to fail with a "I give up, you have done the impossible" message to the user?

  • Or should even these cases get handled gracefully?

For me, especially the following are interesting, although they are somewhat touchy-feely:

  • If you don't handle these cases, what do you do against gut feelings that don't listen to probabilities?

  • If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

balpha
  • There is also a non-zero probability that you quantum-tunnel through your chair and fall on the floor, but putting a pillow underneath is overkill. It strongly depends on what you are doing. If you are developing a tunnel microscope, then the unexpected and improbable is what you want to handle (especially because at that scale it is no longer negligible). It's technically more probable to face out-of-memory cases than SHA collisions, but I've never seen code that handles OOM seriously. – Stefano Borini Apr 07 '10 at 05:29
  • Here, for example, is a case where Microsoft checking for collisions in the GUID space caused a bug in SQL Server that had to be hotfixed in Windows 2000. – corprew Apr 07 '10 at 05:16
  • The recent [OpenSSL vulnerability](http://www.ubuntu.com/usn/usn-612-1) would probably have been detected much earlier if developers had included some testing code. Obviously it shouldn't attempt to run through all possible sources, but you would get a pretty good idea of the odds if it runs a million iterations without warning. Knowledge is better than faith. – l0b0 Jul 08 '09 at 13:02

2 Answers

> If you do handle them, how do you justify this work (to yourself and others), considering there are more probable cases you don't handle, like a supernova?

The answer to that is that you aren't testing to spot a GUID collision occurring by chance. You're testing to spot a GUID collision occurring because of a bug in the GUID code, or because a precondition the GUID code relies on has been violated (or you've been tricked into violating it by some attacker), such as the V1 assumptions that MAC addresses are unique and that time moves forward. Either is considerably more likely than supernova-based bugs.
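As a sketch of what such a check might look like (the `IdStore` class here is hypothetical, not any particular library's API), the point is not to survive a chance collision but to surface a broken generator loudly:

```python
import uuid

class IdStore:
    """Hypothetical store that refuses to silently overwrite on an ID collision."""

    def __init__(self):
        self.objects = {}

    def add(self, obj):
        new_id = str(uuid.uuid4())
        if new_id in self.objects:
            # Far more likely a bug (e.g. an RNG seeded identically in two
            # processes) or a violated precondition than a 1-in-2**122 accident.
            raise RuntimeError(f"GUID collision on {new_id}: suspect the generator")
        self.objects[new_id] = obj
        return new_id
```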

However, not every client of the GUID code should be testing its correctness, especially in production code. That's what unit tests are supposed to do, so trade off the cost of missing a bug that your actual use would catch but the unit tests didn't, against the cost of second-guessing your libraries all the time.

Note also that GUIDs only work if everyone who is generating them co-operates. If your app generates the IDs on machines you control, then you might not need GUIDs anyway - a locally unique ID like an incrementing counter might do you fine. Obviously Mercurial can't use that, hence it uses hashes, but eventually SHA-1 will fall to an attack that generates collisions (or, even worse, pre-images), and they'll have to change.
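A minimal sketch of such a locally unique allocator (a hypothetical class; in practice a database sequence plays the same role across machines):

```python
import itertools
import threading

class LocalIdAllocator:
    """Hands out IDs unique within this process - no GUIDs needed."""

    def __init__(self):
        self._counter = itertools.count(1)
        self._lock = threading.Lock()

    def next_id(self):
        with self._lock:  # serialize callers so no two ever share an ID
            return next(self._counter)
```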

If your app generates non-hash "GUIDs" on machines you don't control, like clients, then forget about accidental collisions: you should be worried about deliberate collisions by malicious clients trying to DoS your server. Protecting yourself against that will probably protect you against accidents anyway.

> Or should even these cases get handled gracefully?

The answer to this is probably "no". If you could handle colliding GUIDs gracefully, like a hashtable does, then why bother with GUIDs at all? The whole point of an "identifier" is that if two things have the same ID, then they're the same. If you don't want to treat them the same, and only want to direct them initially into buckets like a hashtable does, then use a different scheme (like a hash).

Steve Jessop
  • +1 Interesting, I hadn't even considered a bug as a reason for collision. – balpha Jul 08 '09 at 13:17
  • MAC addresses aren't always unique either; there have been cases where a bunch of cheap knockoffs had the same MAC address. – Brad Gilbert Jul 08 '09 at 16:44
  • +1 In the vast majority of cases a collision on a 128 bit hash is far more likely to be a bug or attack than an accidental collision. – Aaron Maenpaa Jul 08 '09 at 19:38
  • @Brad: I've heard of hardware that allows changing the MAC address. In any case, you can configure virtual interfaces with any MAC. If you are running a bunch of virtual machines, all with the same interface setup for whatever sadistic reason, you are going to see the same MAC address on different machines. – Stefano Borini Apr 07 '10 at 05:25
  • Good analysis, but I disagree somewhat with your conclusion. 4 bits are used for the GUID version field and 2 bits are used for the variant field, so you really have only 122 bits. With 2^122, the possibility of GUID collisions still seems low until you consider a very large population of users (say 500,000,000 Facebook users, or 1B Google users) who may be performing actions that are resulting in the creation of new GUIDs at a rate of 1-2 per second. –  Sep 16 '10 at 08:07
  • Over the period of just 2 years, your system has only an 86% chance of no collisions, after 4 years you're already down to 50% chance of no collisions. I exaggerated this example so it might seem contrived but it shows that the scale of large websites today (esp. social portals) may be entering a regime where GUIDs are not as useful as everyone thinks. –  Sep 16 '10 at 08:07
  • Additionally, if the web moves in the direction of large scale real-time personal data collection and/or continuous streaming of biometric sensor data, this increased rate of GUID generation times some huge number of users means your graph gets shifted even further to the left. I like the elegance that GUIDs provide, but I predict that in the next decade, for new emerging applications, they will have outlived their utility. –  Sep 16 '10 at 08:07

Given a good 128-bit hash, the probability of colliding with a specific hash value given a random input is:

1 / 2**128, which is approximately equal to 3 * 10**-39.

The probability of seeing no collisions (p) given n samples can be computed using the logic used to explain the birthday problem.

p = (2**128)! / ((2**128)**n * (2**128 - n)!)

where ! denotes the factorial function. We can then plot the probability of no collisions as the number of samples increases:

![Probability of a random SHA-1 collision as the number of samples increases](http://img21.imageshack.us/img21/9186/sha1collision.png)

Between 10**17 and 10**18 hashes we begin to see non-trivial probabilities of collision, from 0.001% to 0.14%, and finally 13% with 10**19 hashes. So in a system with a million billion records, counting on uniqueness is probably unwise (and such systems are conceivable), but in the vast majority of systems the probability of a collision is so small that you can rely on the uniqueness of your hashes for all practical purposes.
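The exact factorial form above is intractable for N = 2**128, but the standard exponential approximation to the birthday problem, P(collision) ≈ 1 - exp(-n(n-1)/(2N)), closely reproduces those figures; a small sketch (the function name is mine):

```python
import math

N = 2.0 ** 128  # size of the hash space

def collision_probability(n):
    # Birthday-problem approximation: P(no collision) ~= exp(-n*(n-1) / (2*N))
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * N))

for n in (1e17, 1e18, 1e19):
    print(f"{n:.0e} hashes -> {collision_probability(n):.3%} chance of a collision")
# 1e+17 hashes -> 0.001% chance of a collision
# 1e+18 hashes -> 0.147% chance of a collision
# 1e+19 hashes -> 13.665% chance of a collision
```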

Now, theory aside, it is far more likely that collisions could be introduced into your system either through bugs or through someone attacking it, so Steve Jessop's answer provides good reasons to check for collisions even though the probability of an accidental collision is vanishingly small (that is to say, the probability of bugs or malice is much higher than that of an accidental collision).

Aaron Maenpaa