Collision probability of ObjectId vs UUID in a large distributed system

Question

Considering that an UUID rfc 4122 (16 bytes) is much larger than a MongoDB ObjectId (12 bytes), I am trying to find out how their collision probability compare.

I know that is something around quite unlikely, but in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

Compared to the normal case where all ids are generated by a small number of clients:

It might take months to detect a collision since the document creation
IDs are generated from a much larger client base
Each client has a lower ID generation rate

Why are you allowing mobile clients to create the ObjectIds, or any permanent Id if you're concerned data integrity? — WiredPrairie, Mar 24 '14 at 10:43
The clients might be offline and store information which might not be synced for a long time. I do not want to force a 100% online mobile app — SystematicFrank, Mar 24 '14 at 10:45
@WiredPrairie most **client** library implementations create the `_id` value by default. Not saying it is a "good idea" to connect directly. But "ObjectId" generation is perfectly valid. — Neil Lunn, Mar 24 '14 at 10:47
Personally, I wouldn't build or design a system that allowed clients to do this. I'd assign them temporary Ids when offline. I'd consider no different than expecting a client doesn't directly write to MongoDb without going through a data validation layer. — WiredPrairie, Mar 24 '14 at 10:48
@NeilLunn-I know most clients do that by default. An offline client could still create them, but they would need to be validated before inserting into the collection. — WiredPrairie, Mar 24 '14 at 10:51
I would like to avoid not using ObjectId since MongoDB optimizes this data type. Sometimes it even required them the past (Aggregation framework?) — SystematicFrank, Mar 24 '14 at 10:57
I really like the content from @mnemosyn, if for not much else than it does make the **monotonic** point that is inherent in the design of the spec. As such I also found this a *fair* question to ask and a very valid consideration for others to see in the future. The overall point says the "collision factor" is a like more than "quite unlikely". — Neil Lunn, Mar 24 '14 at 11:01
All this made me reconsider the role of UUIDs for offline clients. @WiredPrairie comment on temporal ids waiting for validation layer seems to be better future proof than just relying on UUIDs, but also such a pain to implement... well, partition tolerance was never a piece of cake. Thanks for the "birthday problem" mention. — SystematicFrank, Mar 24 '14 at 11:09
@SystematicFrank - Why not allow the clients to cache the entry with their own id system, then when they're back on line, write the cache to the remote and let the remote decide the id, that way no collision risk. I don't think there's a way to guarantee no collisions. — Scuba Steve, Mar 16 '22 at 20:33

score 38 · Accepted Answer · answered Mar 24 '14 at 10:46

in my case most ids will be generated within a large number of mobile clients, not within a limited set of servers. I wonder if in this case, there is a justified concern.

That sounds like very bad architecture to me. Are you using a two-tier architecture? Why would the mobile clients have direct access to the db? Do you really want to rely on network-based security?

Anyway, some deliberations about the collision probability:

Neither UUID nor ObjectId rely on their sheer size, i.e. both are not random numbers, but they follow a scheme that tries to systematically reduce collision probability. In case of ObjectIds, their structure is:

4 byte seconds since unix epoch
3 byte machine id
2 byte process id
3 byte counter

This means that, contrary to UUIDs, ObjectIds are monotonic (except within a single second), which is probably their most important property. Monotonic indexes will cause the B-Tree to be filled more efficiently, it allows paging by id and allows a 'default sort' by id to make your cursors stable, and of course, they carry an easy-to-extract timestamp. These are the optimizations you should be aware of, and they can be huge.

As you can see from the structure of the other 3 components, collisions become very likely if you're doing > 1k inserts/s on a single process (not really possible, not even from a server), or if the number of machines grows past about 10 (see birthday problem), or if the number of processes on a single machine grows too large (then again, those aren't random numbers, but they are truly unique on a machine, but they must be shortened to two bytes).

Naturally, for a collision to occur, they must match in all these aspects, so even if two machines have the same machine hash, it'd still require a client to insert with the same counter value in the exact same second and the same process id, but yes, these values could collide.

Yeah... If I hadn't taken the time to grab that coffee... :( — mnemosyn, Mar 24 '14 at 10:49
The mobile clients will not have direct access to the database, indeed, they can run even without a connection to it. However each mobile client will have to upload documents to the main database sooner or later. — SystematicFrank, Mar 24 '14 at 10:51
To be fair I'm sure I poured wine in my time-zone. Doesn't matter as long as the point gets across. — Neil Lunn, Mar 24 '14 at 10:56
There are perfectly valid cases for generating IDs from a client, and it doesn't imply access to the database at all. If you're doing this you definitely must not use ObjectIds, since it will have serious collision issues if you have tens or hundreds of thousands of clients generating them. I don't trust ObjectIds, since it's too easy to find cases, even if they require specific conditions, where collisions can occur. — Glenn Maynard, Sep 17 '14 at 16:48
@mnemosyn Helpful answer, however, I don't understand why 1k inserts/s per process can already be a problem. You would think that the counter increments by 1 on each "request" within the same second and resets to zero at the beginning of the next second. But with 3 bytes, you can represent much larger numbers than 1k. What am I missing here? — alapeno, Nov 28 '15 at 16:34

score 15 · Answer 2 · answered Mar 24 '14 at 10:44

Let's look at the spec for "ObjectId" from the documentation:

Overview

ObjectId is a 12-byte BSON type, constructed using:

a 4-byte value representing the seconds since the Unix epoch,

a 3-byte machine identifier,

a 2-byte process id, and

a 3-byte counter, starting with a random value.

So let us consider this in the context of a "mobile client".

Note: The context here does not mean using a "direct" connection of the "mobile client" to the database. That should not be done. But the "_id" generation can be done quite simply.

So the points:

Value for the "seconds since epoch". That is going to be fairly random per request. So minimal collision impact just on that component. Albeit in "seconds".
The "machine identifier". So this is a different client generating the _id value. This is removing possibility of further "collision".
The "process id". So where that is accessible to seed ( and it should be ) then the generated _id has more chance of avoiding collision.
The "random value". So another "client" somehow managed to generate all of the same values as above and still managed to generate the same random value.

Bottom line is, if that is not a convincing enough argument to digest, then simply provide your own "uuid" entries as the "primary key" values.

But IMHO, that should be a fair convincing argument to consider that the collision aspects here are very broad. To say the least.

The full topic is probably just a little "too-broad". But I hope this moves consideration a bit more away from "Quite unlikely" and on to something a little more concrete.

Collision probability of ObjectId vs UUID in a large distributed system

2 Answers2

Overview

Linked