1

My web apps run database-agnostically, either on MongoDB, or any SQL database

I want a single strategy for generating all the unique IDs in the whole system. User IDs, messages, forum posts, chat messages — everything — and I want the IDs to provide zero information (eg, no timestamps)

My current plan:

  • generate random bits with a crypto-secure function
  • use 256 bits for enough entropy to avoid collisions — probability chart on wikipedia
  • represent these IDs as 64-character hexadecimal strings in app code
  • use hex instead of base64 to avoid most naughty words
    • also without word-break characters, hex is more easily selectable by double-clicking

Example ID: 402208a6d3295aad235c68cb20a35c30e835344bbc40fb398744c593b6aea076

My questions:

  • are these IDs too long, perhaps causing unnecessary performance problems?
  • are these IDs too short, perhaps encountering collisions that might cause bugs?
    • under some circumstances, I could imagine needing to create many billions of objects!
  • should I switch to base64 or base58 format, and just let users cope when naughty or obscene words appear in their user IDs?
    • in terms of user-experience, are compact IDs worth the inevitable unfortunate words?
    • should I invent my own compact encoding to avoid naughty words, perhaps using only numbers plus uppercase and lowercase consonant letters (no vowels)
  • in MongoDB terms, what's the performance difference between storing and indexing these IDs as strings versus BinData?

I was hoping to gain some different perspectives about this general problem, because once I deploy my solution here, it would surely be very painful to go back and revise these decisions!

ChaseMoskal
  • 7,151
  • 5
  • 37
  • 50

2 Answers2

2

Given the criteria you presented, I think you can probably safely use a UUID/GUID.

For example, the number of random version-4 UUIDs which need to be generated in order to have a 50% probability of at least one collision is 2.71 quintillion ...

This number is equivalent to generating 1 billion UUIDs per second for about 85 years. A file containing this many UUIDs, at 16 bytes per UUID, would be about 45 exabytes.

Thus, the probability to find a duplicate within 103 trillion version-4 UUIDs is one in a billion.

Source: Wikipedia

UUID will work well with both SQL and MongoDB (the MongoDB ObjectId does have a timestamp so that's not great for your situation.)

There are many ways to generate a UUID:

Justin Jenkins
  • 26,590
  • 6
  • 68
  • 1,285
  • the standard "uuid" format would be a downgrade to user experience, because uuid's hyphens would act as word-break characters, making them harder for users to double-click to copy-paste. additionally, uuid's weird legacy format wastes bits that do not contribute to the entropy of the id — i would however, like to understand if there are any upsides whatsoever to using uuid's as opposed to my current implementation which generates 256 random bits and represents them as hex characters – ChaseMoskal Mar 12 '21 at 06:07
  • 1
    @ChaseMoskal for "presentation" you don't technically need the hyphens. You could even remove them and then programmatically add them back in for ingestion into whatever API you have going. I would strongly suggest you use something standardized like a UUID ... as there are so many database and code libraries already highly optimized for UUID's. – Justin Jenkins Mar 12 '21 at 06:12
  • i don't want any bespoke logic to "fix" UUIDs shortcomings and pitfalls, i think i want more entropy, and i don't want any wasted characters — you mentioned that databases are performance-optimized for UUIDs, but surely that just means they are stored as binary blobs, which i can easily do too... right? — is there some upside to UUID that i might be missing? – ChaseMoskal Mar 12 '21 at 06:57
  • 1
    UUIDs don't waste bytes and have no pitfalls. If you dislike the hyphens, a simple `replace` when you query will get rid of them. – Laurenz Albe Mar 12 '21 at 07:06
  • given that i'm already leaning towards using a more compact text representation for the IDs than hex, UUID seems even less likely to be a good option — i'm mostly concerned about the amount of entropy, which maybe i can tone it down from 256 bit to the UUID-equivalent 128 bit — i'm also worried about any performance gotchyas in storing and indexing these values, of it storing them as straight strings has only a negligible performance penalty, as it would be much simpler for my application code to store them as simple strings – ChaseMoskal Mar 12 '21 at 07:19
  • @LaurenzAlbe — and you are incorrect, unfortunately, UUIDs *do waste bits* — `a random version-4 UUID will have 6 predetermined variant and version bits, leaving 122 bits for the randomly generated part` https://en.wikipedia.org/wiki/Universally_unique_identifier#Version_4_(random) – ChaseMoskal Mar 12 '21 at 07:22
  • Looks like I was wrong. Then cut out those bits. Or use the fractional part of `random()`. – Laurenz Albe Mar 12 '21 at 07:52
  • @ChaseMoskal so, just a suggestion: Your main concerns seem to be around two very separate things. Presentation and optimization. The presentation part can be really whatever you’d like. You can covert this all this to emoji practically speaking and have very solid tests to confirm everything is right. For storage however it probably makes the most sense to use a format that it widely supported instead of rolling your own ‘solution’ ... – Justin Jenkins Mar 12 '21 at 07:59
  • @LaurenzAlbe — using the fractional part of `random()` isn't a good option, because Math.random is not cryptographically secure — instead, i'm generating random bytes directly (both in node and in browser with their separate apis for doing so) – ChaseMoskal Mar 12 '21 at 08:10
  • @JustinJenkins — yes, there are two main aspects affecting the decision for my system's IDs (a) performance and (b) human readability — in my question, i enumerated four specifics i'm asking for advice about, all of which so far remain unanswered — it's not an easy decision.. of course, i want the IDs to be short, so 128 bit is desirable, but i'm seeing some articles claim 128 bits can be too small.. i had chosen 256 bits, but it's possible 160 bits might be a more reasonable middle-ground, which would be merely 24 chars long if i concoct a base48 format without vowels or ambiguous characters – ChaseMoskal Mar 12 '21 at 08:15
  • @ChaseMoskal If you already know the solution, why do you ask at all? What is Math.random, and what does it have to do with the PostgreSQL function `random()`? – Laurenz Albe Mar 12 '21 at 08:25
  • 1
    @ChaseMoskal keep it simple. You can always expand on the simple ... but you will likely always regret the overly complex nonstandard ... “solutions”. – Justin Jenkins Mar 12 '21 at 08:29
  • @LaurenzAlbe — i'm not using postgres exclusively, so i can't rely on postgres functionality, although i do need to know if postgres will index binary blobs efficiently — and i do not have the solution, i merely have an interim plan for which i'm seeking advice — although this little discussion is helping solidify my thoughts altogether :) – ChaseMoskal Mar 12 '21 at 08:44
  • @JustinJenkins — `You can always expand on the simple` — as mentioned in the question, i'm under the impression it might not be very easy or painless to change IDs in the system after launch — so i want to be extra careful about this decision and get it right from the get-go — so far, i'm leaning towards either 160-bit or 256-bit randomly-generated IDs, encoded in a custom base43 format without vowels or ambiguous characters to avoid some poor users getting stuck with obvious naughty words in their immutable user ids :) let me know if anybody can think of any downsides to the approach – ChaseMoskal Mar 12 '21 at 08:49
1

See my section on unique identifiers for advice on generating unique identifiers in your application.

Asking yourself the following questions can help here:

  1. Can the application easily check identifiers for uniqueness within the desired scope and range (e.g., check whether a file or database record with that identifier already exists)?
  2. Can the application tolerate the risk of generating the same identifier for different resources?
  3. Do identifiers have to be hard to guess, be simply "random-looking", or be neither?
  4. Do identifiers have to be typed in or otherwise relayed by end users?
  5. Is the resource an identifier identifies available to anyone who knows that identifier (even without being logged in or authorized in some way)?
  6. Do identifiers have to be memorable?

In particular:

  • Are other users allowed to access the resource identified by the ID, whenever they know the ID? If not, then additional access control or a long ID length will be necessary.
  • Can your application tolerate the risk of duplicate IDs? If so, then the IDs can be completely randomly generated (as you're doing now, for example). If not, then your goal will be harder to achieve, especially for keys intended for security purposes; see this question, for example.
  • If you want IDs that have to be typed in by end users, you should consider choosing a character set carefully or allowing typing mistakes to be detected.
Peter O.
  • 32,158
  • 14
  • 82
  • 96