
I've encountered some code that generates a number of UUIDs via UUID.randomUUID(), takes the last 7 hex digits of each (the random bits of recent UUID versions are uniformly distributed, so the truncated portion should be too), and uses that as a key to insert rows into a database.

I wondered what the probability of collision was. I remembered the Birthday Problem. This is an instance of that problem, is it not? Instead of 365 days in a year, there are 16^7 possible strings. Then according to that Wikipedia page, the probability of collision given n strings is

p(n; d) ≈ 1 - ((d-1)/d)^(n(n-1)/2)

where d is 16^7.

A colleague asserted that they were able to insert 50,000 rows without an issue. I doubted this, because the probability of collision with n=50,000 and d=16^7 is

1-((16^7-1)/16^7)^(50000*(50000-1)/2) = 99.05%
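
(Double-checking that arithmetic with a few lines of Java; this is just the formula above, nothing specific to our service:)

public class BirthdayMath {
    public static void main(String[] args) {
        double d = Math.pow(16, 7); // 268,435,456 possible 7-hex-digit strings
        long n = 50_000;
        double pairs = (double) n * (n - 1) / 2; // n(n-1)/2 pairs of codes
        double p = 1 - Math.pow((d - 1) / d, pairs);
        System.out.println(p); // prints ~0.9905
    }
}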

But then I checked our database. Indeed the 50,000 inserts succeeded.

How was that possible? (Besides that they got really lucky.) Am I misapplying the Birthday Problem?


Edit

Here's a minimal test that shows there should have been collisions.

import java.util.UUID;
import java.util.HashSet;

public class MyClass {
    public static void main(String[] args) {
        final int N = 50000;
        for (int trial = 0; trial < 10; trial++) {
            HashSet<String> strings = new HashSet<>();
            boolean success = true;
            for (int i = 0; i < N; i++) {
                // same truncation as the production code: last 7 hex digits
                String uuid = UUID.randomUUID().toString();
                String id = uuid.substring(uuid.length() - 7);
                if (strings.contains(id)) {
                    success = false; // collision
                    break;
                }
                strings.add(id);
            }
            System.out.println(success);
        }
    }
}

For N=50,000, 10 out of 10 attempts had a collision—which would be expected of a 99% collision rate. For N=10,000, I see 0, 1, 2, or 3 out of 10 attempts with collisions, which all fall within expectations for a 17% collision rate.
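
(To check "fall within expectations" more concretely: treating the 10 runs as independent trials that each collide with probability 0.17, seeing 0 to 3 collisions covers most of the probability mass. A quick sketch:)

public class BinomialCheck {
    public static void main(String[] args) {
        double p = 0.17; // collision probability per run at N=10,000
        int runs = 10;
        double cumulative = 0;
        for (int k = 0; k <= 3; k++) {
            // P(exactly k of 10 runs collide) = C(10, k) * p^k * (1-p)^(10-k)
            cumulative += choose(runs, k) * Math.pow(p, k) * Math.pow(1 - p, runs - k);
        }
        System.out.println(cumulative); // prints ~0.93
    }

    static double choose(int n, int k) {
        double c = 1;
        for (int i = 1; i <= k; i++) c = c * (n - i + 1) / i;
        return c;
    }
}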

Andrew Cheong
  • ...no idea. One fun test to try would be to throw up a quick code sample that uses `UUID.randomUUID()` to generate 50,000 7-digit IDs. Do you see collisions 99% of the time you run this program? Perhaps the pseudo-random-number generator behind `UUID.randomUUID()` is skewing the probability somehow. – Scotty Jamison Apr 24 '23 at 19:49
  • I did a quick test and got 49993 uniques. Can you share a [mcve]? – shmosel Apr 24 '23 at 19:59
  • Are you sure you are calculating the formula right, and using the right formula at that? 16^7 = 268435456, which is way more than 50000. You should not have too many collisions. Unlike the traditional birthday problem, where the number of days is just 365 and the number of people is usually higher. – aled Apr 24 '23 at 20:00
  • Given that your arithmetic is correct, there are four plausible explanations: (1) you got lucky, (2) the IDs are not actually truncated to 7 digits, (3) the uniqueness constraint is not being applied properly, (4) the code doing the insertions had some logic to check and retry when it generated a duplicate. There is not really any way for us to know what happened for sure with just the information in your question. – kaya3 Apr 24 '23 at 20:00
  • This site agrees with your 99.05% calculation https://www.bdayprob.com/ https://i.stack.imgur.com/HvtnG.png – Martin Smith Apr 24 '23 at 20:40
  • @aled *16^7=268435456, which is way more than 50000.* True, but it's not way more than 50000 *squared*, which is what matters. The probability of a collision is approximately `1 - exp(-n^2/2d)` - see https://en.wikipedia.org/wiki/Birthday_problem#Approximations – slothrop Apr 24 '23 at 20:47
  • That proves that we cannot trust common sense when it comes to cryptography – aled Apr 24 '23 at 20:48
  • @aled yes indeed, that's why they call it a paradox :) It makes mathematical sense but it surprises our everyday intuitions – slothrop Apr 24 '23 at 20:51
  • 99% is a fairly high probability, but a 1-in-100 chance is still fairly plausible. Given that there doesn't seem to be any alternative explanation (especially since example runs were made that produced similar probabilities), and assuming you've read the code right and there's nothing else funky going on, you may just have to say that this was crazy luck. But if you don't fix the issue soon, you're going to run into a collision fairly quickly - chances of collisions just get higher the more you stick in there! – Scotty Jamison Apr 24 '23 at 20:59
  • (or, maybe collisions have occurred - do you have a way of knowing when a collision has happened and caused an insertion to fail?) – Scotty Jamison Apr 24 '23 at 21:00
  • Stepping back a bit, if the requirement is to give each DB row a key which is guaranteed unique but doesn't need to be meaningful - the DBMS should be able to take care of that for you if you define the column appropriately. – slothrop Apr 24 '23 at 21:03
  • Thanks to everyone for the thoughts and ideas. I've edited in a mini test. I inherited this service recently (the successful 50,000-insert happened 3 months ago) and we don't keep GCP logs more than 30 days, unfortunately. This code is used for generating 1-time use promo codes, _e.g._ SUMMER23-65ACFF9, so it needs some randomization; it can't just be an incremented field in the DB. I checked the code in GitHub history back to January; I see nothing that would have aided the success. But of course I agree with everyone that I haven't given enough info for a definitive answer; I was looking for ideas. – Andrew Cheong Apr 24 '23 at 21:16
  • Reproduced in Python: `from uuid import uuid4; [len({uuid4().hex[:7] for i in range(50000)}) for i in range(100)]`. I got one collision-free set in 200 trials. – BoppreH Apr 24 '23 at 21:23
  • I finally have an explanation. I've self-answered this question. Thank you to everyone who contributed possibilities (some of you _e.g._ @kaya3 hit the nail on the head or were very close). – Andrew Cheong Apr 25 '23 at 09:37
  • Your image is black on a dark gray background (in dark mode). It's nearly impossible to read. – Jorn Apr 25 '23 at 11:55
  • @Jorn - Oh, thanks for pointing that out. I copied the image directly from Wikipedia. It must have been a transparent PNG. I've replaced it with a screenshot. – Andrew Cheong Apr 25 '23 at 12:24
  • @AndrewCheong If you want to get fancy, you can use Google's allegedly-deprecated Charts API too. Short version of what you used, [here](https://chart.googleapis.com/chart?cht=tx&chl=p%28n%3Bd%29%20%5Capprox%201%20-%20%5C%28%5Cfrac%7Bd-1%7D%7Bd%7D%5C%29%5E%7B%5Cfrac%7Bn%28n-1%29%7D%7B2%7D%7D) – General Grievance Apr 26 '23 at 20:47

3 Answers


Finally I have an explanation. Thank you everyone for helpful ideas.

tl;dr - I assumed there must have been a unique constraint at the time of the insert, since there were indeed 50,000 distinct codes in the database. But it turned out there was no constraint at that time. There were indeed duplicates among the original 50,000; a month later they were found and fixed via a one-time SQL statement (by appending a "1" to each duplicate).

Full explanation:

This code was about generating one-time use promo codes such as SUMMER23-65ACFF9. This is how I reasoned that indeed 50,000 distinct codes had been inserted into the database:

[Screenshot: a query confirming 50,000 distinct promo codes in the table]

This table didn't have a timestamp field (e.g. last_modified), or I might have had a hint sooner. I only knew that the batch of 50,000 promo codes was inserted on January 6th, 2023, 3 months ago.

[Screenshot: record showing the batch of promo codes was inserted on January 6th, 2023]

I browsed the repo at the last commit before January 6th, 2023 to see if something about the code back then might have allowed 50,000 codes to succeed. The relevant code was this:

[Screenshot: the insertion code at that commit, including the calls to .rollback()]

I believed that, because of the calls to .rollback(), the insert of 50,000 rows was being done atomically. In other words, if a single insert failed, all the inserts up until then should have been rolled back.
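
(To be concrete about the pattern I mean, here is a rough JDBC sketch; the table and column names are assumptions, not the actual service code:)

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.List;

public class BatchInsertSketch {
    // All-or-nothing: if any insert fails, roll back the whole batch.
    static void insertAll(Connection conn, List<String> codes) throws SQLException {
        conn.setAutoCommit(false);
        try (PreparedStatement ps = conn.prepareStatement(
                "INSERT INTO promo_codes (promo_code_id) VALUES (?)")) {
            for (String code : codes) {
                ps.setString(1, code);
                ps.executeUpdate();
            }
            conn.commit(); // all 50,000 rows commit together
        } catch (SQLException e) {
            conn.rollback(); // a single duplicate-key failure undoes everything
            throw e;
        }
    }
}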

(So one possibility was that my colleague(s) just kept retrying and retrying until they hit the 1% jackpot. But when I asked them, that didn't seem to be the case. They did not recall having to retry at all.)

I did wonder if the constraint for unique promo_code_id existed at the time of insertion. I checked that it did exist:

[Screenshot: the unique constraint on promo_code_id, as it exists in the database today]

I was hoping to find a timestamp for when this constraint was created but didn't immediately see one. I should have pursued that a little further, but I didn't, because I had already queried for a count of distinct promo_code_ids and gotten 50,000 (see first screenshot above). Constraint or no constraint, the probability of ending up with 50,000 distinct codes was still less than 1%. This is where I made a faulty assumption: that the codes hadn't been tampered with since.

Today I came across a Liquibase XML change in February (a month after the "successful" 50,000 inserts) where we apparently added the constraint:

[Screenshot: the Liquibase changeset from February adding the unique constraint, along with its accompanying SQL]

But notice the SQL that comes with that change, besides adding the unique constraint. We essentially ran a script to add a "1" to the end of all duplicate codes. So that is how the "insert of 50,000 codes without duplicates" succeeded: it didn't. There was no constraint in the first place, and the data was corrected afterward.
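
(The actual statement is the SQL in that changeset; as a rough in-memory illustration of the same rule, with made-up codes:)

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DedupSketch {
    public static void main(String[] args) {
        List<String> codes = new ArrayList<>(List.of("65ACFF9", "1B2C3D4", "65ACFF9"));
        Set<String> seen = new HashSet<>();
        for (int i = 0; i < codes.size(); i++) {
            // keep the first occurrence; append "1" to any later duplicate
            if (!seen.add(codes.get(i))) {
                codes.set(i, codes.get(i) + "1");
            }
        }
        System.out.println(codes); // [65ACFF9, 1B2C3D4, 65ACFF91]
    }
}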

Andrew Cheong
  • Please use text, not images, for what can be given by text. https://meta.stackoverflow.com/a/285557/3404097 Comments are ephemeral, please put what is relevant from comments into posts. – philipxy Apr 28 '23 at 03:35

Running the exact code you pasted, I get false, 10 times. Which means: yes, collisions occur, quite reliably. Running the adjusted code below, which prints the index at which the first collision occurs, I get:

7685
27479
15262
46177
18297
17230
14050
12091
39249
8921

Which seems to match the math; that's about what you'd expect. (These numbers mean, for example, that the 39249th ID added collided with an already existing ID.)
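
(For reference: when drawing uniformly at random from d values, the expected position of the first collision is about sqrt(pi*d/2), which for d = 16^7 comes to roughly 20,500, right in line with the runs above.)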

I assume that you have asked this question because you didn't actually run this code on your machine or misinterpreted what 'false' means.

If you truly run the EXACT code you pasted and you get anything but 10x false (well, 1 in 100 or so would be a true by happenstance), then something very, very bizarre is going wrong, most likely related to exactly how random number generation works on your bizarro JVM. But that sounds very farfetched indeed.

Here is the slightly adjusted code to print point-of-duplication:

        for (int trial = 0; trial < 10; trial++) {
            HashSet<String> strings = new HashSet<>();
            int i = 0;
            while (true) {
                i++;
                String uuid = UUID.randomUUID().toString();
                String id = uuid.substring(uuid.length() - 7);
                if (strings.contains(id)) {
                    // print the index at which the first collision occurred
                    System.out.println(i);
                    break;
                }
                strings.add(id);
            }
        }

Some explanations for what your friend has observed. These are wild stabs in the dark:

  • They do have collisions and misconstrued the situation: Whatever checks they did to ensure there are no collisions was done erroneously.
  • Their code will pick up on a collision and 'reroll the dice' (a sketch of this pattern follows the list). In that case your application will grind to a halt over time: as you get closer and closer to 16^7 entries in the database, the number of rerolls on the random UUID until you luck into one that doesn't collide grows exponentially near the end. Your friend didn't tell you this, or forgot that their code does this.
  • The query as designed doesn't actually insert the last 7 characters of the UUID, but instead inserts the whole UUID.
  • The query as designed doesn't actually insert the last 7 characters of the UUID, but instead the db default which is e.g. an auto increment (a db SEQUENCE).
  • The UUID generator your friend uses isn't a 'normal' one (that randoms it up) but a modified variant; various such variants exist; they could for example have their own UUID generator that just generates 000001, 000002, and so on. In which case you won't get a collision until 16^7 records are added.
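
(Regarding the 'reroll the dice' possibility above, a minimal sketch of that pattern, using a HashSet where the real code would consult the database; this is hypothetical, not the friend's actual code:)

import java.util.HashSet;
import java.util.Set;
import java.util.UUID;

public class RerollSketch {
    public static void main(String[] args) {
        Set<String> taken = new HashSet<>(); // stands in for the DB's unique column
        for (int i = 0; i < 50_000; i++) {
            String id;
            do {
                String uuid = UUID.randomUUID().toString();
                id = uuid.substring(uuid.length() - 7);
            } while (!taken.add(id)); // on a duplicate, reroll and try again
        }
        System.out.println("inserted " + taken.size() + " unique ids"); // always 50000
    }
}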
rzwitserloot
  • Er, why would I write that code and then not run it? My last paragraph describes the output I witnessed. I have also shared more details in the comments under the question. And at this point I've not only run it on my machine, but on the Kubernetes instance that the production service runs on. (I wondered if the UUID generation was sensitive to hardware / MAC address / # of CPUs etc.) I'm here precisely because the situation is bizarre that I know I must be wrong about something. Thanks for all the suggestions though—they give me some more ideas to test. – Andrew Cheong Apr 25 '23 at 08:08
  • Your first "wild stab" was spot on. I've self-answered this question. Thanks for your ideas. – Andrew Cheong Apr 25 '23 at 09:38
  • I think I misread your question and thought that you interpreted your own code experiment as indicating no collisions occurred; you added it to show that in your code collisions DO occur, and you were wondering why the DB (at the time you wrote the question) appeared not to have any. I apologize for not properly reading that last paragraph! – rzwitserloot Apr 25 '23 at 13:20
  • Ah I see, yeah I can see how it would be read as example code showing the negative not the positive. I've added a caption to clear that up. Thanks for explaining! – Andrew Cheong Apr 25 '23 at 14:23

Interesting statistics - I'm missing the "business-side view" in this discussion. Assume that 2% - 5% of targeted customers actually use the promo code: what is the chance of a customer getting a "your promo code was already used" message AND going to the competition? The rate of customer loss is probably not worth the effort of all this complex dual-platform uniqueness enhancement. From a statistical viewpoint: appending a "1" at the DB level won't fix a triple occurrence, but as stated above, it won't matter.

OldFrank
  • Yup, these are the practical considerations of the real world. It turned out adding a "1" was effective enough, because that would make an 8-digit code while all other codes were 7-digit. Since it could not collide with any of the 7-digit codes, it had a very, very good chance of succeeding, which it did, but your caution is valid. Interestingly, in this case it wasn't the customers that reported issues but account managers trying to create tens of thousands of codes at once and receiving a status code 500. I am now in talks with Product Managers to increase to 10 digits. So it goes. – Andrew Cheong Apr 26 '23 at 03:24