18

I'm using the following perl code to generate random alphanumeric strings (uppercase letters and numbers, only) to use as unique identifiers for records in my MySQL database. The database is likely to stay under 1,000,000 rows, but the absolute realistic maximum would be around 3,000,000. Do I have a dangerous chance of 2 records having the same random code, or is it likely to happen an insignificantly small number of times? I know very little about probability (if that isn't already abundantly clear from the nature of this question) and would love someone's input.

perl -le 'print map { ("A".."Z", 0..9)[rand 36] } 1..6'
Tom Zych
  • 13,329
  • 9
  • 36
  • 53
Nick
  • 1,311
  • 2
  • 10
  • 27
  • 7
    Why can't you just use an auto increment field? – murgatroid99 Sep 29 '11 at 00:19
  • 5
    If for some reason auto increment doesn't work, consider using a UUID instead. Those are designed to be random identifiers with a minimal chance of collision. https://metacpan.org/module/Data::UUID – Schwern Sep 29 '11 at 07:24
  • If you create a unique index in the database, you'll get an exception from DBI when a collision occurs. You could catch the exception, generate another code, and try inserting again. Of course that won't be particularly efficient when you used many of the available codes. – Grant McLean Oct 02 '11 at 06:18

6 Answers6

50

Because of the Birthday Paradox it's more likely than you might think.

There are 2,176,782,336 possible codes, but even inserting just 50,000 rows there is already a quite high chance of a collision. For 1,000,000 rows it is almost inevitable that there will be many collisions (I think about 250 on average).

I ran a few tests and this is the number of codes I could generate before the first collision occurred:

  • 73366
  • 59307
  • 79297
  • 36909

Collisions will become more frequent as the number of codes increases.

Here was my test code (written in Python):

>>> import random
>>> codes = set()
>>> while 1:
        code=''.join(random.choice('1234567890qwertyuiopasdfghjklzxcvbnm')for x in range(6))
        if code in codes: break
        codes.add(code)

>>> len(codes)
36909
Mark Byers
  • 811,555
  • 193
  • 1,581
  • 1,452
  • what if the length is not fixed? lets say it could return only 1 character, could we use formula from birthday paradox to calculate that? – buncis Jul 11 '22 at 12:57
16

Well, you have 36**6 possible codes, which is about 2 billion. Call this d. Using a formula found here, we find that the probability of a collision, for n codes, is approximately

1 - ((d-1)/d)**(n*(n-1)/2)

For any n over 50,000 or so, that's pretty high.

Looks like a 10-character code has a collision probability of only about 1/800. So go with 10 or more.

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
Tom Zych
  • 13,329
  • 9
  • 36
  • 53
8

Based on the equations given at http://en.wikipedia.org/wiki/Birthday_paradox#Approximation_of_number_of_people, there is a 50% chance of encountering at least one collision after inserting only 55,000 records or so into a universe of this size:

http://wolfr.am/niaHIF

Trying to insert two to six times as many records will almost certainly lead to a collision. You'll need to assign codes nonrandomly, or use a larger code.

  • what if the length is not fixed? lets say it could return only 1 character, could we use formula from birthday paradox to calculate that? if not then what formula that we could use to calculate that? – buncis Jul 11 '22 at 12:57
8

As mentioned previously, the birthday paradox makes this event quite likely. In particular, a accurate approximation can be determined when the problem is cast as a collision problem. Let p(n; d) be the probability that at least two numbers are the same, d be the number of combinations and n the number of trails. Then, we can show that p(n; d) is approximately equal to:

1 - ((d-1)/d)^(n*(n-1)/2)

We can easily plot this in R:

> d = 2176782336
> n = 1:100000
> plot(n,1 - ((d-1)/d)^(n*(n-1)/2), type='l')

which gives

enter image description here

As you can see the collision probability increases very quickly with the number of trials/rows

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
csgillespie
  • 59,189
  • 14
  • 150
  • 185
1

While I don't know the specifics of exactly how you want to use these pseudo-random IDs, you may want to consider generating an array of 3000000 integers (from 1 to 3000000) and randomly shuffling it. That would guarantee that the numbers are unique. See Fisher-Yates shuffle on Wikipedia.

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339
Alexey Frunze
  • 61,140
  • 12
  • 83
  • 180
0

A caution: Beware of relying on the built-in rand where the quality of the pseudo random number generator matters. I recently found out about Math::Random::MT::Auto:

The Mersenne Twister is a fast pseudorandom number generator (PRNG) that is capable of providing large volumes (> 10^6004) of "high quality" pseudorandom data to applications that may exhaust available "truly" random data sources or system-provided PRNGs such as rand.

The module provides a drop in replacement for rand which is handy.

You can generate the sequence of keys with the following code:

#!/usr/bin/env perl

use warnings; use strict;
use Math::Random::MT::Auto qw( rand );

my $SEQUENCE_LENGTH = 1_000_000;
my %dict;
my $picks;

for my $i (1 .. $SEQUENCE_LENGTH) {
    my $pick = pick_one();
    $picks += 1;

    redo if exists $dict{ $pick };

    $dict{ $pick } = undef;
}

printf "Generated %d keys with %d picks\n", scalar keys %dict, $picks;

sub pick_one {
    join '', map { ("A".."Z", 0..9)[rand 36] } 1..6;
}

Some time ago, I wrote about the limited range of built-in rand on Windows. You may not be on Windows, but there might be other limitations or pitfalls on your system.

Sinan Ünür
  • 116,958
  • 15
  • 196
  • 339