Random token generation - a supposedly unlikely collision occurred

Question

A couple months ago, we were using UUIDs to generate random string IDs that needed to be unique across the board. I then changed the algorithm in order to save some data and index space in our database. I tested a few ways to generate unique string IDs, and I decided to use this function:

function generateToken($length) {
    $characters = '0123456789abcdefghijklmnopqrstuvwxyz';
    $max = strlen($characters) - 1;

    $token = '';
    for ($i = 0; $i < $length; $i++) {
        $token .= $characters[mt_rand(0, $max)];
    }

    return $token;
}

I'm using this function to generate IDs that are 20 characters long using digits and letters, or you could say these IDs are numbers in base 36. The probability of any 2 IDs colliding should be 1/36^20, but due to the birthday paradox, it can be expected for a collision to occur after about 36^10 records - that's 3.6 quadrillion records. Yet, just a few hours ago a collision occurred, when there were only 5.3 million existing records in the database. Am I extremely unlucky, or my ID-generating function is flawed with respect to randomness? I know mt_rand() isn't truly random, but it is random enough, isn't it?

I would've written a loop that checks if the generated ID is unique and generates a new one if it isn't, but I thought that the chance of getting a collision was so small that the performance cost of such a loop wasn't worth it. I will now include such a loop in the code, but I'm still interested in perfecting the ID generation function if it is indeed flawed.

Most database management systems have functionality to generate uuids that are guaranteed to be unique for a specific database instance. Why do you not use these? — NineBerry, Jun 09 '18 at 10:11
I swapped UUIDs for this kind of base 36 IDs due to space constraints. I needed to pack as much information into a bit of database space as possible, while still utilizing an algorithm that generates long and sophisticated enough IDs to make collision highly unlikely. UUIDs are in base 16 and contain dashes and some non-random characters, so they're not quite as space efficient as I would like. — Jeff, Jun 09 '18 at 10:17
What you describe is just a string representation of a uuid. A uuid is actually a binary construct that has a size of exactly 16 byte. Most DBMS specifically support a uuid column type and then store the 16 byte binary representation, not the string representation. What DBMS do you use? — NineBerry, Jun 09 '18 at 10:25
I just read a little bit of the mt_rand base, and it is hard for me to believe that this is based in the random number generator. how did you generate your seeds? could there maybe be a collision? for example by taking the time as seed and executing the program several times in parallel? — user287107, Jun 09 '18 at 10:30
MySQL. I am not aware of a UUID column type. We were using just a regular CHAR column when we needed to store UUIDs. — Jeff, Jun 09 '18 at 10:30
user287107, I'm a rookie when it comes to cryptography, so I'm not quite sure what you mean by how I generated my seeds and executing my program in parallel. I let the server do everything for me, so the seeds should be pseudo random as the mt_rand() function itself. — Jeff, Jun 09 '18 at 10:42
MySQL does not directly support an uuid column type, but you can store uuids in a varbyte(16) column and use the built-in functions uuid_to_bin / bin_to_uuid to convert between string and binary representation. — NineBerry, Jun 09 '18 at 10:57

score 3 · Answer 1 · answered Jun 09 '18 at 10:45

If you want guaranteed unique 16 byte IDs then I would use encryption. AES uses 16 byte (128 bit) blocks and as long as the inputs are unique the outputs are also guaranteed unique.

Set up AES in ECB mode (which is simpler and faster) and encrypt the numbers 0, 1, 2, 3, 4, ... Your inputs are unique so the outputs will be unique as well.

Crypto sites will tell you that ECB mode has security problems, but those problems only apply if the inputs are not unique. For unique 'random' number generation, as you require, those problems do not apply as your inputs are all unique.

r3mainer · Accepted Answer · 2018-06-09T16:14:47.187

The implementation of mt_rand() in PHP is rather fluid, so it may differ from one version to the next. However, here are some excerpts from the code used in PHP version 5:

php_rand.h:

/* MT Rand */
#define PHP_MT_RAND_MAX ((long) (0x7FFFFFFF)) /* (1<<31) - 1 */ 

#ifdef PHP_WIN32
#define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * GetCurrentProcessId())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
#else
#define GENERATE_SEED() (((long) (sapi_get_request_time(TSRMLS_C) * getpid())) ^ ((long) (1000000.0 * php_combined_lcg(TSRMLS_C))))
#endif

PHPAPI void php_srand(long seed TSRMLS_DC);
PHPAPI long php_rand(TSRMLS_D);
PHPAPI void php_mt_srand(php_uint32 seed TSRMLS_DC);
PHPAPI php_uint32 php_mt_rand(TSRMLS_D);

rand.c:

PHP_FUNCTION(mt_rand)
{
    long min;
    long max;
    long number;
    int  argc = ZEND_NUM_ARGS();

    if (argc != 0) {
        if (zend_parse_parameters(argc TSRMLS_CC, "ll", &min, &max) == FAILURE) {
            return;
        } else if (max < min) {
            php_error_docref(NULL TSRMLS_CC, E_WARNING, "max(%ld) is smaller than min(%ld)", max, min);
            RETURN_FALSE;
        }
    }

    if (!BG(mt_rand_is_seeded)) {
        php_mt_srand(GENERATE_SEED() TSRMLS_CC);
    }

From the last three lines above, you can see that mt_rand() is automatically seeded the first time it is called. However, the php_mt_srand() function takes an argument of type php_uint32. This means there are only 2³² possible seeded states for mt_rand(). So if your script runs roughly 2¹⁶ times, it is quite likely that mt_rand() will produce the exact same sequence of random numbers.

As suggested by rossum, it would be a much better idea to apply AES encryption to an incrementing 128-bit value. If you base64-encode the encrypted results and discard the trailing ==, then the resulting strings will only be 22 characters long.

Addendum

I left the following script running while I was out this afternoon:

for i in $(seq 1 100000) ; do
  php -r 'for ($n=0; $n<32; $n++) echo chr(mt_rand(97,122)); echo chr(10);' >>out
done &

As expected, the first collision occurred after about 2¹⁶ iterations (which is nowhere near 26¹⁶):

$ sort <out | uniq -d
vnexqclzkaluntglgadgwzjnjfsvqfhp

$ grep -n vnexqclzkaluntglgadgwzjnjfsvqfhp out
34417:vnexqclzkaluntglgadgwzjnjfsvqfhp
52159:vnexqclzkaluntglgadgwzjnjfsvqfhp

That's quite interesting. What about random_int()? Is it any better than mt_rand() in terms of randomness? — Jeff, Jun 09 '18 at 15:17
@Jeff Much better. According to the [documentation](http://php.net/random_int), `random_int()` defaults to sourcing random bytes from `/dev/urandom`, so the chance of a collision among 20-digit base-36 numbers is really tiny. But not zero, of course. If a collision would have serious consequences, use AES encryption instead. — r3mainer, Jun 09 '18 at 16:18
FYI, using ECB mode on a monotonically incrementing value is called counter (CTR) mode, and your crypto library may have direct support for that. Note that it's recommended to set the top 64 bits to a random value (nonce). — StephenS, Nov 04 '18 at 00:49

Random token generation - a supposedly unlikely collision occurred

2 Answers2

php_rand.h:

rand.c:

Addendum