1

I'm working on a video share project and I would like to generate "characters" id's for each video similar to how youtube does it. for example tgax-1sCgIs

Is it safe to use the following function to generate UUIDs, If for example I've 100000000 videos and I need to add new uuid, how can I be sure it's not duplicated?

function generateRandomString($length = 11) {
    $characters = '0123456789abcdefghijklm-_nopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ';
    $randomString = '';
    for ($i = 0; $i < $length; $i++) {
        $randomString .= $characters[rand(0, strlen($characters) - 1)];
    }
    return $randomString;
}
whveaguw
  • 59
  • 2
  • 2
  • 5
  • http://stackoverflow.com/questions/2040240/php-function-to-generate-v4-uuid – elclanrs Sep 09 '13 at 23:18
  • how are you adding them? if they go in to a db you set the unique flag. –  Sep 09 '13 at 23:21
  • Why don't you simply hash the the auto_incremented ID provided by database using salt and some long algorithm like whirlpool for example? You'll get character IDs that are nearly impossible to guess and there's virtually 0 chance for clash with such a low number of entries. – N.B. Sep 09 '13 at 23:47

3 Answers3

4

The following strictly addresses UUIDs. The URLs used by Youtube are not UUIDs and cannot be compared as such. They are much smaller (over 293 times smaller!) and do not have the same guarantees of such an immensely huge domain as a UUID. In this case (for "short hash tags"), duplicate checking must be used - but it need not differ than any other kind of duplicate checking.


If you create a UUID from a proper generator (e.g. a random UUIDv4 generator), then you can be assured that the probability of duplicates is "so low that it just doesn't matter".

As such, while I normally suggest not checking for duplicate UUIDs, there are cases when doing so is pertinent:

  1. During re-mergering (i.e. a cyclic merge operation) where duplicates from prior data are expected and will occur;
  2. The UUID comes from an untrusted generator (i.e. the UUID values be subverted/injected by an attacker or from other manual human intervention);
  3. If used as an SQL Column/Index there is no reason why a Unique Constraint should not be applied as it is required anyway to maintain proper multiplicities.

On the other hand, while I find UUIDs very good for inter-boundary identification (such as transporting information between systems or providing "long" unique resource handles), I find UUIDs very poor to be used as a standard database "record identifier". Where I need a surrogate PK, I merely use a traditional auto-incremement column which is much easier on physical layout. (SQL Server provides a special UUID generator that is much better for indexing - but less secure - than a truly random v4 UUID.)

Unfortunately, PHPs standard uniqid (a "custom" format?) function does not provide the best guarantees. In any case, see PHP function to generate v4 UUID that shows a UUIDv4(-ish?) implementations that is much better than the posted code as they conform to a common generation technique and use a much higher grade random source. (However, please see the comments relating to how mt_rand is seeded - or not seeded - in the answers.)

Community
  • 1
  • 1
user2246674
  • 7,621
  • 25
  • 28
  • If it’s something long as `18801aee-d4e2-406e-9eda-eaa593cd792c` I believe chances are low, but with a 11 characters like youtube does and having millions of videos, that should be a problem. I thought maybe implementing sphinx to do quick search. – whveaguw Sep 09 '13 at 23:35
  • An 11 character (in base64-ish encoding) tag is a far *smaller* space and is *not* a UUID :) UUIDs claim "you're more likely to get hit by a meteor" probabilities (assuming true randomness) - e.g. [probability of duplicaes](http://en.wikipedia.org/wiki/UUID#Random%5FUUID%5Fprobability%5Fof%5Fduplicates) - because of their *huge, huge* space. For a UUIDv4 that is 126 bits. Which is, like, huge. – user2246674 Sep 09 '13 at 23:35
  • @whveaguw For the youtube 11 characters, the domain of values is approximately `< 6^11` (each letter has less than 6 bits of information) ~ `< (2^3)^11` ~ `< 2^33`. That is *very small* in comparison (in fact, it is over `2^(126-33)` ~ `2^93` times smaller!), and I gave youtube a *very large error bonus*. *One cannot realistically compare a "short hash value" and a random UUID in terms of probability of collisions.* It's like comparing a galaxy and a small moon - entirely different scales! – user2246674 Sep 09 '13 at 23:44
  • @whveaguw Furthermore, since UUIDs are (usually) *used within a single domain*, the total number of UUIDs generated for a single domain is much smaller than all UUIDs ever generated. (If duplicate UUIDs exist in different domains it normally doesn't even matter.) – user2246674 Sep 09 '13 at 23:48
  • Uhh, I'm not a professional math doing guy, but I'm pretty sure that `6^11` is *not* equivalent to `(2^3)^11`, but `(2*3)^11` or `(2^11) * (3^11)` – Sammitch Sep 10 '13 at 16:38
1

If you're using a database you have a couple options:

  1. Just use the auto increment column of the table you're storing the videos in. The number will always be unique.

  2. Each time you generate an id, check the database to see if it exists. If it exists, re-run the function to generate a new uuid and check the database again. Do it until you query the database and no rows are returned with that id.

There are a few other posts you should look at that have a better approach at generating a true uuid:

Community
  • 1
  • 1
SeanWM
  • 16,789
  • 7
  • 51
  • 83
  • Option number one is not really an option since @whveaguw want's random IDs. Option two is viable, however! – Tim Groeneveld Sep 09 '13 at 23:27
  • True! Just throwing it out there though. – SeanWM Sep 09 '13 at 23:28
  • Option 2 can present a performance problem. UUIDs are made with the statistical question that's by their size and generator sources is meant to be unique so you don't need or can't check if it exists. – Diego C Nascimento Sep 09 '13 at 23:28
  • Thanks, I suggested a post to look at. – SeanWM Sep 09 '13 at 23:32
  • I personally use option number 1, then salt/hash the auto incremented id and covert the base from 16 to 62 (numbers and letters upper and lower). It is unique enough for my purposes and appears random enough to not arouse suspicion. – Jonathan Kuhn Sep 09 '13 at 23:32
  • @JonathanKuhn Option number 1 is near the de facto for db, but has nothing to do with UUID as the title suggest. – Diego C Nascimento Sep 09 '13 at 23:44
1

I'm pretty sure YouTube is just encoding integer IDs in a base-X system. There's just so many, and they are created so fast, that they seem random.

The code would look something like:

<?php

$base_str = '0123456789abcdefghijklmnopqrstuvwxyz-_';
$base = strlen($base_str);

// generate a number if no input
if( ! isset($argv[1]) ) {
    $number = rand(1000,1000000);
} else {
    $number = intval($argv[1]);
}

printf("Input: %d\n", $number);
printf("Base: %d\n", $base);

// will hold the base-X encoded representation of the number
$repr = '';

for( $i=$number; $i>0; ) {
    $remainder = $i % $base;
    $digit_repr = substr($base_str, $remainder, 1);
    $repr = $digit_repr . $repr;

    printf("Rem: %2d  Repr: %s  Cur: %16d  Progress: %s\n", $remainder, $digit_repr, $i, $repr);

    $i = ($i - $remainder) / $base;
}

Example output:

Input: 2000000
Base: 38
Rem: 22  Repr: m  Cur:          2000000  Progress: m
Rem:  1  Repr: 1  Cur:            52631  Progress: 1m
Rem: 17  Repr: h  Cur:             1385  Progress: h1m
Rem: 36  Repr: -  Cur:               36  Progress: -h1m

If you want to introduce a little more "randomness" into how the IDs look you can always scramble $base_str. Just keep in mind that you can only scramble it once before you start encoding IDs.

Decoding

I guess that's important, right?

<?php

$base_str = '0123456789abcdefghijklmnopqrstuvwxyz-_';
$base = strlen($base_str);

if( ! isset($argv[1]) ) {
    $input = '-h1m';
} else {
    $input = $argv[1];
}

printf("Input: %s\n", $input);
printf("Base: %d\n", $base);

$repr = str_split($input);
$number = 0;

for( $i=0; $i<count($repr); $i++) {
    $number = $number * $base;
    $value = strpos($base_str, $repr[$i]);
    $number += $value;
    printf("Char: %s  Value: %2d  Cur: %12d\n", $repr[$i], $value, $number);
}

Example output:

Input: -h1m
Base: 38
Char: -  Value: 36  Cur:           36
Char: h  Value: 17  Cur:         1385
Char: 1  Value:  1  Cur:        52631
Char: m  Value: 22  Cur:      2000000
Community
  • 1
  • 1
Sammitch
  • 30,782
  • 7
  • 50
  • 77