60

How would you go about generating the unique video URL's that YouTube uses?

Example:

BenB
  • 10,300
  • 7
  • 32
  • 30
  • 38
    If you own Scrabble, just shake the bag and pick the first few :/ – Peter Ajtai Jun 14 '10 at 03:44
  • @Peter: That would get tedious if you need more than one a minute or so... – RCIX Jun 14 '10 at 04:16
  • @Peter: Didn't know Scrabble had number tiles as well :) Maybe you're talking about this special release http://www.ajmetter.com/uploads/1/0/0/8/1008287/9043661.jpg – Jacob Jun 14 '10 at 15:01
  • many coprs uses those kinda sequences. they are usually encoded and decoded to map against the data location, in their storage. and as stated usually pretty simple algorithms. – DarthVader Jun 14 '10 at 17:42
  • Why do you want to this?. Is it (a) so that people can't guess the URLs or (b) so that the URLs are uniformly distributed in a directory hierarchy in a file system (to work around slow directory searches). – MZB Jun 14 '10 at 17:56
  • Lin: the question linked didn't specifically ask just for hashing algorithms, although its accepted answer suggests one. – BoltClock Sep 05 '10 at 06:20
  • @MZB Probably the same reason youtube uses it. – user Jun 22 '14 at 06:00

11 Answers11

57

YouTube uses Base64 encoding to generate IDs for each video.Characters involved in generating Ids consists of

(A-Z) + (a-z) + (0-9) + (-) + (_). (64 Characters).

Using Base64 encoding and only up to 11 characters they can generate 73+ Quintilian unique IDs.How much large pool of ID is that?

Well, it's enough for everyone on earth to produce video every single minute for 18000 years.

And they have achieved such huge number by only using 11 characters (64*64*64*64*64*64*64*64*64*64*64) if they need more IDs they will just have to add 1 more character to their IDs.

So when video is uploaded on YouTube they basically randomly select from 73+ Quintilian possibility and see if its already taken or not.if not use it otherwise look for another one.

Refer to this video for detailed explanation.

Rahul Sharma
  • 2,867
  • 2
  • 27
  • 40
Sunil Kumar Jha
  • 841
  • 1
  • 8
  • 11
  • 8
    I've seen this video before, but what I'm not sure that "checking if it's already been taken" is as simple as it sounds. What if it gets chosen by a different server while that server is checking? By the time it comes back it can't possible be so sure anymore – Brian Leishman Aug 06 '18 at 20:49
  • @BrianLeishman you can control that concurrency implementing a centralized algorithm that provides each hash with a semaphore (or locking mechanism). – ClownCoder Jan 05 '19 at 05:36
  • 3
    yeah they have to use a lock... and won't that make that single server very busy if there are 200 servers constantly checking with that 1 server? I think one way is for that single server to generate 1000 of such base64 IDs at a time, register them being taken, and each time any of those 200 servers want some IDs, just give them 1000 at a time – nonopolarity Dec 10 '19 at 07:42
  • If you want to do this in JS, NanoId is the go-to package https://github.com/ai/nanoid – James L. Nov 08 '20 at 00:32
  • @nonopolarity Why not give each server a prefix? – elad.chen Jul 19 '21 at 07:08
  • Youtube ids are not 64^11. Internally, Youtube is most likely using an unsigned int 64 as the id for all of their videos. The last character of a Youtube video can only be one of 16 values (not 64). So the true id space is 64^10 * 16 -- which happens to also be the same value as the largest unsigned 64 bit int possible. – Jason Baumgartner Oct 03 '21 at 03:43
  • I would guess that YouTube has a unique constraint on video ID in their database. A server could generate a random ID and try to insert the video metadata into the database, and in the unlikely event that the ID is already taken, it would error and the server could generate a new one and try again. Some Googling tells me that YouTube uses a technology called Vitess which is a clustering technology for MySQL. I don't know the details of that, but having a unique constraint like that is easy in MySQL. – Aurast Dec 20 '21 at 22:21
28

Using some non-trivial hashing function. The probability of collision is very low, depending on the function, the parameters and the input domain. Keep in mind that cryptographic hashes were specifically designed to have very low collision rates for non-random input (i.e. completely different hashes for two close-but-unequal inputs).

This post by Jeff Attwood is a nice overview of the topic.

And here is an online hash calculator you can play with.

Eli Bendersky
  • 263,248
  • 89
  • 350
  • 412
  • 2
    Are there any references that YouTube uses a hashing function? – cherouvim Jun 14 '10 at 04:43
  • 13
    Why even allow the possibility of a collision when you don't need to? Just start the ID at 0000000000 and increment for each new page. If that doesn't "look" random enough for you, then just take a couple of large prime numbers, P and Q, and use the sequence hash(n)=Pn mod Q – Peter Alexander Jun 14 '10 at 08:20
  • @cherovium: I don't have a reference. But my answer was more general "idea throwing" than a specific reference to YouTube. After all, the OP's question started with "How would you go about"... :-) – Eli Bendersky Jun 14 '10 at 14:59
  • 3
    @PeterAlexander they don't increment it like that so you can't view private videos – Lightning Gaming Jun 15 '21 at 19:44
11

There is no need to use a hash. It is probably just a quasi-random 64 bit value passed through base64 or some equivalent.

By quasi-random, I mean it is just a one-to-one mapping with the counting integers, just shuffled.

For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result. If you did not want people to be able to guess, you might choose a more complex mapping or just pick a random number that is not in the database yet.

Normal base64 would add an equals at the end, but in this case it is implied because the size is known. The character mapping could easily be something besides the standard.

drawnonward
  • 53,459
  • 16
  • 107
  • 112
  • 3
    *"For example, you could take a monotonically increasing database id and multiply it by some prime near 2^64, then base64 the result."* Why a prime number? And why so large? I understand the larger the more "secure" the resulting id may be, but that's a really large number, especially after being multiplied by a scalar ID, say in the hundred-million value range. – HankLloydRight Dec 13 '12 at 20:49
5

You can use any library or some languages like python provides it in standard library.

Example:

import secrets


id_length = 12
random_video_id = secrets.token_urlsafe(id_length)
AVX-42
  • 755
  • 2
  • 13
  • 21
4

Eli's link to Jeff's article is, in my opinion, irrelevant. URL shortening is not the same thing as presenting an ID to the world. Instead, a nicer way would be to convert your existing integer ID to a different radix.

An example in PHP:

$id = 9999;
//$url_id = base_convert($id, 10, 26+26+10); // PHP doesn't like this
$url_id = base_convert($id, 10, 26+10); // Works, but only digits + lowercase

Sadly, PHP only supports up to base 36 (digits + alphabet). Base 62 would support alphabet in both upper-case and lower-case.


People are talking about these other systems:

  • Random number/letters - Why? If you want people to not see the next video (id+1), then just make it private. On a website like youtube, where it actively shows any video it has, why bother with random ids?
  • Hashing an ID - This design concept really stinks. Think about it; so you have an ID guaranteed by your DBM software to be unique, and you hash it (introducing a collision factor)? Give me one reason why to even consider this idea.
  • Using the ID in URL - To be honest, I don't see any problems with this either, though it will grow to be large when in fact you can express the same number with fewer letters (hence my solution).
  • Using Base64 - Base64 expects bytes of data, literally anything from nulls to spaces. Why use this function when your data consists of a number (ie, a mix of 10 different characters, instead of 256)?
Christian
  • 27,509
  • 17
  • 111
  • 155
  • 2
    "On a website like youtube, where it actively shows any video it has, why bother with random ids?" -- Because a sequential numbering system allows for extremely easy bulk downloading of videos. They started this from the beginning, and in Youtube's early days in 2005, having all of their content siphoned was a very real possibility. Perhaps this would be a concern for anyone just starting out their own video hosting site, like OP. – pbarney Jul 21 '20 at 18:53
3

You could generate a GUID and have that as the ID for the video. Guids are very unlikely to collide.

Telavian
  • 3,752
  • 6
  • 36
  • 60
3

Your best bet is probably to simply generate random strings, and keep track (in a DB for example) of which strings you've already used so you don't duplicate. This is very easy to implement and it cannot fail if properly implemented (no duplicates, etc).

Cam
  • 14,930
  • 16
  • 77
  • 128
1

I suggest using a perfect hash function:

Perfect Hash Function for Human Readable Order Codes

As the accepted answer indicates, take a number, then apply a sequence of "bijective" (or reversible) operations on the number to get a hashed number.

The input numbers should be in sequence: 0, 1, 2, 3, and so on.

Community
  • 1
  • 1
Peter O.
  • 32,158
  • 14
  • 82
  • 96
1

I don't think that the URL v parameter has anything to do with the content (video properties, title, description etc).

It's a randomly generated string of fixed length and contains a very specific set of characters. No duplicates are allowed.

cherouvim
  • 31,725
  • 15
  • 104
  • 153
0

Just pick random values until you have one never seen before.

Randomly picking and exhausting all values form a set runs in expected time O(nlogn): What is O value for naive random selection from finite set?

In your case you wouldn't exhaust the set, so you should get constant time picks. Just use a fast data structure to do the duplication lookups.

Community
  • 1
  • 1
Thomas Ahle
  • 30,774
  • 21
  • 92
  • 114
0

Typically you're hiding a numeric identifier in the form of something that doesn't look numeric. One simple method is something like base-36 encoding the number. You should be able to pull that off with one or another variant of itoa() in the language of your choice.

Ian
  • 4,421
  • 1
  • 20
  • 17