9

Youtube seems to have a unique 11 digit code for each video. the code includes 1-9,A-Z,a-z, and some symbols like +_* etc.

How would they calculate this unique code for each video? I am working on something where I'd like to assign a unique code to each record so hence the question.

My questions/concerns are:

  1. If they make it on-the-fly (when videos are submitted) then they'd have to check whether the code prepared for the video already exists or not? That would be an expensive operation across huge dataset like theirs.
  2. Would they run a batch job sort of thing every night or every month that creates unique codes and stash them in the DB. Then as the video is submitted it just takes a code and marks it off as "used"
  3. Would it make sense to take the auto-generated and auto-incremented ID column for each record in the DB and then somehow convert that unique ID column to an 11 digit code?

My goal is to:

  • create a unique code for a record in the table.
  • The user can share the url with that unique code with anyone.
  • When someone comes in via the unique code. Then their "coming in" gets tied to the original user who shared the url with unique code.
Anthony
  • 33,838
  • 42
  • 169
  • 278
  • 1
    The numbers are likely sequential internally and expressed in some more terse representation that doesn't have the possibility of overlapping. The character set you describe has exactly 64 values, which means that representation is probably just base 64. – bdean20 Dec 09 '13 at 22:35
  • This question appears to be off-topic because it requests mind reading. – Raedwald Dec 18 '13 at 08:41

4 Answers4

3

Read up on GUID and UID in general.

Most times if you are using a database that will generate a unique id for you and then that unique id can be encoded to numbers and letters to shorten the resulting string.

http://en.wikipedia.org/wiki/Globally_unique_identifier

Shortening the string is about the way you encode the value, it doesn't actually change it.

For example the number 15 in base 10 uses two digits, in hex it uses one digit (f) in binary it uses 4 (1111).

In the same way you can use a-z, A-Z, 0-9 and get base 62 to encode numbers into strings using far fewer digits than using base 10.

It's not the only approach but (especially if you already have database rows for it) it's the simplest. You don't even need to pad to 11 unless you really want to - but adding any number of 0's at the start of the encoded string does not alter its value.

Java even provides functions to do this for you although the max radix on these ones is 36:

http://docs.oracle.com/javase/7/docs/api/java/lang/Integer.html#toString%28int,%20int%29

Tim B
  • 40,716
  • 16
  • 83
  • 128
  • wouldn't shortening the GUID defeat the purpose of it being unique? – Anthony Dec 09 '13 at 22:09
  • Edited to add more explanation on that. – Tim B Dec 09 '13 at 22:13
  • An 11-digit base64 string contains 66 bits, or just under 10^20 possibilities. Probably good enough but definitely not collision-proof. However, if it contains a global timestamp (permuted?) it probably is sufficient. – Jim Garrison Dec 09 '13 at 22:19
  • It's a way to encode a number that's already been generated as unique (for example the id field from a database) not a way to generate a UID. If you always want 11 characters just pad the front with 0s. – Tim B Dec 09 '13 at 22:20
  • 1
    @JimGarrison I doubt they're worried about breaking 7,378,697,600,000,000,000 videos. I don't think even Google can wrangle up enough drives for that – Floegipoky Dec 09 '13 at 22:26
  • I understand. So my process should be: generate a GUID using ID column in DB -> encode the generated GUID into base 62. If it is not always 11 characters then just pad it with 0? Is my understanding correct? Would I create GUID on-the-fly? – Anthony Dec 09 '13 at 22:26
  • 1
    @Floegipoky It's not a sequentially assigned number... it's generated randomly (maybe it contains a timestamp, maybe not) so there is a very small possibility of a collision. – Jim Garrison Dec 09 '13 at 23:01
  • @JimGarrison may be it is sequential but as many users upload videos at same time it might not be noticable. – Vikram Bhat Dec 10 '13 at 15:00
  • @JimGarrison my assumption was that it was a pseudorandomly generated id that is validated to be unique by a db lookup before it's stored. You're right, collisions would occur occasionally. If they had 737,869,760,000,000 videos there would be a .01% chance of a collision. M point was only that given the size of the numbers involved, it's going to be a long time before they have to worry about collisions. I just looked on google and [here](http://stackoverflow.com/questions/3034861/youtube-url-algorithm) but I couldn't find a definitive answer regarding how they do it, sorry Anthony – Floegipoky Dec 10 '13 at 15:43
  • @Floegipoky If they had 737,869,760,000,000 videos, there'd be a more or less 100% chance of a collision (in fact, many collisions), since that number is pretty close to 2^66. Due to the birthday paradox, in fact, they hit 50% chance of a collision from randomly assigned IDs at just 2^33 or about 8 billion videos. – Nick Johnson Dec 10 '13 at 18:44
  • @NickJohnson I'm well aware of a birthday paradox, however in this case the probability of ever experiencing a collision is far less relevant than the probability that a given generated id already exists. What you're saying is that at 2^33 videos there's a <= 50% chance that they will have had to generate 2^33 + 1 ids. What I'm saying is that it will be a long time before the number of collisions matters. – Floegipoky Dec 10 '13 at 19:05
  • So long as you have a mechanism to detect and correct collisions - which can be expensive in itself and in a distributed system hard to get right. – Tim B Dec 10 '13 at 19:13
  • @TimB The information that I've seen indicates that youtube uses MySQL to store metadata for videos. It would be fairly trivial to add a unique constraint to that column, which would ideally cause the upload to fail. The user would just have to try again. Easy enough. And before anybody starts going on about problems with sharding, if they used 64bit values for the ids (which I suspect is what they're actually doing) they could fit 137,438,950,000 video ids on a single 1TB drive. – Floegipoky Dec 10 '13 at 20:56
  • @Floegipoky Whatever you were trying to say, I was pointing out that the number you chose is in fact the total number of identifiers - so a 0.01% chance of collision is entirely inaccurate. – Nick Johnson Dec 11 '13 at 16:31
1

The issue with a hashing function over the full set of possible URLs, and then check it against an indexed database, is that it removes synchronization possibilities. Consider the amount of time it takes to upload a video, checking it against their database requires just about no time, that's not the issue. The same issue happens when you think about pre-calculating: that requires synchronization over a single point of access if you want to use distributed computers, which I'm sure they do. I think your third point is probably closest to correct, and then that ID is somehow encoded into a longer number for some reason (I'm actually not sure what the advantage of it is vs. an int value; anyone got a good reason?)

AaronB
  • 306
  • 1
  • 4
0

Here is way to do it efficiently and also make it seem random:-

  1. Make a hash table of M size as large as affordable.

  2. generate first M numbers at random using lookup in hash table.

  3. when exhausted do what algorithm suggest in the link below (sorry reusing solution to similar problem).

Generate unique phone numbers

Edit:- I know the given solution is for number but you can always translate numbers to symbols using simple mapping for each digit.

Community
  • 1
  • 1
Vikram Bhat
  • 6,106
  • 3
  • 20
  • 19
0

All this back and forth has caused me to do some more research regarding youtube's backend. Here's what I've come up with.

This leads me to believe that they're using MySQL to store video metadata. Some of the following will be dependent on the assumption that they are using a relational datastore.

I think that the base64 id with 11 characters is actually a base64-encoded 64 bit value. 64^11 = (2^6)^11 = 2^66, that's too close to 2^64 to be a coincidence.

I strongly suspect that a portion of this id comes from the id of the shard that the video's metadata is stored in. Let's say they devote 24 bits (16,777,216) to the shard id. They probably use this entire range, but they don't have 16 million shards. Instead, they probably assign each shard a range of these shard ids to simplify resharding. The shard id that a given video is assigned is probably pseudo-random. When a shard starts to fill up, they split it and update the ranges. Simple.

At least a portion of the remaining bits are probably an auto-incremented value local to the shard.

If there are any bits left after this, they are probably occupied by a pseudorandom number, a timestamp, or something similar. There is also the possibility that they include other implementation-specific data, but that would potentially cause big problems if they ever had to migrate so I suspect that they would shy away from this.

Community
  • 1
  • 1
Floegipoky
  • 3,087
  • 1
  • 31
  • 47