2

Considering youtube video url (for example):

e.g. :

http://www.youtube.com/watch?v=-JVkaMqD5mI&feature=related

I'm talking about the -JVkaMqD5mI part. ( length=11)

lets calc the options :

a-z = 26     |
A-Z = 26     |_______ >    26+26+10+2 = 64 optional chars in 11 places  = 64^11 = 73786976294838206464
0-9 = 10     |
-_ = 2       |

Im still wondering , when they generate a new ID for a new video , do they still check if already exists ?

Im sure they have some list( db or cache) of the "already generated ID's" ... ( and if they do , do they aquire the db each time ? or in cache ? or...?)

Or do they rely on the 1.355252...e-20 chances which is almost 0.( but still !=0)

What is the best practice solutions for this kind of situations ?

Royi Namir
  • 144,742
  • 138
  • 468
  • 792
  • 1
    Not a reql question....god...When SO will stop register KIDS ....when? – Royi Namir Sep 23 '12 at 18:38
  • 1
    I have a solution to show you at this answer: http://stackoverflow.com/questions/8528720/is-there-a-simple-way-to-generate-a-un-duplicate-string/8528822#8528822 – Aristos Dec 23 '12 at 20:29

2 Answers2

6

Well, just because they are using an alphanumeric ID on the video, does not mean they are simply generating those characters at random. Just because that string looks like random garbage to you I assure you it is not random and there is lots of information hidden in there.

So quick answer: No, it is not feasable to generate a random sequence of letters then either a) hope for no collisions or b) check through possibly billions of records to see if you already have that.

Much easier to keep a central "last ID used" and have an algorithm that moves from "last ID used" to "next ID to use" in a way that is mathematically guaranteed to generate a previously unused ID. In the case of sequential ID numbers that formula is simply f(n+1) = f(n)+1 (eg. last ID used was 150, next one will be 151.. guaranteed unused so far) but you can devise your own formulas to suit your needs.

Radu094
  • 28,068
  • 16
  • 63
  • 80
  • 3
    *"check through possibly billions of records to see if you already have that."* Why is that unfeasible? – ypercubeᵀᴹ Sep 22 '12 at 13:46
  • most probably a B-tree with a few billion leafs will paginate and eventually have to hit the disk everytime you query it. This will make the operation a few orders of magnitude slower than my proposed alternative. Not saying it can't be done; just that it shoudn't be – Radu094 Sep 22 '12 at 19:04
  • @ypercube OK feasible to guess and check for collisions just silly. Use a generator that can create the next unique ID based on the last. E.G. Database Identity or NEWSEQUENTIALID. Why would you use an order n when you can use an order 0. – paparazzo Sep 22 '12 at 21:34
  • @Blam: Database `IDENTITY()` is sequential, and not a hash, like OP asked. I'm not familiar with how `NEWSEQUENTIALID()` achieves to generate a new unique value each time. Is the source code available? – ypercubeᵀᴹ Sep 22 '12 at 22:03
  • 1
    @ypercube Exactly Identity is not a hash. OP asked what the was that was? Based on inspection you cannot conclude -JVkaMqD5mI is a hash or a sequential ID (or a random ID). Based on statistics you can conclude an 11 length alphanumeric is not a reliable unique hash. Again why go O(n) when you can go O(0)? – paparazzo Sep 22 '12 at 22:13
  • @Blam: Where did I say that you should? Did you read this answer? *"No, it is not feasible to generate a random sequence of letters then either a)... or b)..."* Both a and b are unjustified. Radu094 does not mention the 88 bits and why that affects the hope or not for collisions. Neither he explains why it is unfeasible to check for duplicates. – ypercubeᵀᴹ Sep 22 '12 at 22:25
  • And I did not conclude that '-JVkaMqD5mI' is hash or random. The OP implies that in the question when he talks about chances. – ypercubeᵀᴹ Sep 22 '12 at 22:26
  • For nonsequential, yet deterministic and unique IDs, use `f(n+1) = (f(n)+a) mod b`. For coprime a and b, this will return a unique ID from the pool of `b`. With, say, b close to 2^64, even YouTube won't run out of IDs anytime soon. – Seva Alekseyev Sep 23 '12 at 18:20
  • @Seva: That works pretty fine, but only if you have exactly one producer of the sequence. GUIDs are usually meant to be produced by several and possibly isolated producers. – ypercubeᵀᴹ Sep 23 '12 at 21:19
  • 1
    @Blam: So, let me understand what you propose. Assuming that the GUID is generated in constant time (`O(1)`) and guaranteed (statistically) to be unique, we can skip checking uniqueness? Just to save a few milliseconds of hitting a unique index? If it is a column in a database table, won't it be stored anyway and certainly indexed? – ypercubeᵀᴹ Sep 23 '12 at 21:24
  • @ypercube Yes I will take O(1) over O(n) every time. And I think a site like youtube would do the same. On a large table it is more than just a few milliseconds. And a needless load a precious resource (the database). Why would ever generate an ID you have to test when you can generate an ID you don't have to test. And I would not use a GUID - would use a sequential ID so the database index does not fragment. At even 1 million records a lookup on an indexed column is 0.1 seconds (if not in memory). 1000 loads and that is a waste of 100 seconds on both the web and sql server. – paparazzo Sep 23 '12 at 22:33
  • @Blam The only id's you don't have to test are ones you get something else to test for you, for example using UNIQUE contraints. The problem is that your proof is only valid given a relatively narrow range of circumstances. What if the system that generates the identifier crashes badly during the day due to hardware failure and is restored to a point it was sometime in the past? Your proof of uniqueness goes "poof." It's better to check than to worry when you find duplicates.... – Chris Travers Sep 24 '12 at 08:23
  • @ChrisTravers System running normal is a narrow range? You don't think a unique generator can initialize itself on start up? E.G. SQL NEWSEQUENTIALID or Identity. – paparazzo Sep 24 '12 at 08:39
  • "running normal" is too narrow not to check if the data is important. Yes it can initialize on startup but what if you restore that initializated state from backup? The *only* things you don't have to test are those which are declaratively constrained. – Chris Travers Sep 24 '12 at 08:49
  • @ChrisTravers Simple the UniqueID initializes from the data? E.G. SQL Identity, NEWSEQUENTIALID, or GUID then the unique ID does restore with the data. – paparazzo Sep 24 '12 at 09:07
  • Ok, then you get concurrency issues. How much testing do you put your algorithm through? What is the cost of a bug that eliminates that uniqueness guarantee? The advantage to a numeric, incrementing id is that as the index paginates, the number of cached pages needed to verify that the next number up is not in the index is relatively small. With a GUID, or other string identifier, you may have to assume you have to keep the whole index in memory. So maybe the right answer is to make those unique-checking friendly... I never trust test suites where I can trust declarative constraints. – Chris Travers Sep 24 '12 at 09:11
  • @ChrisTravers Enough! The answer to how much testing do I put the algorithm through is zero. I used Microsoft SQL as an example. I don't run SQL through any testing before I use it. I blindly trust the SQL features Identity, GUID, and SEQUENTIALID. As for lecturing me on GUID if you read you will see where I recommend against it. And I use .NET GUID without testing it for statistical uniqueness. If you think inserts with unique identifies have to be O(n) to be reliable then have at it. – paparazzo Sep 24 '12 at 10:32
0

For those purposes, usually something called a hash function is used. It creates a fixed-length data or strings from some other data, which can be any given length or type. It uses some algorithm for that. One example is the one you gave, encoding letters to numbers.

Hash functions are not that simple as they look. There can be a serious mathematical method behind them, and you can try to prove that they are perfect or minimal perfect (which is not that important for this example).

A perfect function is a hash function that cannot generate the same output for any two different imputs. If you have that kind of hash function, you do not have to check for duplicates. If you want to do that, you have to prove that your hash function is perfect.

Aleksandar Stojadinovic
  • 4,851
  • 1
  • 34
  • 56
  • 1
    just wanted to add that in the presented case, the hash function needs to be bijective aswell.Since one is using the hash in the url to retrieve the video, the servers need to reproduce the original inputs given the hash from the url – Radu094 Sep 22 '12 at 19:09
  • How can you conclude -JVkaMqD5mI is a hash. MD5 and GUID are both 32 hexidecimal / 128 bit for statistical uniqueness. I am not buying that youtube has an algorithm for a unique hash that is 11 alphanumeric. – paparazzo Sep 22 '12 at 20:31
  • I did not conclude, it is just an presumption. I don't see anything impossible in it, really. – Aleksandar Stojadinovic Sep 22 '12 at 23:03
  • 1
    Partially correct. A perfect hash function *does* generate the same output for two different inputs, but it guarantees that the number of those will be constant. Also, it comes with the cost of using a lot of memory. – Sanja Sep 23 '12 at 14:46