
The standard approach involves generating a unique ID (a small integer, usually from an auto-increment sequence) and then passing that ID through a bijective function to produce a short string, as described here: https://stackoverflow.com/a/742047/762747
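For reference, the bijective base-62 scheme from that answer can be sketched as follows (a minimal illustration, not a production implementation):

```python
# Map an auto-increment integer ID to a short base-62 string and back.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"
BASE = len(ALPHABET)  # 62

def encode(n: int) -> str:
    """Convert a non-negative integer to its base-62 representation."""
    if n == 0:
        return ALPHABET[0]
    digits = []
    while n > 0:
        n, rem = divmod(n, BASE)
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits))

def decode(s: str) -> int:
    """Convert a base-62 string back to the original integer ID."""
    n = 0
    for ch in s:
        n = n * BASE + ALPHABET.index(ch)
    return n
```

Because the mapping is bijective, `decode(encode(n)) == n` always holds, so the shortener needs no extra lookup table beyond the ID itself.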

But this approach does not work for distributed systems at scale. IDs in NoSQL databases are generally much larger in order to guarantee uniqueness. One could try to generate auto-increment IDs on top of them, but that is surely going to be inefficient.

Are there any other approaches to generating short URLs? Specifically:

1) How does Twitter generate t.co URLs? It is the best example I can think of when we are talking about scale. Tweet IDs are much larger (Twitter uses Snowflake), so we can say that Twitter does not (and probably cannot) use auto-increment IDs.

2) If they do use the same approach, is the URL shortening done asynchronously, to ensure they don't take a performance hit while generating the auto-increment ID and the short URL?

As for my specific case, I am trying to generate a unique shortened string from a MongoDB BSON ObjectId. The above approach yields a 16-character base-62 string when I try to shorten the ObjectId. I could take the unique auto-increment ID route, but that sounds inefficient for obvious reasons. Thoughts? If Twitter can do it, we can as well. Good folks @twitter, do you mind sharing your approach?
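To illustrate why the result is so long: a BSON ObjectId is 12 bytes (~96 bits), and 96 bits simply don't compress below about 16 base-62 characters. A quick check (the ObjectId value here is made up for illustration):

```python
# Shortening a 12-byte MongoDB ObjectId with plain base-62 encoding.
# The 24-hex-digit id is ~96 bits, so the output is around 16 characters,
# which is barely shorter than the original -- hence the problem.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    digits = []
    while n > 0:
        n, rem = divmod(n, len(ALPHABET))
        digits.append(ALPHABET[rem])
    return "".join(reversed(digits)) or ALPHABET[0]

object_id_hex = "53d9b2a1e4b0a1b2c3d4e5f6"  # hypothetical ObjectId
short = encode(int(object_id_hex, 16))
print(len(short))  # 16 characters
```

In other words, the length problem comes from the size of the input ID, not from the encoding: base-62 only shortens what is already small.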

amit_saxena
  • possible duplicate of [PHP URL Shortening Algorithm](http://stackoverflow.com/questions/3514057/php-url-shortening-algorithm) – Dávid Szabó Jul 30 '14 at 16:01
  • @newboyhun....that's the standard approach with auto increment ids, which will be inefficient in case of distributed systems at scale. – amit_saxena Jul 30 '14 at 16:06
  • @newboyhun what does that mean? And looks like it's irrelevant here (if you are talking about tech jargon here), and smells of ego ;) Additionally, I would request you to remove the duplicate question vote as the post you are pointing to doesn't answer this. Please read the question in totality, and if possible point me in the right direction. – amit_saxena Jul 31 '14 at 10:23

1 Answer


Auto-incrementation is not a requirement; the requirement is ID uniqueness. You simply farm out contiguous blocks of IDs to each server that issues new IDs. Each server then auto-increments from the start of its block to the end of the block. Cross-server locking is done at the level of ID blocks, not individual IDs.
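The block-allocation idea can be sketched as follows (a toy single-process model; the class names and block size are illustrative, not anyone's actual implementation):

```python
import threading

BLOCK_SIZE = 1000  # illustrative; tune to trade coordinator traffic for gap size

class IdCoordinator:
    """Central service; takes a lock only when handing out whole blocks."""
    def __init__(self):
        self._next = 0
        self._lock = threading.Lock()

    def allocate_block(self):
        with self._lock:
            start = self._next
            self._next += BLOCK_SIZE
        return start, start + BLOCK_SIZE  # half-open range [start, end)

class IdServer:
    """Issues IDs locally; contacts the coordinator only when its block runs out."""
    def __init__(self, coordinator):
        self._coord = coordinator
        self._cur, self._end = coordinator.allocate_block()

    def next_id(self):
        if self._cur >= self._end:  # block exhausted: fetch a new one
            self._cur, self._end = self._coord.allocate_block()
        n = self._cur
        self._cur += 1
        return n
```

Two servers sharing one coordinator draw from disjoint ranges, so every ID is unique while contention is paid only once per block rather than once per ID.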

You can handle the inevitable ID gaps caused by servers coming and going with a low-priority background scan of the database that gathers the gaps and adds them to a "known-free" list of ranges, to be farmed out before new ID ranges are issued.

Both the global auto-increment and the localized auto-increment approaches are theoretically O(N) in the number of IDs. This merely demonstrates that algorithmic complexity is sometimes a poor indicator of expected performance.

Kuba hasn't forgotten Monica
  • Perfect answer. This is very simple once you think about it. Each server instance can atomically increment the counter by, say, 100 or 1000, meaning the next 100 IDs won't be used by any other server, and then apply the same encoding algorithm to each ID in that range. When a block is used up, the server requests the counter again. With a larger block size hardly any requests need to be made, so it can be fast. And yes, the theoretical complexity is not very important in this case. It seems to be the best possible method. The algorithm to generate a shortened URL is also uncomplicated. – jamylak Sep 06 '15 at 07:52