
There are a lot of questions here on stackoverflow regarding URL shorteners as well as elsewhere on the internet, e.g.

How to code a URL shortener?

How do URL shortener calculate the URL key? How do they work?

http://www.codinghorror.com/blog/2007/08/url-shortening-hashes-in-practice.html

However, there is one thing I don't understand. For instance, http://goo.gl uses four characters at the moment, yet they claim their short URLs don't expire. As mentioned in the Coding Horror article, if they can't recycle URLs, the only possible solution is, at some point, to add an additional character.

Ok, so far so good. With 4 characters that means about 15 million unique addresses. For something like Google Maps, I don't think that is very much and if you can't recycle, my guess is they run out of available addresses fairly quickly.

Now for the part I don't get. As they hand out addresses, they start to run out of unused ones. They have to check that a newly generated address has not already been issued, and the chance that it is already in use keeps increasing. The straightforward solution is of course to generate a new URL over and over until they find a free one, or until they have generated all ~15 million alternatives. However, this surely can't be how they actually do it, because it would be far too time-consuming. So how do they manage this?

Also, there are probably several visitors at once asking for a short URL, so they must have some synchronization going on as well. But how should the situation be managed when the fifth character needs to be added?

Finally, when doing some research on how the URLs from http://goo.gl work, I of course requested a short URL for a map on Google Maps several times. None of them will ever be used. However, if Google strictly enforces the policy that URLs never expire once issued, this means there are lots and lots of dormant URLs in the system. Again, I assume Google (and the other services as well) have come up with a solution to this problem. I could imagine a clean-up service which recycles URLs that have not been visited in the first 48 hours after creation, or fewer than 10 times in the first week. I hope someone can shed some light on this issue as well.

In short, I get the general principle of URL shorteners, but I see several problems when these URLs cannot expire. Does anyone know how the problems mentioned above might be resolved and are there any other problems?


EDIT

Ok, so this blog post sheds some light on things. These services don't randomly generate anything. They rely on the underlying database's auto-increment functionality and apply a simple conversion on the resulting id. That eliminates the need to check if an id already exists (it doesn't) and the database handles synchronization. That still leaves one of my three questions unanswered. How do these services "know" if a link is actually used once created?
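A minimal sketch of that conversion, assuming a 62-character alphabet (the function and alphabet names are my own illustration, not any service's actual code): the database issues a unique auto-increment id, and the service merely renders that id in base 62.

```python
# Hypothetical base-62 codec for a URL shortener: the database's
# auto-increment id is unique by construction, so converting it to
# base 62 yields a unique short code with no existence check needed.
ALPHABET = "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ"

def encode(n: int) -> str:
    """Render a non-negative database id as a base-62 short code."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def decode(code: str) -> int:
    """Recover the original id from a short code (for the redirect lookup)."""
    n = 0
    for ch in code:
        n = n * 62 + ALPHABET.index(ch)
    return n
```

With this scheme, id 1,000,000 maps to a short code of only four characters, and the redirect handler simply decodes the code back to the row id.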

Pieter
  • Goo.gl is currently using 5 characters when I try it, with mixed case and numbers. This would seem to allow for 1,934,917,632 possibilities (72^5) – MartW Jul 04 '12 at 10:05
  • On Google Maps (where I tried it), the shortener uses 4 characters. On http://goo.gl indeed I also get 5 characters. But this is still not much compared to the number of users they get on a daily basis even if only a fraction of them uses the service. So even with 5 characters, my guess is they are likely to run into problems. – Pieter Jul 04 '12 at 10:12
  • 1
    CodeByMoonlight 04 Jul 2012, 26*2 (low+up cases) + 10 (digits have just ONE case) = 62, NOT 72, which makes "only" 916 millions URLs. – Michel Merlin Mar 12 '13 at 17:41

1 Answer


Why URL shorteners don't delete entries

I once wrote to TinyURL (a decade ago) asking to give back an entry I didn't need. Their reply made me realize how ridiculous that request was: they told me to "just create all the URLs you need". The figures speak for themselves:

A - With 26 lower-case + 26 upper-case letters + 10 digits (the choice of reasonable sites), ONE character gives you 62 positions (i.e. 62 shortened URLs); each additional character then multiplies the number of positions by 62:

  • 0 chars = 1 URL
  • 1 char = 62 URLs
  • 2 chars = 3,844 (1 URL for each human in a village)
  • 3 chars = 238,328 (idem, in a city)
  • 4 chars = 14,776,336 (in Los Angeles area)
  • 5 chars = 916,132,832 (in Americas, N+Central+S)
  • 6 chars ~ 56,800,235,580 (8 URLs for each human in the world)
  • 7 chars ~ 3,521,614,606,000 (503 for each human, 4 for each web page in the world)
  • 8 chars ~ 218,340,105,600,000 (31,191 URLs for each human)
  • 9 chars ~ 13,537,086,546,000,000 (~2 million URLs for each human)
  • 10 chars ~ 839,299,365,900,000,000 (~120 million URLs for each human)
  • 11 chars ~ 52,036,560,680,000,000,000
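The capacity figures above are simply powers of 62 and can be reproduced with a two-line loop:

```python
# Each extra character multiplies the number of possible codes by 62,
# so the capacity for n characters is just 62**n.
for n in range(12):
    print(f"{n} chars = {62 ** n:,} URLs")
```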

B - Actually the needs and uses are lower than one might expect. Few people create short URLs, and each person creates few of them; the original URLs are enough in most cases. The result is that the most popular shorteners, after years, still cover today's needs with just 4 or 5 chars, and adding another one when needed will cost nearly zero. Apparently goo.gl and goo.gl/maps each use 5 chars; YouTube uses 11 (the 62 chars above, plus the dash and maybe a few others).

C - The cost of hosting (storing + operating) a URL is, say, $1000/year for 1 terabyte, with each TB able to hold 5 billion URLs; hence 1 URL costs 0.2 micro-dollars/year in hosting. However, the benefit for the shortener is also very thin, so the business is not very strong. For the user, the benefit of a URL is hard to evaluate, but a missed link would cost far more than the hosting.

D - There is no point for the user in creating a short URL if it risks becoming inoperative in the coming years, so persistency is a major appeal of a shortener, and a serious shortener will probably never cease to serve its URLs unless forced out of business. Yet this has already happened, and short URLs have their load of drawbacks as well as benefits, as well explained in the Wikipedia article "URL shortening" (risks of all sorts of attacks, against the user, the target sites, or the shortener; e.g. one could attack a shortener by bot-requesting giga-numbers of URLs, a threat surely fended off by most shorteners).

Versailles, Tue 12 Mar 2013 20:48:00 +0100, edited 21:01:25

Michel Merlin
  • While I agree with your numbers it does mean they have to keep increasing the number of characters. I guess 5 characters is still short, but can the same be said for 10? At some point they will have to increase to 10 or 11 chars if they store the urls indefinitely. Or is there another solution than just keep increasing the characters? – Pieter Mar 13 '13 at 08:09
  • 4
    @Pieter: Consider, for argument's sake: A group decides it wants to get all of TinyURL's 9-digit URLs for some freakish reason. They manage to acquire enough computing power to request a million URLs a second. (Ignore the fact that the service probably wouldn't even be able to handle that much traffic. Let's say in our reality, they dedicate massive amounts of money to providing free short URLs. :)) If you accept the numbers in the answer above, it would take 13537708655 seconds (~429 years) to send enough requests to exhaust the URL space and force an upgrade to 10 digits. – cHao Apr 30 '13 at 15:53
  • Hi, I'm programming a URL shortener and have a problem. We can't just create a random 6-char code when a user sends a shorten request, because the randomly generated code might already exist and be in use. One way is to create all 6-char codes in advance and store them in a table, deleting each code as it is handed out, but I can't store all 6-char codes in a database because of the size. Is there an algorithm to generate unique 6-char codes without having to search the URL shortener table to ensure a code doesn't already exist? Thank you. – Milad Ghiravani Oct 23 '16 at 07:41
  • Hi again. I found a good algorithm for this: http://www.lalit.org/lab/base62-php-convert-number-to-base-62-for-short-urls/ If your OS is 32-bit, this class supports integers lower than `2147483647`, and if your OS is 64-bit, it works well for integers lower than `9223372036854775807`. – Milad Ghiravani Oct 24 '16 at 18:31
  • I think you made good arguments, but you also did not really answer the question. The question itself is more technically oriented and asks how shortening services prevent collisions efficiently. – John Wu Jan 04 '18 at 15:02
  • "John Wu" 04 jan 15:02 « how do shortening services prevent collisions efficiently » The Shortener keeps a list of the SURLs (Short URLs) he has delivered. When you enter a new URL and request a SURL, the Shortener gives you the next in the list. For instance if the last delivered SURL is "www.shortener/chdsdiQsr04D", then he gives you "www.shortener/chdsdiQsr04E", so by construction there is absolutely no risk of collision. This very simple hence reliable process is possible BECAUSE of the numbers I listed. – Michel Merlin Jan 15 '18 at 14:22