21

This is not strictly about URL shortening, but that is my purpose anyway, so let's treat it as such. The steps to shorten a URL are:

  1. Take the full URL
  2. Generate a unique short string to be the key for the URL
  3. Store the URL and the key in a database (a key-value store would be a perfect match here)

Now, about the second point. Here's what I've come up with:

// Base64 is from Apache Commons Codec, StringUtils from Apache Commons Lang
ByteArrayOutputStream baos = new ByteArrayOutputStream();
DataOutputStream dos = new DataOutputStream(baos);
UUID uuid = UUID.randomUUID();
dos.writeLong(uuid.getMostSignificantBits()); // the first 8 bytes of the UUID
String encoded = new String(Base64.encodeBase64(baos.toByteArray()), "ISO-8859-1");
String shortUrlKey = StringUtils.left(encoded, 6); // returns the leftmost 6 characters
// check whether the key exists in the database; repeat until it does not

Is this good enough?

Peter Mortensen
Bozho
  • Out of curiosity, why bother with a UUID? Why not, for example, just generate 5 or so bytes from a Random instance? – President James K. Polk Jan 01 '11 at 14:06
  • I started with a random / System.nanoTime / the MAC address's bits, then realized that UUID has all of these :-) – Bozho Jan 01 '11 at 14:19
  • @Bozho you may want to consider [Base32 encoding aka Crockford encoding](http://www.crockford.com/wrmg/base32.html) as it has some advantages, such as leaving out potentially ambiguous characters like the letters O and L. You will end up with a longer short URL, but if you don't have billions it may be worthwhile. – Adam Gent Mar 14 '13 at 19:13
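
To make the Crockford suggestion concrete, here is a minimal sketch of encoding a numeric key with that alphabet. The class and method names are hypothetical. The alphabet leaves out the letters I, L, O and U, and decoding accepts O as 0 and I or L as 1, so a mistyped key still resolves.

```java
// Hypothetical helper, for illustration only.
class CrockfordBase32 {
    // Crockford's alphabet: digits plus letters, without I, L, O and U.
    private static final String ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ";

    // Encodes a non-negative number, most significant digit first.
    static String encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        if (value == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (value > 0) {
            sb.append(ALPHABET.charAt((int) (value & 31))); // lowest 5 bits
            value >>>= 5;
        }
        return sb.reverse().toString();
    }

    // Decodes case-insensitively and maps look-alike letters to digits.
    static long decode(String text) {
        long value = 0;
        for (char c : text.toUpperCase().toCharArray()) {
            if (c == 'O') c = '0';
            if (c == 'I' || c == 'L') c = '1';
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            value = value * 32 + digit;
        }
        return value;
    }
}
```

Each character carries 5 bits instead of the 6 bits of Base64, which is why the same key comes out somewhat longer.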

2 Answers

5

For a file upload application I wrote, I needed this functionality, too. Having read this SO article, I decided to stick with just some random numbers and check whether they exist in the DB.

So your approach is similar to what I did.
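
The generate-and-check loop described here could look roughly like the sketch below. It is a minimal illustration, not the code from either application: `ShortKeyStore`, its method names, and the in-memory map standing in for the database are all assumptions. It uses `java.util.Base64` (Java 8+) with the URL-safe alphabet so the key can go into a URL unescaped.

```java
import java.security.SecureRandom;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; the map is a stand-in for a real key-value store.
class ShortKeyStore {
    private static final int KEY_LENGTH = 6;
    private final SecureRandom random = new SecureRandom();
    private final Map<String, String> store = new ConcurrentHashMap<>();

    // Returns 6 characters from the URL-safe Base64 alphabet (A-Z, a-z, 0-9, '-', '_').
    String randomKey() {
        byte[] bytes = new byte[5]; // 40 random bits encode to 7 characters
        random.nextBytes(bytes);
        String encoded = Base64.getUrlEncoder().withoutPadding().encodeToString(bytes);
        return encoded.substring(0, KEY_LENGTH);
    }

    // Stores the URL under a fresh key, retrying whenever the key is taken.
    String shorten(String url) {
        while (true) {
            String key = randomKey();
            // putIfAbsent is atomic: it returns null only if the key was free.
            if (store.putIfAbsent(key, url) == null) {
                return key;
            }
        }
    }

    // Returns the stored URL, or null if the key is unknown.
    String resolve(String key) {
        return store.get(key);
    }
}
```

With a real database, the check and the insert should likewise happen as one atomic operation (for example an insert guarded by a unique constraint), otherwise two requests can claim the same key.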

Community
Uwe Keim
2

Well what do you mean by URL shortening?

There are very different techniques. Most websites, AFAIK, simply put the database primary key (maybe in some encoded form) at a fixed position in the URL where it can be parsed by a regular expression, and fill the rest with keywords.

Example from Amazon: http://www.amazon.de/Bauknecht-WA-PLUS-614-Waschmaschine/dp/B003V1JDU8/

You can enter anything in place of the name of the product, only the id at the end is important.
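
As a rough sketch of that parsing step, the following pulls the id out of a path shaped like the example above. The pattern is only a guess at the layout for illustration (anything, then `/dp/`, then an alphanumeric id); it is not Amazon's actual routing rule, and `ProductUrl` is a hypothetical name.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ProductUrl {
    // Assumed layout: /<any-keywords>/dp/<id>, followed by '/', '?', '#' or the end.
    private static final Pattern ID = Pattern.compile("/dp/([A-Za-z0-9]+)(?:[/?#]|$)");

    // Returns the id if the path contains one; the keyword part is ignored.
    static Optional<String> extractId(String path) {
        Matcher m = ID.matcher(path);
        return m.find() ? Optional.of(m.group(1)) : Optional.empty();
    }
}
```

Both `/Bauknecht-WA-PLUS-614-Waschmaschine/dp/B003V1JDU8/` and `/anything-else/dp/B003V1JDU8/` yield the same id, which is the behaviour described above.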

However, you may want to keep your links clean: check whether the requested URL is the correct one and, if it is not, either do a 301 redirect to the real URL or declare a canonical URL.

However:

If you want to do something like TinyURL, my answer is a definite no.

It's not good enough.

Well it depends.

It's not "secure". It would be pretty easy to guess URLs. A better approach would be using some cryptographic function like SHA-1/MD5.

When it comes to collisions I can't really tell. A GUID is designed to make collisions extremely unlikely, but you are only using the first 6 characters. I don't know what exactly they represent in the algorithm. But it's definitely not optimal.
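
To put rough numbers on this: 6 Base64 characters carry 36 bits, and in a random (version 4) UUID, which is what `UUID.randomUUID()` produces, the first 36 bits are all random, because the version marker sits further along. So the key space is 64^6 = 2^36, about 68.7 billion values. The sketch below estimates the collision chance with the usual birthday approximation; `KeySpace` and its methods are hypothetical names.

```java
// Hypothetical helper, for illustration only.
class KeySpace {
    // Number of distinct keys of the given length over the given alphabet size.
    static double size(int alphabetSize, int length) {
        return Math.pow(alphabetSize, length);
    }

    // Birthday approximation: chance that n random keys contain at least one duplicate.
    static double collisionProbability(double n, double keySpace) {
        return 1.0 - Math.exp(-n * (n - 1) / (2.0 * keySpace));
    }
}
```

Under this approximation the chance of at least one duplicate reaches about 50% at roughly 309,000 stored keys, so the existence check in the question is needed in practice, not just in theory.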

Why, however, don't you just use the database's auto-incrementing primary key? If security is important, you also definitely have to go with more than 6 characters.
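
One common way to turn an auto-incrementing key into a short string is to write it in base 62. The following is a minimal sketch under that assumption; `Base62` and its alphabet order are illustrative choices, not something the question prescribes.

```java
// Hypothetical helper, for illustration only.
class Base62 {
    private static final String ALPHABET =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Encodes a non-negative id, most significant digit first.
    static String encode(long id) {
        if (id < 0) {
            throw new IllegalArgumentException("id must be non-negative");
        }
        if (id == 0) {
            return "0";
        }
        StringBuilder sb = new StringBuilder();
        while (id > 0) {
            sb.append(ALPHABET.charAt((int) (id % 62)));
            id /= 62;
        }
        return sb.reverse().toString();
    }

    // Reverses encode(); rejects characters outside the alphabet.
    static long decode(String key) {
        long id = 0;
        for (char c : key.toCharArray()) {
            int digit = ALPHABET.indexOf(c);
            if (digit < 0) {
                throw new IllegalArgumentException("invalid character: " + c);
            }
            id = id * 62 + digit;
        }
        return id;
    }
}
```

Six characters cover ids up to 62^6 - 1 (about 56.8 billion) with no collisions and no retry loop, at the price of keys that are trivially enumerable.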

On a project I did I used something like

/database-primary-key/hash-of-primary-key-with-some-token-or-client-information/

This way I could look up the primary key directly in the database, which was the fastest possible way, and could also use the hash to verify that the link had not been found by brute force. In my case the hash was the SHA-1 sum of the client's secret token and the primary key.
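
A minimal sketch of such a scheme follows. The exact way the token and the key were combined is not stated above, so the `secret:id` concatenation, the hex encoding, and the `SignedLink` name are assumptions made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, for illustration only.
class SignedLink {
    // SHA-1 over "<secret>:<id>", hex-encoded (the ':' separator is an assumption).
    static String hash(String secretToken, long id) {
        try {
            MessageDigest sha1 = MessageDigest.getInstance("SHA-1");
            byte[] digest = sha1.digest((secretToken + ":" + id).getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder();
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-1.
            throw new IllegalStateException(e);
        }
    }

    // Builds a path of the form /<primary-key>/<hash>/.
    static String path(String secretToken, long id) {
        return "/" + id + "/" + hash(secretToken, id) + "/";
    }

    // Recomputes the hash and compares it with the one presented in the link.
    static boolean verify(String secretToken, long id, String presentedHash) {
        byte[] expected = hash(secretToken, id).getBytes(StandardCharsets.UTF_8);
        byte[] actual = presentedHash.getBytes(StandardCharsets.UTF_8);
        return MessageDigest.isEqual(expected, actual);
    }
}
```

For new code, an HMAC (for example `javax.crypto.Mac` with `HmacSHA256`) is generally the preferred way to combine a secret with a message; plain SHA-1 is used here only because it is what the answer describes.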

Peter Mortensen
The Surrican
  • Why do you think it will be easy to guess URLs? I don't think so. The hash functions produce larger output than I need, so again I'd have to trim. A hashed DB primary key may be fine, but the database does not necessarily have an option for that. Key-value stores don't. – Bozho Jan 01 '11 at 19:45
  • For a URL shortener, why does it matter if someone can guess a URL? Ultimately, they should be redirected to that page, and access will either be public (for a bog standard web page) or restricted by some other means. – Rob Jan 01 '11 at 20:00
  • depends on the use case @Rob. if so, why do any hashing at all and not just use an auto increment? i was just trying to make clear that the use case and requirements are not clear in the question. – The Surrican Jan 01 '11 at 20:04
  • Well, you qualified the statement with, "if you want to do something like TinyURL", which is the fairly bog standard URL shortening case. The rest of your post seemed to imply it was talking about something more akin to URL routing/rewriting, in which case; yes, you may want your application identifiers to be less guessable, but of course, you shouldn't rely on that as a security measure either. – Rob Jan 01 '11 at 20:23
  • so WHAT EXACTLY is the question? – The Surrican Jan 01 '11 at 20:45