
We are building URL-shortening functionality for our site.

What we've come up with so far:

  • We take a URL (http://www.google.com), sha1 it, and end up with a 40-character hex hash (738ddf35b3a85a7a6ba7b232bd3d5f1e4d284ad1).
  • We take the sha1 hash and encode it into base62 (basically A-Z, a-z, 0-9), ending up with a 28-character hash (jNMYchEoche67ro1k5gsCcHfDzmR) that we can decode back to the original sha1.
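
A minimal sketch of that pipeline in Python, just to make the above concrete (the alphabet order and function names are our own illustration; any fixed base62 alphabet works):

```python
import hashlib
import string

# base62 alphabet (A-Z, a-z, 0-9); the ordering is arbitrary but must stay fixed
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits
BASE = len(ALPHABET)  # 62

def base62_encode(number):
    """Encode a non-negative integer as a base62 string."""
    if number == 0:
        return ALPHABET[0]
    digits = []
    while number:
        number, remainder = divmod(number, BASE)
        digits.append(ALPHABET[remainder])
    return "".join(reversed(digits))

def short_hash(url):
    """sha1 the URL, then base62-encode the 160-bit digest (roughly 27-28 characters)."""
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()  # 40 hex characters
    return base62_encode(int(digest, 16))

print(short_hash("http://www.google.com"))
```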

The reason we use sha1 is to make sure users cannot guess the next URL from the current / past URL(s).

The reason we use base62 is to make the URL valid and readable to the user.

Now a 28-character "short URL" appended to our domain (http://www.google.com/r/jNMYchEoche67ro1k5gsCcHfDzmR) is a bit too long, especially when you consider Twitter's character limit.

What we're currently considering is cutting the sha1 down by around 20 characters, which would produce a 14-character short URL, but we're worried that going down any further will run into collisions too quickly.
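
As a sketch of what that truncation would look like (reusing the base62_encode helper from the sketch above; the 20-hex-character cut is just the figure mentioned here):

```python
import hashlib

def short_hash_truncated(url, hex_chars=20):
    # Keep only the first `hex_chars` hex characters of the sha1
    # (20 hex characters = 80 bits); base62-encoding 80 bits yields
    # roughly 13-14 characters.
    digest = hashlib.sha1(url.encode("utf-8")).hexdigest()[:hex_chars]
    return base62_encode(int(digest, 16))  # base62_encode as defined earlier
```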

We also thought about the approach in "Compressing big number (or string) to small value", but that would require us to split the 28- or 14-character hash into pieces of 2 and sort the pieces, and we have no idea how to get back to the original hash from there.

Does anyone have any ideas on what we could do? We would prefer a solution where we do not depend on the database to construct the URL, but if a DB is required, please keep in mind that we're limited to Redis / MongoDB (which means no auto-increment integer field).

– Peeter
  • Note that you cannot "compress" arbitrary URLs to a fixed length; you will need some kind of lookup (basically a DB). So is your question really "How do I generate unique short pseudo-random strings?"? – Oliver Charlesworth Jan 31 '13 at 16:01
  • Uhm, you cannot reverse a sha1 hash anyway. So how are you going to know what the original URL is? – nneonneo Jan 31 '13 at 16:02
  • It appears so, but we're willing to compromise for collisions in exchange for shorter URLs. – Peeter Jan 31 '13 at 16:02
  • Ok, in that case, there are lots of existing questions on this topic: http://stackoverflow.com/search?q=generate+unique+strings ;) – Oliver Charlesworth Jan 31 '13 at 16:03
  • "Willing to compromise for collisions" sounds like a one-way ticket to hell, at least compared to what you gain here: nothing. Don't reinvent the wheel; use one of the many shortening APIs out there, or one of the ways to generate correctly working shortened URLs with your app (using a DB, for example). – mark Jan 31 '13 at 16:05
  • @nneonneo We will store the hash as a key in redis (key-value pairs). We will depend on our database to match the hash back to a URL. We're hoping to create the hash without having to query the database. – Peeter Jan 31 '13 at 16:05
  • But you have to insert into the database anyway... – nneonneo Jan 31 '13 at 16:06
  • In that case start [here](http://en.wikipedia.org/wiki/List_of_hash_functions) for a list of hash algorithms. Just pick one that fits your needs. You have anything from 4 bits upwards there. – OldCurmudgeon Jan 31 '13 at 16:07
  • @nneonneo: But the question is how do you generate a unique string without interrogating the DB? But this is a solved problem, as in my previous comment ;) – Oliver Charlesworth Jan 31 '13 at 16:11
  • Generate a random string and try to insert it; if that fails, try another random string. If the string is long enough and you have a good enough random source, probability dictates you do only slightly more than one insert on average (see the sketch after these comments). – nneonneo Jan 31 '13 at 16:15
  • @mark We've considered using existing APIs, but unfortunately they won't work for us. The usual approach of hashing a unique auto-incremented ID, as most URL shorteners do, won't work since we do not have that ID. @nneonneo Querying and inserting are two rather different things. We would prefer not to loop and query the database for every short URL we create (especially since we have user-generated content, with one page having a ridiculous amount of subpages that all require a short URL). – Peeter Jan 31 '13 at 16:19
  • So, if you have user generated content, how would you know that the user generated original urls would be unique? If user A makes a page "Lovely Crawfish Recipe" and user B makes the same "named" page, how would you know that the content doesn't already exist? Any kind of hashing is irrelevant at that point, because you have two different resources identically named. – JayC Jan 31 '13 at 17:09
  • Don't worry about URL length and Twitter: it shortens long URLs anyway. You'll just have the user ping through two URL shorteners before reaching the actual URL. – Indrek Jan 31 '13 at 18:09
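
A minimal sketch of the random-string-plus-retry idea from the comment above, assuming Redis via redis-py (the key prefix and the 7-character length are illustrative choices, not from the discussion):

```python
import secrets
import string

import redis

ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits  # base62

def allocate_random_short(r, long_url, length=7):
    """Generate random base62 strings until one can be inserted as a new key."""
    while True:
        candidate = "".join(secrets.choice(ALPHABET) for _ in range(length))
        # SETNX writes only if the key is free, so a failed write means the
        # candidate is already taken and we simply draw another one.
        if r.setnx("shorturl:" + candidate, long_url):
            return candidate

r = redis.Redis()
print(allocate_random_short(r, "http://www.google.com/some/long/path"))
```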

1 Answer


I'm not sure I understand all of your requirements, but here's what I have in mind:

Cutting down the sha1 seems like the correct approach.

If you "register" every short URL in your DB, you can avoid collisions by trying to allocate an alternate short URL on a collision (if the hash is already found in your DB, you have a collision).

It would work like this:

  1. Try to allocate a new hash: cut the sha1 down as much as you want; call the result HASH1.
  2. Check for a collision in the DB. If there is none, register HASH1 in the DB and finish.
  3. If there is a collision, try to allocate a new hash, for example by cutting the sha1 by one character less (resulting in a slightly longer hash); call the result HASH2.
  4. Check for a collision again (step 2), and so forth. (A sketch of this loop follows below.)
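
A minimal sketch of that loop, assuming Redis via redis-py and the base62_encode helper sketched in the question (the key prefix and the starting length of 10 hex characters, i.e. 5 bytes, are illustrative, and the check here is not race-free):

```python
import hashlib

def allocate_short_hash(r, long_url, start_hex_chars=10):
    """Cut the sha1 short; on a collision, retry with one more hex character."""
    digest = hashlib.sha1(long_url.encode("utf-8")).hexdigest()  # 40 hex characters
    for hex_chars in range(start_hex_chars, len(digest) + 1):
        candidate = base62_encode(int(digest[:hex_chars], 16))
        key = "shorturl:" + candidate
        # SETNX registers the hash only if it is not taken yet
        if r.setnx(key, long_url):
            return candidate  # new mapping registered
        if r.get(key) == long_url.encode("utf-8"):
            return candidate  # this exact URL was already registered earlier
        # otherwise the hash belongs to a different URL: lengthen and retry
    raise RuntimeError("collided on the full sha1 -- practically impossible")

def resolve(r, short_hash):
    """Look up the original long URL for a short hash (None if unknown)."""
    value = r.get("shorturl:" + short_hash)
    return value.decode("utf-8") if value is not None else None
```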

Every time you want to look up the correct long URL for a hash, you would have to consult your DB, of course. I guess that's what you're already doing now, since sha1 is irreversible.

How far should you cut down the sha1 initially? I would say as much as possible, as long as you satisfy your requirement that the next URL is hard to guess. I would say leaving only 5 bytes of sha1 (that's 40 bits) is going to be hard as hell to guess (even with 1 million short URLs in your DB, a random guess still only has about a 1-in-a-million chance of hitting one).
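
For anyone who wants to check that estimate, the back-of-the-envelope arithmetic:

```python
# 40 bits of hash gives 2**40 possible values; with 10**6 registered short URLs,
# the chance that one random guess hits an existing entry is about 1 in a million.
print(10**6 / 2**40)  # ~9.1e-07, i.e. roughly 1 in 1.1 million
```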

– talkol