0

Possible Duplicate:
php short hash

MD5 hashes are long and inconvenient to use. How can I further encode a md5 string to produce a shorter string using a subset of characters, for example a-z, A-Z and 0-9? jsfiddle.net is doing something like this on their website to produce short links easy to copy and paste and look like this: http://jsfiddle.net/uY7Pk/

Community
  • 1
  • 1
Dziamid
  • 11,225
  • 12
  • 69
  • 104
  • 1
    **jsfiddle is not using those to implement security!** For security, MD5 itself is hardly long enough. Unless you don't really want *hashes*, but rather short *unique* identifiers? – Jon Sep 20 '11 at 11:20
  • 1
    "MD5 hashes are long and inconvenient to use" - no, they're not. – Widor Sep 20 '11 at 11:20
  • See the linked duplicate, you're looking for base36 encoding – David Snabel-Caunt Sep 20 '11 at 11:50

4 Answers4

2
  1. Generate a random N-character string.
  2. See if anything else already has that string as its shorturl in the database.
  3. If yes, go to 1. If no, store that string as the shorturl for the resource in the database.

There's no need to use hashing for url shorteners when you have a persistent datastore, because you're not actually encoding the long url, you're just associating a token with it.

Amber
  • 507,862
  • 82
  • 626
  • 550
  • Actually, url shorteners are usually numerical identifiers encoded in base 62 (not a typo), not randomly generated ones. – N.B. Sep 20 '11 at 11:50
  • 1
    Depends on whether the URL shortener cares about privacy or not. If the numerical identifiers are ordered, and someone manages to reverse the encoding, they can inspect any other item they want. If the numerical identifiers are random, then generating them and encoding them is just another way of 'generating a random N-character string'. – Amber Sep 20 '11 at 11:53
  • Performance-wise, it's much more efficient to encode the numerical identifier. You can inspect any item you want on most of url shorteners anyway, so I don't see what's it got to do with privacy. – N.B. Sep 20 '11 at 11:57
  • 1
    Performance-wise, it doesn't really matter - you can always have a database index on a shorturl column if you want to look things up quickly. "Most URL shorteners"? Hardly. Many of the sites *explicitly* designed to shorten URLs let you browse what has been shortened, but there are plenty of sites with built-in shorturls (example: pastebin.com) which give the option of 'private' shorturls which are not publicly listed nor sequential. – Amber Sep 20 '11 at 12:01
  • It does matter when you have several millions of entries. It's much easier to be sure that whatever you enter is unique rather than loop until you get something unique, and we can safely assume that even less-popular URL shorteners have millions of shortened URLs. – N.B. Sep 20 '11 at 12:14
  • 1
    Generation of a shorturl is not necessarily where the efficiency concern is on many sites. But it's also easy to just get the benefits of both: generate two separate parts of the shorturl; one of which is a random string, the other of which is an encoded incrementing number. The first portion removes guess-ability, the second portion assures non-collision. – Amber Sep 20 '11 at 12:22
  • Web-scale is always about performance. Obfuscating an integer yields much better results than generating something randomly and trying to insert it repeating the process until it succeeds. Sure, for small scale sites it doesn't matter whether it's random or not. When you reduce possible number of unique entries then the random factor isn't playing into your favor. Saying exposing an integer is a privacy risk simply isn't true. Shooting yourself in the foot by creating something slow is (if you had the chance to create something more optimal). Just saying. – N.B. Sep 20 '11 at 12:42
  • @N.B. Who says it is slow? With high probability the generate-and-check-for-dup will succeed the first time. Adding a counter makes that probability 1. But if a single db access is too slow then a counter would work. Don't you still have to persist the counter somehow so that a restart won't reuse previous counter values? – President James K. Polk Sep 20 '11 at 12:51
  • Let's use English alphabet and shortened url with 4 chars. You have 26 * 26 * 26 * 26 * 26 possible combinations which is 11,881,376. Let's assume you managed to get 6 million of entries with random generator without collisions. Now, you have 6 mil. entries and you have about half the combinations left. What's the probability you'll get a collision randomly generating strings? It's 50%. You try to write to the db - it says unique constraint fails, you get back to random generator - with 50% of getting the unique string. So, let's see what's faster (continued in next comment). – N.B. Sep 20 '11 at 12:56
  • Trying to write to the db, invoking db connection and all the procedures involved in creating / checking a unique constraint or simply being sure you'll 100% get unique string (integer, encoded in different base)? Expand the random string to 6 characters, it's always better AND faster to have the integer displayed to user using an base-changing function. And that's simply a fact, no personal preference can change that. – N.B. Sep 20 '11 at 12:59
  • Most shortened URLs are not 4 characters of token. The average length I tend to see is 7 case-sensitive characters, which is 62 to the 7th, a little over 3.5 trillion. The chance of a collision with 6 million entries in a space of 3.5 trillion is about 1 in 600k, or less than 0.0001%. – Amber Sep 20 '11 at 13:00
  • (Also, please, if you don't understand or acknowledge the potential for privacy issues with URL shorteners, please don't continue trying to just blankly assert that they don't exist. We get it; you believe it's not an issue. Others of us disagree.) – Amber Sep 20 '11 at 13:06
  • Bit.ly - 6chars (got some 5 chars ones), goo.gl - 5 chars. We can argue all you want, I know for a fact they encode integers in a different base, it's not random string. – N.B. Sep 20 '11 at 13:06
  • I'm not the one making absolute assertions about "what *all* url shorteners" do. (Also note that even with 5 characters, a case-sensitive base-62 token has over 900 million possibilities, which makes the chance of collision with 6 million items less than 1%.) – Amber Sep 20 '11 at 13:08
1

JSFiddle is using an algorithm to shorten the URL, that has not much to do with MD5 hashing.

Since hash algorithms usually generate a certain length of Hex Data...

echo md5("Hello World");    
-> b10a8db164e0754105b7a99be72e3fe5

...there might be algorithms to compress those hashes slightly further, but they will certainly result in a binary blob - which ain't a specific subset of characters, for example a-z, A-Z and 0-9.

Sidenote: PHPass is currently a solid way to use hashes in php.

Bjoern
  • 15,934
  • 4
  • 43
  • 48
0

I think you're getting confused between encryption and URL-shortening.

What you see on JSfiddle isn't encryption, it's a unique id that refers to the page. Instead of using solely numerical IDs, they're using letters too which allows many more combinations from the same number of characters.

Widor
  • 13,003
  • 7
  • 42
  • 64
-1

use base64_encode($hash). It will produce an encode version which usualy is shorter.

Another way is to compute a crc32 which is a 32bit integer

But it depends on whether you need to recover your have afterwards.

Dan Bizdadea
  • 1,292
  • 8
  • 15
  • from the [manual](http://php.net/manual/function.base64-encode.php): *Base64-encoded data takes about 33% more space than the original data.* – Yoshi Sep 20 '11 at 11:20
  • I just need to store a secure unique token in the database that I can send to users conveniently. – Dziamid Sep 20 '11 at 11:21
  • -1: base64 encoding is of course always longer than the original. And you won't be able to recover the data anyway if you encode the *hash*. – Jon Sep 20 '11 at 11:22
  • echo base64_encode(md5('hello')) results in NWQ0MTQwMmFiYzRiMmE3NmI5NzE5ZDkxMTAxN2M1OTI=, which is no good – Dziamid Sep 20 '11 at 11:23