0

Possible Duplicate:
php short hash

I need to generate a short hash. The shortest possible from urls say under 6 characters.

I need them to be unique just for the same domain, so a hash from

www.example.com/category/sth/blablabla must be different than one from www.example.com/category2/sth/blabla but not from: www.example2.com/category/sth/blablabla

Would using md5($url) and then picking some 5 characters out of that result (for example the first, last, middle and 2 other characters) give and unique id?

Would this abbreviated hash be unique as well?

Community
  • 1
  • 1
f1ames
  • 1,714
  • 1
  • 19
  • 36
  • If the path name is unbounded in your problem, then there cannot be any bounded unique hash, simply by counting (colourfully called the "pidgeon-hole principle"). – Kerrek SB Nov 16 '11 at 15:23
  • If you're trying to minimize the length of the hash (in characters, or bytes?), you must maximize the uniqueness per character. Can you use UTF-8/unicode characters? – Brian Cain Nov 16 '11 at 15:28

4 Answers4

1

A hash is not unique by definition. It's mathematically impossible to get a unique hash for something longer than the hash, unless it does not vary fully, which is the case for URLs but you cannot exploit it generally. Alternatively, you could use a simple incrementing ID, but that won't allow you to recognize matching URLs.

Either use a really long hash (at least 10 characters, ideally using upper and lower case letters), or accept collisions and handle them appropriately. Which is how actual hash tables work.

Michael Borgwardt
  • 342,105
  • 78
  • 482
  • 720
  • "you could use a simple incrementing ID, but that won't allow you to recognize matching URLs", yeah I need to match urls so I can't use this method. I think short tags and checking for collision could be the solution – f1ames Nov 16 '11 at 16:03
1

For low probability of collisions you can use universal hashing techniques. For example, choose a prime number P. Then for each character of the URL choose a random in the interval [0, P). Compute the hash of the URL as SUM(a[i]*c[i]) mod P, where c[i] is a character in the original URL. Then take the string containing the digits of the obtained integer as the hash.

Read more in this paper: http://www.cs.cmu.edu/~avrim/451/lectures/lect0929.pdf.

Tudor
  • 61,523
  • 12
  • 102
  • 142
0

Yes, a small change in a URL will change pretty much every character in a good hash. MD5 or SHA1 is probably fine for this. Hence, take the first X characters - and you won't get any improvement by choosing the last X characters, or the first/last/middle. They're all good!

Obviously the more characters you put in your partial hash, the less likely you are to get collisions.

halfer
  • 19,824
  • 17
  • 99
  • 186
  • (If www.example.com is the same as www.example2.com, then search-and-replace the latter to the former before hashing.) – halfer Nov 16 '11 at 15:27
-1

I would try using crc32($url); it will give an integer usually 10-11 digits-long, could be a negative value, but still it will be shorter than 32 chars for md5.

The only problem is that crc32 is not 100% unique, but it's very unlikely that two different URLs will end up with the same checksum (but still there is a possibility).

halfer
  • 19,824
  • 17
  • 99
  • 186
Dmitri Snytkine
  • 1,096
  • 1
  • 8
  • 14