53

I'm currently using MD5 hashes but I would like to find something that will create a shorter hash that uses just [a-z][A-Z][0-9]. It only needs to be around 5-10 characters long.

Is there something out there that already does this?

Update 1:

I like the CRC32 hash. Is there a clean way of calculating it in .NET?

Update 2:

I'm using the CRC32 function from the link Joe provided. How can I convert the uInt into the characters defined above?

Arron S
  • 3
    I think you shouldn't use any short hash, so no truncated CRC32 either... – Arjan Jul 12 '09 at 21:22
  • 1
    TinyURL does not use hashes. What are you using your "hash" for? Are you trying to create a hash or a URL shortener; the two are different. – Dour High Arch Nov 28 '12 at 20:15

14 Answers

65

The .NET string object has a GetHashCode() method, which returns a 32-bit integer. Format it as hexadecimal to get an 8-character string.

Like so:

string hashCode = String.Format("{0:X8}", sourceString.GetHashCode());

More on that: http://msdn.microsoft.com/en-us/library/system.string.gethashcode.aspx

UPDATE: Added the remarks from the link above to this answer:

The behavior of GetHashCode is dependent on its implementation, which might change from one version of the common language runtime to another. A reason why this might happen is to improve the performance of GetHashCode.

If two string objects are equal, the GetHashCode method returns identical values. However, there is not a unique hash code value for each unique string value. Different strings can return the same hash code.

Notes to Callers

The value returned by GetHashCode is platform-dependent. It differs on the 32-bit and 64-bit versions of the .NET Framework.

Vlad
  • 1
    Short and sweet. Like .NET intended. – Piotr Kula Sep 14 '12 at 13:27
  • 13
    The only problem with String.GetHashCode is that it will generate different values on different platforms (32-bit vs. 64-bit). If you're expecting the hash code to be produced and consumed by different applications, you'll need to be careful. – Brenda Bell Oct 02 '12 at 16:20
  • 9
    As Brenda stated, GetHashCode() is different on 32-bit and 64-bit systems. And it is even different between the .NET 1.1 and 2.0 CLRs. But most importantly, GetHashCode() is not guaranteed unique! You can get the same hash from two different strings (I know, it happened to me in a production environment). – eduncan911 Nov 27 '12 at 23:55
  • GetHashCode() is not suitable for such tasks. It's not guaranteed to have the same value in the next .NET version. – Alex Kofman Feb 18 '14 at 10:57
  • 5
    This is a very bad idea, as the exact algorithm by which hash codes are generated for a given class is an implementation detail which should never be persisted, because it can change between .NET versions. In fact, it HAS changed between .NET versions. – SAJ14SAJ Apr 10 '15 at 17:48
39

Is your goal to create a URL shortener or to create a hash function?

If your goal is to create a URL shortener, then you don't need a hash function. In that case, you just want to pre-generate a sequence of cryptographically secure random numbers, and then assign each URL to be encoded a unique number from the sequence.

You can do this using code like:

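using System;
using System.Collections.Generic;
using System.Security.Cryptography;

const int numberOfNumbersNeeded = 100;
const int numberOfBytesNeeded = 8;
var randomNumbers = new List<byte[]>();

// RandomNumberGenerator is IDisposable, so release it when done.
using (var randomGen = RandomNumberGenerator.Create())
{
     for (int i = 0; i < numberOfNumbersNeeded; ++i)
     {
          var bytes = new byte[numberOfBytesNeeded];
          randomGen.GetBytes(bytes);   // fill with cryptographically strong random bytes
          randomNumbers.Add(bytes);    // keep the value so it can be assigned to a URL later
     }
}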

Using the cryptographic number generator will make it very difficult for people to predict the strings you generate, which I assume is important to you.

You can then convert the 8 byte random number into a string using the chars in your alphabet. This is basically a change of base calculation (from base 256 to base 62).
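The change of base can be sketched as follows. This is a minimal example, assuming the 8 bytes are read as one unsigned 64-bit integer; the class name Base62, its method names, and the ordering of the alphabet are illustrative choices, not part of any library.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Base62
{
    // Alphabet order is an arbitrary choice for this sketch.
    private const string Alphabet =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Repeatedly divide by 62; each remainder is the next digit,
    // least significant first, so digits are inserted at the front.
    public static string Encode(ulong value)
    {
        if (value == 0) return Alphabet[0].ToString();
        var sb = new StringBuilder();
        while (value > 0)
        {
            sb.Insert(0, Alphabet[(int)(value % 62)]);
            value /= 62;
        }
        return sb.ToString();
    }

    // Draws 8 cryptographically strong random bytes and encodes them.
    public static string NextRandomKey()
    {
        var bytes = new byte[8];
        using (var rng = RandomNumberGenerator.Create())
        {
            rng.GetBytes(bytes);
        }
        return Encode(BitConverter.ToUInt64(bytes, 0));
    }
}
```

An 8-byte value encodes to at most 11 characters in base 62, so a key of 5-10 characters would need fewer random bytes (and a uniqueness check when the key is stored).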

Dan Atkinson
Scott Wisniewski
  • 2
    *"difficult for people to predict the strings you generate, which I assume is important to you"* -- aha, that might be true, given Arron's *"It only needs to be around 5-10 characters long"*. This would not be like TinyURL.com then, so it's about time Arron gives us some more details! – Arjan Jul 12 '09 at 23:38
17

I don't think URL shortening services use hashes; I think they just keep a running alphanumeric string that is incremented with every new URL and stored in a database. If you really need to use a hash function, have a look at this link: some hash functions. Also, a bit off-topic, but depending on what you are working on, this might be interesting: Coding Horror article.

jörg
13

Just take a Base36 (case-insensitive) or Base64 of the ID of the entry.

So, let's say I wanted to use Base36:

(ID - Base36)
1 - 1
2 - 2
3 - 3
10 - A
11 - B
12 - C
...
10000 - 7PS
22000 - GZ4
34000 - Q8G
...
1000000 - LFLS
2345000 - 1E9EW
6000000 - 3KLMO

You could keep these even shorter if you went with Base64, but then the URLs would be case-sensitive. You can see you still get your nice, neat alphanumeric key, with a guarantee that there will be no collisions!
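A minimal sketch of the Base36 conversion, assuming the ID is a non-negative integer such as a database identity column; the class name Base36 and its methods are illustrative, not an existing .NET API.

```csharp
using System;
using System.Text;

public static class Base36
{
    private const string Digits = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Converts a non-negative ID to its Base36 representation.
    public static string Encode(long id)
    {
        if (id < 0) throw new ArgumentOutOfRangeException(nameof(id));
        if (id == 0) return "0";
        var sb = new StringBuilder();
        while (id > 0)
        {
            sb.Insert(0, Digits[(int)(id % 36)]);
            id /= 36;
        }
        return sb.ToString();
    }

    // Converts a Base36 key back to the ID; input is treated case-insensitively.
    public static long Decode(string key)
    {
        long id = 0;
        foreach (char c in key.ToUpperInvariant())
        {
            int digit = Digits.IndexOf(c);
            if (digit < 0) throw new FormatException("Not a Base36 digit: " + c);
            id = id * 36 + digit;
        }
        return id;
    }
}
```

For example, Base36.Encode(10000) returns "7PS", and Base36.Decode("7ps") returns 10000, so the key alone is enough to look up the original row.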

KingNestor
7

You cannot use a short hash, as you need a one-to-one mapping from the short version to the actual value. For a short hash, the chance of a collision would be far too high. Normal, long hashes would not be very user-friendly (and even though the chance of a collision would probably be small enough then, it still wouldn't feel "right" to me).

TinyURL.com seems to use an incremented number that is converted to Base 36 (0-9, A-Z).

Arjan
  • Of course you *can*. Maybe you *shouldn't*, but it's perfectly possible. – Treb Jul 12 '09 at 21:09
  • You're very right indeed. :-) One *surely* shouldn't use a short hash in this situation though. I'll edit my answer and rewrite my *"You cannot create a short hash"*. – Arjan Jul 12 '09 at 21:18
5

First I generate a list of random numbers, each one an index into the base string. Then I pick the character at each index, append them, and return the result. I'm selecting 5 characters; since characters may repeat, that gives 62^5 = 916,132,832 possible strings. The second part is to check against the database to see whether the string already exists and, if not, save the short URL.

 // Requires: using System; using System.Linq;
 const string BaseUrlChars = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

 // One shared instance: on .NET Framework, Random objects created in quick
 // succession can get the same seed and so produce the same string.
 private static readonly Random Rnd = new Random();

 private static string ShortUrl
 {
     get
     {
         const int numberOfCharsToSelect = 5;

         // Pick a random index into BaseUrlChars for each output character.
         var chars = Enumerable.Range(0, numberOfCharsToSelect)
                               .Select(_ => BaseUrlChars[Rnd.Next(BaseUrlChars.Length)])
                               .ToArray();

         return new string(chars);
     }
 }
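The second part, checking for an existing key before saving, might look like the sketch below. It assumes an in-memory Dictionary as a stand-in for the database; the class ShortUrlStore and its members are hypothetical names. With a real database you would more likely rely on a unique constraint and retry on violation.

```csharp
using System;
using System.Collections.Generic;

public class ShortUrlStore
{
    private const string Chars =
        "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz";

    // Stand-in for the database table that maps key -> long URL.
    private readonly Dictionary<string, string> _urlsByKey = new Dictionary<string, string>();
    private readonly Random _rnd = new Random();

    private string NextCandidate(int length)
    {
        var chars = new char[length];
        for (int i = 0; i < length; i++)
        {
            chars[i] = Chars[_rnd.Next(Chars.Length)];
        }
        return new string(chars);
    }

    // Keeps drawing candidates until one is unused, then saves the mapping.
    public string Add(string longUrl, int length = 5)
    {
        string key;
        do
        {
            key = NextCandidate(length);
        } while (_urlsByKey.ContainsKey(key));

        _urlsByKey[key] = longUrl;
        return key;
    }

    public bool TryResolve(string key, out string longUrl)
    {
        return _urlsByKey.TryGetValue(key, out longUrl);
    }
}
```

The retry loop stays cheap as long as only a small fraction of the key space is in use; as the store fills up, a longer key length keeps collisions rare.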
Scott
Filix Mogilevsky
  • 2
    I like how this gives you easy control over the characters, allowing you to exclude characters that are visually ambiguous, like 0, O, l, I, 1, etc. – Victor Stoddard Mar 31 '15 at 05:29
3

You can decrease the number of characters in the MD5 hash by encoding it as alphanumerics. Each MD5 character is usually a hex digit, so it carries one of 16 possible values. [a-zA-Z0-9] offers 62 possible values per character, so you could produce one alphanumeric character from each group of 4 hex digits, reduced to a value the alphabet can represent (for example, modulo 64).

EDIT:

Here's a function that takes such a number (0-63) and returns a character from [0-9a-zA-Z]. This should give you an idea of how to implement it. Note that there may be some issues with the types; I didn't test this code.

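/* Maps a value in the range 0..63 to one character of [a-zA-Z0-9]. */
char num2char( unsigned int x ){
    if( x < 26 ) return (char)('a' + (int)x);        /* 0..25  -> a..z */
    if( x < 52 ) return (char)('A' + (int)x - 26);   /* 26..51 -> A..Z */
    if( x < 62 ) return (char)('0' + (int)x - 52);   /* 52..61 -> 0..9 */
    if( x == 62 ) return '0';   /* 62 and 63 reuse '0' and '1', */
    if( x == 63 ) return '1';   /* so the mapping is not reversible */
    return '?';                 /* out of range: caller passed x > 63 */
}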
Paul
  • See codymanix's answer, http://stackoverflow.com/questions/1116860/whats-the-best-way-to-create-a-short-hash-similiar-to-what-tiny-url-does/1117008#1117008 – Arjan Jul 12 '09 at 22:06
  • Hmmm, wouldn't the variable length make it hard to reverse the encoding? When invoking `num2char` multiple times for longer numbers, the result would need some separator between each encoded value, to tell them apart while decoding again. That makes the result much longer than when using a fixed-length encoding. If one doesn't mind using the + and / characters, then Base 64 encoding is easier I guess. – Arjan Jul 13 '09 at 09:48
  • According to the question, he's looking for some hash that's shorter than the MD5 that he's currently using, and that uses alphanumerics. So, the current hash is irreversible; I think that's a requirement, or at the least not a problem. And this doesn't have 'variable length' - you take 4 hex digits from the MD5 hash, then pass it to num2char. Then take the next 4, pass that number to num2char, etc. The MD5 hash has 32 hex digits. The string you get out of my algorithm uses 32/4=8 alphanumeric characters. – Paul Jul 13 '09 at 13:13
  • Of course, the MD5 is irreversible, but isn't the idea that *your* mapping should be able to decode back to that MD5 value? As for variable length: I was wrong indeed. (I thought 0 would yield "a0", while 25 would yield "a25", but that's obviously "a" and "z" -- don't know how I could be so confused.) However, returning "0" and "1" for 62 and 63 will yield duplicates from the 3rd if(..), right? Base 64 needs the + and / characters for a reason... ;-) (And I guess the 3rd `if` reads `(int)x - 52` instead?) – Arjan Jul 13 '09 at 15:04
  • hmmm. I didn't consider that *my* mapping should be decodable back to MD5... I do realize that returning "0" and "1" for 62 and 63 create possible duplicates, which could be a problem, but I was just outlining an idea here. If I can think of a better way, that's easy to interpret and/or elegant, I'll edit my post. Thanks for pointing out my error on the third if statement btw :) – Paul Jul 13 '09 at 19:45
2

You can use CRC32; its 32-bit output is 8 hex characters long, and it is used in a similar way to MD5. You can make the values unique by adding a timestamp to the actual value.

So it will look like http://foo.bar/abcdefg12.

  • Or, alternatively, you can use an alphabetical increment. The keys will look like this: /a, /b, ... /z, /a0, /aa, /ab, /ac, ... /az, /aba, /abb, /abc, ... –  Jul 12 '09 at 20:44
  • check this article - http://damieng.com/blog/2006/08/08/calculating_crc32_in_c_and_net –  Jul 12 '09 at 21:01
  • 1
    When prefixing or suffixing a timestamp to the hashed value, then what is the use of the hash? – Arjan Jul 12 '09 at 21:35
  • @Simeon Pilgrim: yes, he can use CRC32 with a timestamp if his collision expectations are low. A timestamp that includes microseconds alone may be enough to guarantee uniqueness. Ideally, a fast hash like MD5 would be better than CRC. – Victor Stoddard Mar 29 '15 at 22:11
  • @VictorStoddard if the collision expectations are low, he can use the last decimal digit, or the last bit. The point is you want zero collision. Because "expectations" and "will not happen" are not equal. – Simeon Pilgrim Mar 29 '15 at 22:35
2

If you're looking for a library that generates tiny unique hashes from integers, I can highly recommend http://hashids.org/net/. I use it in many projects and it works fantastically. You can also specify your own character set for custom hashes.

herostwist
0

If you don't care about cryptographic strength, any of the CRC functions will do.

Wikipedia lists a bunch of different hash functions, including length of output. Converting their output to [a-z][A-Z][0-9] is trivial.
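As a rough illustration of that conversion, the sketch below computes a CRC-32 bit by bit (reflected polynomial 0xEDB88320, the variant used by zip and PNG) and writes the resulting uint in base 62. The class ShortCrc, its method names, and the alphabet order are assumptions made for this example; a table-driven CRC would be faster, and collisions between different inputs remain possible.

```csharp
using System;
using System.Text;

public static class ShortCrc
{
    private const string Alphabet =
        "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789";

    // Bitwise CRC-32 over the input bytes.
    public static uint Crc32(byte[] data)
    {
        uint crc = 0xFFFFFFFFu;
        foreach (byte b in data)
        {
            crc ^= b;
            for (int bit = 0; bit < 8; bit++)
            {
                crc = (crc & 1) != 0 ? (crc >> 1) ^ 0xEDB88320u : crc >> 1;
            }
        }
        return ~crc;
    }

    // Writes the value in base 62 using only [a-z][A-Z][0-9].
    public static string ToAlphanumeric(uint value)
    {
        var sb = new StringBuilder();
        do
        {
            sb.Insert(0, Alphabet[(int)(value % 62)]);
            value /= 62;
        } while (value > 0);
        return sb.ToString();
    }

    // Hashes the UTF-8 bytes of the text and returns the short key.
    public static string Hash(string text)
    {
        return ToAlphanumeric(Crc32(Encoding.UTF8.GetBytes(text)));
    }
}
```

A 32-bit value needs at most 6 base-62 characters, which fits the 5-10 character range asked for in the question.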

Kevin Montrose
  • -1: a CRC only provides error checking, not unique collision avoidance. – Simeon Pilgrim Jul 12 '09 at 21:33
  • If you don't need cryptographic guarantees, they do a pretty good job for damn cheap in terms of CPU. – Kevin Montrose Jul 12 '09 at 21:40
  • but two urls will make the same CRC, and therefore have the same short-url, which is useless for a shorting service. – Simeon Pilgrim Jul 12 '09 at 21:46
  • Two urls could also conceivably make the same md5, or sha1, or sha256. In practice these are rare occurrences, but are possible for all hashing schemes given the pigeon-hole principle. More likely with a non-cryptographic hash than with one certainly, but its a case you have to handle regardless of hash function. – Kevin Montrose Jul 12 '09 at 21:55
  • That's why tinyUrl etc dont hash or crc the url, they just assign the next number. Really depends on what actually trying to be solved here. – Simeon Pilgrim Jul 13 '09 at 01:05
  • True, URL shortening services almost certainly publish a counter not a hash; but the question is for a hash function with a short output. – Kevin Montrose Jul 13 '09 at 01:45
0

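You could encode your MD5 hash with Base64 instead of hexadecimal. This gives a shorter URL (about 22 characters instead of 32, once the padding is dropped) using the characters [a-z][A-Z][0-9] plus + and /, which need to be replaced or escaped in a URL.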
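A minimal sketch of this idea, assuming the input is hashed as UTF-8 and that replacing + and / with - and _ (the URL-safe Base64 alphabet) is acceptable; the class name Md5Base64 is hypothetical.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;

public static class Md5Base64
{
    public static string Hash(string text)
    {
        using (var md5 = MD5.Create())
        {
            // 16-byte digest -> 24 Base64 characters, the last two being '=' padding.
            byte[] digest = md5.ComputeHash(Encoding.UTF8.GetBytes(text));
            return Convert.ToBase64String(digest)
                          .TrimEnd('=')        // drop the padding
                          .Replace('+', '-')   // make the result URL-safe
                          .Replace('/', '_');
        }
    }
}
```

The result is always 22 characters long, so it is shorter than the hex form but still above the 5-10 characters asked for; note that it can contain - and _ in addition to letters and digits.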

codymanix
  • Though we do not know what Arron wants to use this for: *if* the URLs are to be entered by humans, then I would make them case insensitive without special characters. Base 36 does this (if the script on the server treats them as such). Unfortunately, Base 36 encoding yields longer URLs than Base 64, but they are less prone to errors. Again: *if* humans would need to type them. – Arjan Jul 12 '09 at 21:49
0

There's a wonderful but ancient program called btoa which converts binary to ASCII using upper- and lower-case letters, digits, and two additional characters. There's also the MIME base64 encoding; most Linux systems probably have a program called base64 or base64encode. Either one would give you a short, readable string from a 32-bit CRC.

Norman Ramsey
-1

You could take the first alphanumeric 5-10 characters of the MD5 hash.

M4N
  • 5
    That's not very unique. The following snippet of code shows that the sequence of numbers from 1 - 1000 has 30 collisions in the first 5 characters: `for f in `seq 0 10000` ; do md5 -s $f ; done | awk '{print substr($4, 0, 5)}' | sort | uniq -c | sort -n` – Stig Brautaset Jul 12 '09 at 21:17
  • 2
    Since he's looking for hash with a length of only 5 characters, I thought that uniqueness is not a strong requirement. – M4N Jul 12 '09 at 21:22
  • Well, referring to TinyURL.com suggests a 100% uniqueness requirement to me. So: no *short* hashes (or *any* hash if I'd program it). – Arjan Jul 12 '09 at 21:27
-2

If you need the hash to change on every call, you can do something like:

string hash = String.Format("{0:X}", DateTime.Now.GetHashCode());
viniciusalvess