I need a 4-character hash. At the moment I am taking the first 4 characters of a md5()
hash. I am hashing a string which is 80 characters long or less. Will this lead to collision? or, what is the chance of collision, assuming I'll hash less than 65,536 (164) different elements?
3 Answers
Well, each character of md5
is a hex bit. That means it can have one of 16 possible values. So if you're only using the first 4 "hex-bits", that means you can have 16 * 16 * 16 * 16
or 16^4
or 65536 or 2^16
possibilities.
So, that means that the total available "space" for results is only 16 bits wide. Now, according to the Birthday Attack/Problem, there are the following chances for collision:
50%
chance ->300
entries1%
chance ->36
entries0.0000001%
chance ->2
entries.
So there is quite a high chance for collisions.
Now, you say you need a 4 character hash. Depending on the exact requirements, you can do:
- 4 hex-bits for
16^4
(65,536) possible values - 4 alpha bits for
26^4
(456,976) possible values - 4 alpha numeric bits for
36^4
(1,679,616) possible values - 4 ascii printable bits for about
93^4
(74,805,201) possible values (assuming ASCII 33 -> 126) - 4 full bytes for
256^4
(4,294,967,296) possible values.
Now, which you choose will depend on the actual use case. Does the hash need to be transmitted to a browser? How are you storing it, etc.
I'll give an example of each (In PHP, but should be easy to translate / see what's going on):
4 Hex-Bits:
$hash = substr(md5($data), 0, 4);
4 Alpha bits:
$hash = substr(base_convert(md5($data), 16, 26)0, 4);
$hash = str_replace(range(0, 9), range('S', 'Z'), $hash);
4 Alpha Numeric bits:
$hash = substr(base_convert(md5($data), 16, 36), 0, 4);
4 Printable Assci Bits:
$hash = hash('md5', $data, true); // We want the raw bytes
$out = '';
for ($i = 0; $i < 4; $i++) {
$out .= chr((ord($hash[$i]) % 93) + 33);
}
4 full bytes:
$hash = substr(hash('md5', $data, true), 0, 4); // We want the raw bytes

- 163,128
- 34
- 264
- 314
-
Just a quick bug fix, for your '4 Alpha bits' solution, I think the second line should be: $hash = str_replace(range(0, 9), range('Q', 'Z'), $hash); – yosser Dec 06 '11 at 11:22
-
Come to think of it, the second one gives a-z + Q-Z as a range = 36 ^ 4 possible values. .. and the whole should read: $hash = substr(base_convert(md5($data), 16, 26),0, 4); $hash = str_replace(range(0, 9), range('q', 'z'), $hash); – yosser Dec 07 '11 at 17:25
Surprisingly high indeed. As you can see from this graph of an approximate collision probability (formula from the wikipedia page), with just a few hundred elements your probability of having a collision is over 50%.
Note, of course, if you're facing the possibility of an attacker providing the string, you can probably assume that it's 100% - scanning to find a collision in a 16-bit search space can be done almost instantaneously on any modern PC. Or even any modern cell phone, for that matter.

- 224,562
- 31
- 268
- 324
4 first characters contains 4*4 = 16 bits of data, so collision will be definitely at 65536 elements, and, due to birthday attack, it will be found much faster. You should use more bits of hash.

- 13,706
- 1
- 34
- 48
-
shouldn't you do 16^4 instead of 4*4 ? cause each characters have 16 variations and md5 uses Hex characters only – Neel Basu Jan 13 '11 at 15:58
-
1