Pigeonhole principle.
Hashes are fixed size, and generally quite small. Java's hashCode()
method is very small indeed: 32 bits. Even something like SHA-256 is only 256 bits (hence the name).
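A quick sketch to see this in action (plain Java; the SHA-256 part uses the standard MessageDigest API) - the output size never grows, no matter how big the input is:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class FixedSizeDemo {
    public static void main(String[] args) throws Exception {
        String tiny = "A";
        String huge = "A".repeat(10_000_000);

        // hashCode() is always a single 32-bit int, no matter the input length.
        System.out.println(tiny.hashCode());
        System.out.println(huge.hashCode());

        // SHA-256 digests are always exactly 32 bytes (256 bits).
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        System.out.println(sha256.digest(tiny.getBytes(StandardCharsets.UTF_8)).length); // 32
        System.out.println(sha256.digest(huge.getBytes(StandardCharsets.UTF_8)).length); // 32
    }
}
```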
Imagine it worked like you thought it worked: Unique things put into the hash algorithm result in unique hash values.
Then I have just solved world hunger and brought about eternal peace; in fact, I have broken the universe: I have invented an infinite compression algorithm that can compress anything in the universe down to 256 bits.
After all, if that were true, I could take a 5GB movie file, hash it, and a unique value pops out. There are only 2^256 possible hash values, which is a lot, but it takes a mere 32 bytes to write one down - and since each value supposedly corresponds to exactly one input, I could just make a table mapping each hash value back to its original file.
Obviously, that cannot be the case.
Let's use some manageable numbers: imagine I have a hash algorithm whose output is only 4 bits (so 0, 1, 2, ..., 15 - just 16 unique values).
I want to uniquely hash the letters A through Z using this hash algorithm. I make a table:
| Input | Hash |
|-------|------|
| A     | 0    |
| B     | 1    |
| C     | 2    |
| ...   | ...  |
| P     | 15   |
I have now hashed A through P to 0 through 15. So... what does Q hash to? Whatever number you choose, we now have a collision. Say we decide to hash Q to 2: then 'I hashed a letter and 2 came out' means the input was either C or Q.
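Here's a toy version of that in code. The modulo-16 'hash' is completely made up, purely for illustration (with this particular one, Q happens to collide with A rather than C, but the point is the same):

```java
public class TinyHashDemo {
    public static void main(String[] args) {
        // A made-up 4-bit "hash": only 16 possible outputs (0..15).
        for (char c = 'A'; c <= 'Z'; c++) {
            int hash = (c - 'A') % 16;
            System.out.println(c + " -> " + hash);
        }
        // 26 inputs, 16 buckets: Q collides with A, R with B, and so on.
        // No choice of 4-bit hash can avoid this.
    }
}
```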
It doesn't matter how large your hashes get: as long as the output is fixed size and the algorithm accepts input of any size (which is what all hash algorithms do - they map ANY input to a fixed-size hash), you run into this issue.
Hence, there will always be different inputs that hash to the same value. That is the pigeonhole principle.
This is no problem:
In cryptographically strong hash algorithms, it is computationally hard to produce, at will, an input that hashes to some desired hash value. In other words, given some data that hashes to X, if I ask you to find different data that also hashes to X, your only real strategy is to keep producing random inputs and hashing them until you get extremely lucky. Cryptographic hashes are generally at least 128 bits in size (usually 256 or more), because the 'generate random inputs and hash until you get lucky' strategy takes mere seconds when the hash is much smaller than 128 bits. However, 'it is 256 bits' does not imply 'therefore it is cryptographically strong'. You need BOTH a large bit size AND an algorithm with the property that it cannot feasibly be reversed.
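To get a feel for how fast the 'keep hashing random inputs until you get lucky' strategy is when the hash is small, here's a rough sketch that truncates SHA-256 down to 24 bits (nobody uses SHA-256 this way; the truncation is purely to shrink the hash space) and finds a collision almost instantly:

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

public class TruncatedCollisionDemo {
    public static void main(String[] args) throws Exception {
        MessageDigest sha256 = MessageDigest.getInstance("SHA-256");
        // Keep only the first 3 bytes (24 bits) of each digest, so there are
        // just 2^24 possible "hash" values to land on.
        Map<Integer, String> seen = new HashMap<>();
        for (long i = 0; ; i++) {
            String input = "input-" + i;
            byte[] digest = sha256.digest(input.getBytes(StandardCharsets.UTF_8));
            int truncated = ((digest[0] & 0xFF) << 16)
                          | ((digest[1] & 0xFF) << 8)
                          |  (digest[2] & 0xFF);
            String previous = seen.putIfAbsent(truncated, input);
            if (previous != null) {
                System.out.println(previous + " and " + input
                        + " share the same 24-bit truncated hash");
                return;
            }
        }
    }
}
```

By the birthday paradox this needs only on the order of 2^12 (a few thousand) attempts; at the full 256 bits, the same strategy would outlast the universe.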
Even in hash algorithms that are not cryptographically strong, such as Java's hashCode()
(which has no requirement to be), this is no problem. A few collisions are fine; HashMap and co just deal with them. If you get a lot of collisions, HashMap's performance degrades, but it continues to work precisely as its API says it works. The rules for hashCode() are simple: if two objects are equal, they must hash to the same value. The reverse is not required at all - two different objects may hash to the same value. No problem. In fact, for any type with more than 2^32 distinct values, collisions are as mathematically inevitable as 1 + 1 = 2.
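For example (Point here is a made-up class, purely for illustration): two different keys with the same hash code, and HashMap copes just fine by falling back on equals():

```java
import java.util.HashMap;
import java.util.Map;

public class CollisionsAreFine {
    // Different points may share a hash code, but equal points always do -
    // which is all the hashCode() contract actually demands.
    record Point(int x, int y) {
        @Override
        public int hashCode() {
            return x; // deliberately ignores y: (1,2) and (1,3) collide
        }
    }

    public static void main(String[] args) {
        Map<Point, String> map = new HashMap<>();
        map.put(new Point(1, 2), "first");
        map.put(new Point(1, 3), "second"); // same hash, different key

        // HashMap uses equals() to tell colliding keys apart.
        System.out.println(map.get(new Point(1, 2))); // first
        System.out.println(map.get(new Point(1, 3))); // second
    }
}
```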
This means this:
    public int hashCode() {
        return 1;
    }
is a valid implementation.
It's also an implementation with far more collisions than necessary: every key lands in the same bucket, so sticking objects with such a hashCode() into a HashMap turns every lookup into a linear scan and makes for a very slow map.
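A more reasonable implementation just spreads the hash codes around; a minimal sketch, assuming a made-up Person class:

```java
import java.util.Objects;

public class Person {
    private final String name;
    private final int age;

    Person(String name, int age) {
        this.name = name;
        this.age = age;
    }

    @Override
    public boolean equals(Object o) {
        return o instanceof Person p && age == p.age && name.equals(p.name);
    }

    @Override
    public int hashCode() {
        // Objects.hash combines the same fields equals() looks at, so equal
        // objects still hash equally, but distinct objects rarely collide.
        return Objects.hash(name, age);
    }
}
```

Objects.hash isn't magic - collisions can still happen, as they must - but they become rare enough that HashMap buckets stay small and lookups stay fast.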