
I'm reading this article on hashing in Java, and it says that

A hashing function may lead to collisions; that is, two or more keys are mapped to the same value.

In order to avoid this, chain hashing is used.

How come a hashing function can map different keys to the same value? I thought that a hash function always generates distinct values.

President James K. Polk
  • You can override the `hashCode()` function – jhamon Apr 18 '23 at 09:09
  • Hash code is just an integer value and therefore there are 2^32 different possible values for a hash code. So for any object that has more than 2^32 possible states, the hash code cannot be unique, because there have to be collisions once you are forced to map more than 2^32 values to 2^32 different hash codes. – OH GOD SPIDERS Apr 18 '23 at 09:13
  • It follows purely from the definition of what hashing does. A hash function is supposed to assign an arbitrary object a fixed number. But there are fewer numbers available than possible object states, so at some point there have to be clashes. As a contrived example, suppose your hashCode has to spit out numbers between 1 and 10 only, but you have 10,000 different possible objects. You can't distinguish all of them anymore. – Zabuzard Apr 18 '23 at 09:20
  • (Cryptographic) hash functions are generally designed so that clashes are rare and hard to find, but it's impossible to avoid them entirely. That said, depending on the exact implementation, it can be quite easy to find clashes. For example, Java's `String#hashCode` collides fairly easily (`"Aa"` collides with `"BB"`), see: https://stackoverflow.com/questions/9406775/why-does-string-hashcode-in-java-have-many-conflicts – Zabuzard Apr 18 '23 at 09:22
  • @OH GOD SPIDERS you answered first. I would mark your answer if it weren't a comment. – jachai.sotaro Apr 18 '23 at 09:24
  • A hash function returns an integer value, and it is expected to have collisions. With SHA-256 the risk is lowered. If you could get a unique 32-bit value for every input, you would be able to compress anything down to 32 bits, which is absurd. – Popeye Apr 18 '23 at 09:26

3 Answers


A hashing function is just that: a function, taking an input and producing an output. Take any function, for example

H(x) = x mod 10 

which is perfectly valid as a function. You'll notice that for inputs 5, 15, 25 and so on it produces the exact same output. The same is true for any other hashing function that produces results from a finite co-domain. In the case of Java, the co-domain is the set of 32-bit integers: [-2^31, 2^31 - 1].

There are only 2^32 possible outputs for a potentially infinite number of inputs, so by the pigeonhole principle there must be at least two inputs that produce the same output.
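
To make that concrete, here is a minimal Java sketch of the toy function above (the class and method names are just placeholders I chose for the example, not anything from the article):

public class ModHashDemo {
    // Toy hash function from the answer: H(x) = x mod 10
    static int h(int x) {
        return x % 10;
    }

    public static void main(String[] args) {
        // 5, 15 and 25 are different inputs, but all map to the same output
        System.out.println(h(5));   // 5
        System.out.println(h(15));  // 5
        System.out.println(h(25));  // 5
    }
}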

MC Emperor
QBrute

Let's use the example of String. Its hashCode() returns a Java int (4 bytes, or 32 bits) and is calculated as follows:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

By this formula, "Aa" and "BB" have the same hashCode, i.e. 2112 - see this answer
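
As a quick check, here is a self-contained sketch (the helper name manualHash is my own choice, not part of the JDK) that applies the formula by hand and compares it with what String.hashCode() actually returns:

public class StringHashCollision {
    // Manually apply s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
    static int manualHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        System.out.println("Aa".hashCode());   // 2112
        System.out.println("BB".hashCode());   // 2112, collision with "Aa"
        System.out.println(manualHash("Aa"));  // 2112, matches hashCode()
        System.out.println(manualHash("BB"));  // 2112, matches hashCode()
    }
}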

In general terms, a hashCode can take one of 2^32 (4,294,967,296) possible values, so the chance of two arbitrary objects colliding is roughly 1 in 2^32. Pretty good odds.

A cryptographic hash, however, has a much larger range: SHA-256, for example, has 2^256 (about 1.16 x 10^77) possible values. There is practically no chance of a clash.

John Williams

Pigeonhole principle.

Hashes have the property that they are fixed size and generally quite small. Java's hashCode() is very small indeed: 32 bits. But even something like SHA-256 is only 256 bits (hence the name).

Imagine it worked like you thought it worked: Unique things put into the hash algorithm result in unique hash values.

Then I would have fixed world hunger and eternal peace; in fact, I would have broken the universe. I would have invented an infinite compression algorithm that can compress the entire universe into 256 bits.

After all, if what you say is true, I can take a 5GB movie file, hash it, and a unique value pops out. There are only 2^256 unique values, which is a lot, but it takes only 256 bits (a mere 32 bytes) to uniquely represent the file. I could just make a table mapping each hash value back to the original file.

Obviously, that cannot be the case.

Let's use some manageable numbers: imagine I have a hash algorithm whose output is 4 bits only (so 0, 1, 2, ..., 15 - just 16 unique numbers).

I want to uniquely hash the strings A through Z using this hash algorithm. If I then make a table:

Input  Hash
A      0
B      1
C      2
...    ...
P      15

I have now hashed A through P to 0-15. So... what does Q hash to? Whatever number you choose, we now have a collision. Let's say we decide to hash Q to 2. Then 'I hashed a letter and 2 came out' means that the input was either C or Q.
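
To make the toy example runnable, here is a small sketch; the particular 4-bit function (letter position mod 16) is my own stand-in choice, so the exact collision pairs differ from the Q-to-2 choice above, but the effect is the same:

public class FourBitHashDemo {
    // A toy 4-bit hash: only 16 possible outputs (0..15)
    static int fourBitHash(char c) {
        return (c - 'A') % 16;
    }

    public static void main(String[] args) {
        for (char c = 'A'; c <= 'Z'; c++) {
            System.out.println(c + " -> " + fourBitHash(c));
        }
        // 26 letters into 16 buckets: 'Q' lands on 0, same as 'A';
        // 'R' lands on 1, same as 'B'; collisions are unavoidable.
    }
}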

It doesn't matter how large your hashes get: as long as they are fixed size and the hash algorithm can hash input of any size (which is what all hash algorithms do - they hash ANY input to a fixed-size hash), you run into this issue.

Hence, there will always be different inputs that hash to the same value.

This is no problem:

  • In cryptographically strong hash algorithms, it is computationally hard to produce, at will, input that hashes to some desired hash value. In other words, given some data that hashes to X, if I ask you to find me some different data that also hashes to X, your only real strategy is to continually produce random inputs and hash them until you get really lucky. Yes, cryptographic hashes are generally at least 128 bits in size (usually 256 or even more), because the 'generate random inputs and hash them until you get lucky' strategy takes mere seconds if the hash values are smaller than 128 bits. However, 'it has 256 bits' is not equivalent to 'therefore it is cryptographically strong'. You need BOTH a large bit size AND an algorithm that is mathematically not reversible.

  • Even in hash algorithms that are not cryptographically strong, such as Java's hashCode() (which has no requirement to be), this is no problem. A few collisions are fine; HashMap and co just deal with it. If you get a lot of collisions, HashMap's performance becomes bad, but it continues to work precisely as its API says it works. The rules for hashCode are simple: if 2 objects are equal, they must hash to the same value. But the reverse is not required at all - 2 different objects may hash to the same value. No problem. In fact, for any type that has more than 2^32 possible values, this is as mathematically inevitable as 1 + 1 = 2.

This means this:

public int hashCode() {
  return 1;
}

is a valid implementation.

It's also an implementation that has far more collisions than necessary, so sticking objects with such hashCode definitions in hashmaps makes for very slow hashmaps.
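
For illustration, a small sketch (the BadKey class and the entry count are my own choices, not from the answer) showing that a constant hashCode() is still correct for a HashMap; it just piles every entry into a single bucket:

import java.util.HashMap;
import java.util.Map;

public class ConstantHashCodeDemo {
    // A key whose hashCode() is legal but maximally collision-prone
    static final class BadKey {
        final int id;
        BadKey(int id) { this.id = id; }

        @Override
        public boolean equals(Object o) {
            return o instanceof BadKey && ((BadKey) o).id == id;
        }

        @Override
        public int hashCode() {
            return 1; // valid: equal objects still get equal hash codes
        }
    }

    public static void main(String[] args) {
        Map<BadKey, String> map = new HashMap<>();
        for (int i = 0; i < 10_000; i++) {
            map.put(new BadKey(i), "value-" + i);
        }
        // Lookups are still correct, but every key sits in the same bucket,
        // so each operation degrades from O(1) towards O(log n) / O(n).
        System.out.println(map.get(new BadKey(1234))); // value-1234
        System.out.println(map.size());                // 10000
    }
}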

rzwitserloot