Why does String.hashcode() have so many conflicts?
I'm reading the String.hashCode() in jdk1.6, below is the codes
public int hashCode() {
int h = hash;
if (h == 0) {
int off = offset;
char val[] = value;
int len = count;
for (int i = 0; i < len; i++) {
h = 31*h + val[off++];
}
hash = h;
}
return h;
}
This looks quite confusing to me because it has so many conflicts; although it's not required to be unique (we can still rely on the equals()), but less conflicts means better performance without visiting the entries in a linked list.
Suppose we've two characters, then so long as we can find two strings matching below equation, then we will have the same hashcode()
a * 31 +b = c * 31 +d
It will be easy to conclude that (a-c) * 31 = d-b
take an easy example is make a-c = 1 and d-b = 31;
so i wrote below codes for simple test
public void testHash() {
System.out.println("A:" + (int)'A');
System.out.println("B:" + (int)'B');
System.out.println("a:" + (int)'a');
System.out.println("Aa".hashCode() + "," + "BB".hashCode());
System.out.println("Ba".hashCode() + "," + "CB".hashCode());
System.out.println("Ca".hashCode() + "," + "DB".hashCode());
System.out.println("Da".hashCode() + "," + "EB".hashCode());
}
it will print below results which means all the strings have the same hashcode(), and it's easy to do it in a loop.
A:65
B:66
a:97
2112,2112
2143,2143
2174,2174
2205,2205
even worse, suppose we've 4 characters in the string, according to the algorithm, suppose the first 2 characters produce a2, the 2nd 2 characters produce b2;
the hashcode will still be a2 * 31^2 + b2
thus, with a2 and b2 equal between 2 strings, we will get more strings with hashcode() conflict.
such examples are "AaAa", "BBBB" and so on;
then we will have 6 characters, 8 characters......
suppose most of the time we use characters in ascii table in a string which will be used in a hashmap or hashtable, then the choosen prime number 31 here is definitely too small;
one easy fix is to use a larger prime number (luckily, 257 is a prime number) which can avoid this conflict. of course, choose a too big number will cause returned int value to be overflowed if the string is very long, but i assume most of the time the string used as a key is not that big? of course, it could still return a long value to avoid this.
below is my modified version of betterhash() which can solve such conflicts easily by running the codes it will print below values, which is effective to solve this problem.
16802,17028
17059,17285
17316,17542
17573,17799
but why jdk does not fix it? thx.
@Test
public void testBetterhash() {
System.out.println(betterHash("Aa") + "," + betterHash("BB"));
System.out.println(betterHash("Ba") + "," + betterHash("CB"));
System.out.println(betterHash("Ca") + "," + betterHash("DB"));
System.out.println(betterHash("Da") + "," + betterHash("EB"));
}
public static int betterHash(String s) {
int h = 0;
int len = s.length();
for (int i = 0; i < len; i++) {
h = 257*h + s.charAt(i);
}
return h;
}