37

Why does String.hashcode() have so many conflicts?

I'm reading the source of String.hashCode() in JDK 1.6; below is the code:

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

This looks problematic to me because it produces so many collisions. Hash codes are not required to be unique (we can still rely on equals()), but fewer collisions mean better performance, because we avoid walking the entries in a bucket's linked list.

Suppose we have two-character strings. As long as we can find two strings satisfying the equation below, they will have the same hashCode():

a * 31 + b = c * 31 + d

It is easy to conclude that (a - c) * 31 = d - b. A simple example is a - c = 1 and d - b = 31, so I wrote the code below as a quick test:

public void testHash() {
    System.out.println("A:" + (int)'A');
    System.out.println("B:" + (int)'B');
    System.out.println("a:" + (int)'a');

    System.out.println("Aa".hashCode() + "," + "BB".hashCode());
    System.out.println("Ba".hashCode() + "," + "CB".hashCode());
    System.out.println("Ca".hashCode() + "," + "DB".hashCode());
    System.out.println("Da".hashCode() + "," + "EB".hashCode());        
}

It prints the results below, which shows that each pair of strings has the same hashCode(); it is easy to generate more such pairs in a loop (a sketch of such a loop follows the output).

A:65 
B:66
a:97
2112,2112
2143,2143
2174,2174
2205,2205
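
For instance, here is a minimal sketch of such a loop (my own illustration, not part of the original test): for any first character c, the strings (c, 'a') and (c + 1, 'B') collide, because 1 * 31 == 'a' - 'B'.

for (char c = 'A'; c <= 'Z'; c++) {
    String s1 = "" + c + 'a';
    String s2 = "" + (char) (c + 1) + 'B';
    // both hash to 31 * c + 97, so every line prints a colliding pair
    System.out.println(s1 + "=" + s1.hashCode() + ", " + s2 + "=" + s2.hashCode());
}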

Even worse, suppose the string has 4 characters. According to the algorithm, if the first 2 characters produce a2 and the last 2 characters produce b2, the hash code will be a2 * 31^2 + b2. So whenever two strings agree on a2 and b2 we get even more colliding strings, for example "AaAa" and "BBBB"; the same applies to 6 characters, 8 characters, and so on.
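
To make this concrete, here is a small sketch (my own illustration, not from the original post): since "Aa" and "BB" already share a hash, every 4-character string built from those two blocks collides as well, and the same trick scales to 6, 8, ... characters.

String[] blocks = {"Aa", "BB"};
for (String first : blocks) {
    for (String second : blocks) {
        String s = first + second;
        // all four combinations print the same value: 2112 * 31^2 + 2112 = 2031744
        System.out.println(s + " -> " + s.hashCode());
    }
}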

Assuming that most of the time the strings used as keys in a HashMap or Hashtable contain only ASCII characters, the chosen prime number 31 seems far too small.

One easy fix is to use a larger prime number (luckily, 257 is prime), which avoids this particular collision. Of course, choosing too big a number will cause the returned int to overflow if the string is very long, but I assume most strings used as keys are not that big? The method could also return a long to avoid this.

Below is my modified version, betterHash(), which avoids such collisions easily; running the test code prints the values below, showing that the collisions are gone.

16802,17028
17059,17285
17316,17542
17573,17799

So why doesn't the JDK fix this? Thanks.

@Test
public void testBetterhash() {
    System.out.println(betterHash("Aa") + "," + betterHash("BB"));      
    System.out.println(betterHash("Ba") + "," + betterHash("CB"));
    System.out.println(betterHash("Ca") + "," + betterHash("DB"));
    System.out.println(betterHash("Da") + "," + betterHash("EB"));
}

public static int betterHash(String s) {
    int h = 0;
    int len = s.length();

    for (int i = 0; i < len; i++) {
        h = 257*h + s.charAt(i);
    }
    return h;
}
Ragunath Jawahar
hetaoblog
  • If you want to get nitpicky, there are theoretically an infinite number of collisions, no matter how strong your hash. – Jeffrey Feb 23 '12 at 03:37
  • I believe you meant keys in a map instead of entries in a LinkedList. – LazyCubicleMonkey Feb 23 '12 at 03:50
  • If you want to claim your hash is 'better', you should actually procedurally hash, say, every string (at least ascii ones) under 8 or so characters, and then compare the number of collisions against the original. I'd imagine 8 characters is *much* closer to the average key size than 4 (if not still on the low side!). – Kitsune Feb 23 '12 at 03:51
  • Indeed, hetaoblog says that the new hash will overflow if the key is "very long;" in fact it will do so with only 5 characters. – Mark Peters Feb 23 '12 at 04:16
  • @MarkPeters thx for pointing this out, I did not realize this-_- just ran a test and that happens as you said-_- – hetaoblog Feb 23 '12 at 04:32
  • I may be misunderstanding, but your better hash solution is to replace one constant factor with another (also both primes, so they have basically the same mathematical properties) and deem it better, because there are no collisions based on 4 arbitrary picked strings? What if I pick 4 arbitrary other strings? Suddenly the old version is better again.. There are better hashcodes out there and you can mathematically prove that they're better, but they're much more complex - and slower. – Voo Feb 23 '12 at 04:49
  • I'm leaving this comment as a complaint to those that closed this as a non-constructive question. The question "Why does String.hashcode() have so many conflicts?" really is one that deserves discussion in any technology Q&A. People need to understand the objective, factual answer to this question. I also think the answers on this question have done a nice job answering the objective question. I also think the poster provided a nice solution to the specific problem he faced. We all need to write domain-specific hash functions from time to time. – D. A. Mar 03 '14 at 02:46
  • I agree that this was closed and think it should stay closed. Apart from the first sentence, this is not really a question at all. It is someone presenting some (unconvincing) evidence and an alternative algorithm with pretty obvious flaws. And besides, even the literal question in the first sentence is actually a *rhetorical* question; i.e. one where the person posing the question is asking it for rhetorical effect, not because they need to know the answer. – Stephen C Aug 31 '19 at 03:31

4 Answers

60

I just hashed 58 thousand English-language words (found here), both all-lowercase and also with the first letter capitalized. Know how many collided? Two: "Siblings" and "Teheran" (an alternate spelling of "Tehran").

Just like you, I took a subdomain (in my case a likely one) of possible strings and analyzed the hashCode collision rate for it, and found it to be exemplary. Who is to say that your arbitrary subdomain of possible strings is a better choice to optimize for than mine?

The people that wrote this class had to do so knowing that they couldn't predict (nor therefore optimize) the subdomain in which their users would be using Strings as keys. So they chose a hash function which distributes evenly over the entire domain of strings.

If you're interested, here's my code:

// group all the words by hashCode, then keep only the hash values shared by more than one word
Map<Integer, List<String>> collisions = Files.lines(Paths.get(System.getProperty("user.home") + "/corncob_lowercase.txt"))
    .flatMap(word -> Stream.of(word, word.substring(0, 1).toUpperCase() + word.substring(1)))
    .collect(Collectors.groupingBy(String::hashCode))
    .entrySet()
    .stream()
    .filter(e -> e.getValue().size() > 1)
    .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue));

System.out.printf("Number of collisions: %d%n", collisions.size());
collisions.forEach((hash, words) -> System.out.printf("%d: %s%n", hash, words));

Edit

By the way, if you're curious, the same test with your hash function produced 13 collisions, compared to String.hashCode's 1.
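
The answer doesn't show how that comparison was run; here is a sketch of one way to reproduce it under the same assumptions (same word-list file as above; the harness itself is my own reconstruction, not the answer's actual code):

import java.nio.file.*;
import java.util.function.Function;
import java.util.stream.*;

public class CompareHashes {
    public static void main(String[] args) throws Exception {
        countCollisions("String.hashCode", String::hashCode);
        countCollisions("betterHash", CompareHashes::betterHash);
    }

    // count hash values shared by more than one word (lowercase + Capitalized forms)
    static void countCollisions(String name, Function<String, Integer> hash) throws Exception {
        long collisions = Files.lines(Paths.get(System.getProperty("user.home") + "/corncob_lowercase.txt"))
            .flatMap(w -> Stream.of(w, w.substring(0, 1).toUpperCase() + w.substring(1)))
            .collect(Collectors.groupingBy(hash))
            .values().stream()
            .filter(words -> words.size() > 1)
            .count();
        System.out.println(name + ": " + collisions + " collisions");
    }

    // the question's proposed hash, repeated here so the class is self-contained
    static int betterHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 257 * h + s.charAt(i);
        }
        return h;
    }
}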

Mark Peters
  • the current collisions are higher if you combine upper/lower more, it looks to me higher than 1/1000; i did a similar test based on your data; here're my steps, 1. download the file for lower case, 2. read the first 5000 lines (reading too many will cause OOM in my program), for each line, list all possible uppercase/lowercase combinations, such as AAA/AAa/AaA/Aaa/aAA/aAa/aaA/aaa, 3. record the total number of strings and different hashcodes using a Set 4. print the numbers; so i got 8021600 as total number, 8012142 as distinct hash values, 9458 as collision count – hetaoblog Feb 23 '12 at 08:24
  • @hetaoblog: But mixed case isn't representative of natural language, which is what I was going for. NoBOdy TYPEs lIKe THiS. – Mark Peters Feb 23 '12 at 14:02
  • @markPeters, that comment did not age well because of the [mocking Spongebob Meme](https://knowyourmeme.com/memes/mocking-spongebob) – cahilltr Jan 28 '20 at 15:16
18

I'm sorry, but we need to throw some cold water on this idea.

  1. Your analysis is way too simplistic. You seem to have cherry-picked a subset of Strings that is designed to prove your point. This is not evidence that the number of collisions is (statistically) higher than expected across the domain of all strings.

  2. Nobody in their right mind would expect String.hashCode to be highly collision free1. It is simply not designed with that in mind. (If you want highly collision free hashing, then use a crypto hash algorithm ... and pay the cost.) String.hashCode() is designed to be reasonably good across the domain of all Strings ... and fast.

  3. Assuming that you could state a stronger case, this is not the place to state it. You need to raise this issue with the people who matter - Oracle's Java engineering team.

  4. The current algorithm for String::hashCode has been part of the javadoc specification for String since Java 1.2. (And the algorithm almost certainly goes back to Java 1.0 and earlier.) If the algorithm was changed, it would be a breaking change for some applications. This is probably enough to kill the idea.

  5. The Java engineering team are going to weigh up the advantages of such a change against the costs of implementing it, for them, and for every user of Java.

The costs to users would include dealing with various potential performance and security issues, and the migration of any stored data that has dependencies on hashcodes. Or the cost of having more legacy applications that cannot realistically be ported to the latest version of Java.


1 - "Highly collision free hashing", is an idea / term that I pulled out of the air for the purposes of this answer. Sorry. However, the gist is that the probability of a hashcode collision for 2 strings should be independent of how related they are. So for instance "AA" and "bz" are related by virtue of having the same length. Obviously, this idea needs more thought. And it is also obvious that "relatedness" in the sense I'm talking about is not measurable ... sort of like Kolmogorov Complexity.)

Stephen C
14

Collisions are inevitable when hashing. The hashCode() method returns an integer that is mapped to an index into an array of buckets; each bucket holds all the objects whose hash codes land there. The equals(Object) method is then used to compare the target object with each one in the bucket to identify the exactly matching object, if it exists.

Ultimately, the hashCode() method just needs to be fast and not too weak (i.e. it should not cause too many collisions), where "too weak" is quite a fuzzy metric.
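
A minimal sketch of that lookup idea (my own illustration, not part of the answer): the hash selects a bucket, and equals() resolves collisions within it.

import java.util.*;

class BucketLookupSketch {
    // buckets[i] holds every entry whose hash code maps to index i
    static <K, V> V get(List<Map.Entry<K, V>>[] buckets, K key) {
        int index = (key.hashCode() & 0x7fffffff) % buckets.length; // reduce the hash to a bucket index
        for (Map.Entry<K, V> e : buckets[index]) {                  // scan the colliding entries
            if (e.getKey().equals(key)) {
                return e.getValue();                                // exact match found via equals()
            }
        }
        return null;                                                // key not present
    }
}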

maerics
2

It is pretty efficient but also simple. All possible lower-case (ASCII) words up to six letters, and all numbers up to six digits long, have a unique hashCode(); i.e. the hashCode works like a base-31 number. Using a larger factor has its own issues. A factor of 257 would leave every 8th bit not particularly random, as all ASCII characters have a 0 top bit. A larger factor would result in duplicate hashcodes for five- and six-digit/letter words.
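
As a quick sanity check of the base-31 claim (my own sketch, not from the answer), one can brute-force all three-letter lower-case strings and confirm that none of them collide; the same argument extends up to six letters.

import java.util.*;

public class ThreeLetterCollisionCheck {
    public static void main(String[] args) {
        Set<Integer> seen = new HashSet<>();
        int collisions = 0;
        for (char a = 'a'; a <= 'z'; a++)
            for (char b = 'a'; b <= 'z'; b++)
                for (char c = 'a'; c <= 'z'; c++)
                    if (!seen.add(("" + a + b + c).hashCode()))
                        collisions++;
        System.out.println("Collisions among 26^3 strings: " + collisions); // prints 0
    }
}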

Perhaps the bigger problem is that you can't change the hashing algorithm. Whatever approach you take, there can be cases where it is a very bad choice, and it is likely to be sub-optimal for your particular use case.

Perhaps the biggest problem is denial-of-service attacks, which make pathological cases that are normally very rare quite common. For example, one way to attack a web server is to fill a cache with keys that all have the same hashCode, e.g. 0, which (being the "not yet cached" sentinel in String) is recalculated on every call. This causes the HashMap to degenerate into a linked list.
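
Here is a sketch of how such pathological keys could be mass-produced (my own illustration, building on the question's "Aa"/"BB" example rather than anything in this answer): every string assembled from n equal-hash blocks collides, so 2^n keys end up in a single bucket.

import java.util.*;

public class EqualHashKeys {
    public static void main(String[] args) {
        List<String> keys = Collections.singletonList("");
        for (int block = 0; block < 16; block++) {   // 2^16 = 65,536 keys of 32 characters each
            List<String> next = new ArrayList<>(keys.size() * 2);
            for (String k : keys) {
                next.add(k + "Aa");   // "Aa" and "BB" share a hash and a length,
                next.add(k + "BB");   // so either choice yields the same hashCode
            }
            keys = next;
        }
        long distinctHashes = keys.stream().map(String::hashCode).distinct().count();
        System.out.println(keys.size() + " keys, " + distinctHashes + " distinct hash value(s)"); // 65536 keys, 1
    }
}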

A simple way around this is to make the hash algorithm unknown to attackers, and possibly to change it over time. As it stands, the best option may be to use a TreeMap (which supports a custom Comparator, though the natural ordering would be fine in this case).

Peter Lawrey