37

I've been investigating the hashCode() methods in java and found the one for String class strange. The source code is as follows:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

The code itself is quite straight forward. But I wonder what's the reason for calculating hash code this way?
Why choose 31?
Why start from 0 instead of value.length - 1?
Any guarantee that this would make hashcodes less possible to collide with each other?

Shreyos Adikari
  • 12,348
  • 19
  • 73
  • 82
HarryLv
  • 389
  • 2
  • 4
  • 7
  • 2
    Check this answer: http://stackoverflow.com/questions/113511/hash-code-implementation – NilsH Mar 20 '13 at 08:20
  • 3
    And this http://stackoverflow.com/a/299748/305142 – Serkan Arıkuşu Mar 20 '13 at 08:22
  • The reason for the number 31 is because it's a prime number to avoid collision. – Hamid Dec 13 '18 at 03:14
  • Nothing to do with the subject, BUT, why oracle uses this kind of code "int h = hash"?? Why not perform the operations directly on hash?? I see this all over oracle code!!! I don't understand this... "char val[] = value" puzzles me EVEN MORE!!! – marcolopes May 14 '20 at 04:23

1 Answers1

10

Yes the probability of hashcode collision is very low as for example in case of String it depends upon the string value. If we are not creating any String with new operator then if the a new String has the same value that already present , then new String object is not created, it refers to the old value from heap and in this case only the value of hashCode will be same as expected.

The general contract of hashCode is:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

From Java 1.2, java.lang.String class implements its hashCode() using a product sum algorithm over the entire text of the string.[2] Given an instance s of the java.lang.String class, for example, would have a hash code h(s) defined by

h(s)=s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

where terms are summed using Java 32-bit int addition, s[i] denotes the ith character of the string, and n is the length of s.

For your reference in Apache Harmony the method hashCode is:

public int hashCode() {
    if (hashCode == 0) {
        int hash = 0, multiplier = 1;
        for (int i = offset + count - 1; i >= offset; i--) {
            hash += value[i] * multiplier;
            int shifted = multiplier << 5;
            multiplier = shifted - multiplier;
        }
        hashCode = hash;
    }
    return hashCode;
}
Shreyos Adikari
  • 12,348
  • 19
  • 73
  • 82
  • 2
    It seems curious that they were willing to change the hash code implementation in 1.2, but have not since been willing to add something like `hashCode = (hash==0) ? count+1 : hash;` so as to avoid having repeated calls to `hashCode()` take excessively long with certain strings. The existing implementation doesn't cause such slowdowns with many strings, but any string which ever causes the slow behavior will always cause it. – supercat Jul 10 '13 at 00:04
  • @supercat: your approach would work if there is always only one string with the same content. Java mostly interns Strings, but you can still have two copies having the same characters. The hashCode method is supposed to be consistent with equals(), so your approach is not valid. It would e.g. break behaviour of HashMap and HashSet (contains, remove, etc. might fail when they shouldn't). – Peter Becker Oct 16 '13 at 23:17
  • 2
    @PeterBecker: Perhaps I wasn't clear what I was proposing? Any particular sequence of characters would always return the same hash value under my proposal; the only change would be that strings which under the existing algorithm would hash to zero would yield to a value that depended upon the number of characters in the sequence (which would always be the same for any any particular sequence). What's problematical as it turns out isn't hash sets, but rather switch statements. If a string in a switch statement would hash to zero, such assumption will be hard-wired into the compiled code. – supercat Oct 17 '13 at 15:00
  • @PeterBecker: Thus, a switch statement would assume any string which cached a `hashCode` value that wasn't zero could never trigger a switch statement for a string which under the old hashCode method would yield zero. There are other approaches which could be used to allow such strings' hash codes to be cached while still returning zero and without requiring extra fields, but they would slow down some other string operations. For example, one could specify that if `count` is -1, then the "real" count is stored in `hash` and the `hashCode` value should return as zero. – supercat Oct 17 '13 at 15:03
  • Downvoter, can you please add comment. – Shreyos Adikari Mar 06 '17 at 17:02
  • It would be good to clarify that this hashCode implementation in String is not just a hidden implementation detail, but actually part of the contract. – Samuel Edwin Ward Feb 06 '18 at 19:55