Why libraries prefer to recalculate hashcode on immutable objects in edge case?

Question

In guava code (which I think one of examples of high quality code) I've found following fragment:

// If the cachedHashCode is 0, it will always be recalculated, unfortunately.
private transient int cachedHashCode;

public final int hashCode() {
   // Racy single-check.
  int code = cachedHashCode;
  if (code == 0) {
    cachedHashCode = code = element.hashCode();
  }
  return code;
}

So "If the cachedHashCode is 0, it will always be recalculated, unfortunately". Another example is JDK String.hashCode:

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

It also tries to calculate hash code once, but fails if hashCode of String is 0 (example of such string is "\0").

Simple solution to avoid such recalculations would be addition of extra check on calculation:

if (hash == 0) hash++;

Although it very slightly slows the hashCode calculation down in general case but this trick allows to avoid worst case scenario when it repeatedly (and slowly (for long string for example)) calculated again and again. Why it isn't used in guava ImmutableSet nor JDK String?

Edit

Recent Java 7 releases added custom String.hash32 implementation which contains handling of this special case:

       // ensure result is not zero to avoid recalcing
       h = (0 != h) ? h : 1;

possible duplicate of [Why doesn't String's hashCode() cache 0?](http://stackoverflow.com/questions/2310498/why-doesnt-strings-hashcode-cache-0) — , Dec 26 '13 at 06:03
The penalty of recalculating hash code for strings such as "\0" is probably so small that is isn't worth to put in college-style/contrived code such as "if (hash == 0) hash++;" — aquaraga, Dec 26 '13 at 06:03

Stephen C · Answer 1 · 2013-12-26T10:41:08.437

1

It is done that way to save space.

For example, if String didn't use the hash value zero to mean (not cached), then the String class would need an extra boolean flag to say that the hash value wasn't cached.

So the trade-off is a one-in-4-billion probability¹ of recalculating the hash every time versus one extra word for every string object.

^{1 - This assumed that one choses string objects randomly from the domain of all possible Java strings. Real programs don't work that way ... but the point is that the impact of recalculating the hashcode is unlikely to be significant, unless you deliberately engineer the application to that end.}

edited Dec 26 '13 at 10:41

answered Dec 26 '13 at 07:42

Stephen C

698,415
94
811
1,216

Aside from a desire to retain compatibility with any persisted string hash values, would there have been any problem with having the hash-code logic check whether the computed value from a non-empty string was zero and, if so, increment it? Depending upon the optimizer, that would boil down to 1-3 fast instructions, and would avoid a potentially disastrous worst-case behavior. – supercat Dec 26 '13 at 21:04
Those 1-3 instructions would be executed each time a (new) string is hashed. And the cost of hashing an empty string is not that large. (Disastrous is a massive overstatement!) The result of your proposed optimization would probably be a slowdown for most use-cases. – Stephen C Dec 26 '13 at 22:57
The cost of hashing an empty string is trivial, which is why having an empty string return zero would not be a problem. What can be disastrous is having repeated lookups of some large strings take orders of magnitude longer than lookups of others. – supercat Dec 26 '13 at 23:38
@supercat - But assuming that the string is chosen randomly, the probability is very, very low that you will get a bad case. The concern is with long strings that are *deliberately crafted* to have a hash value of zero ... in order to cause something to break. That is a special case, and it can be dealt with in other ways ... if the situation requires it. – Stephen C Dec 27 '13 at 00:49

Why libraries prefer to recalculate hashcode on immutable objects in edge case?

Edit

1 Answers1