Why doesn't String's hashCode() cache 0?

Question

I noticed in the Java 6 source code for String that hashCode only caches values other than 0. The difference in performance is exhibited by the following snippet:

public class Main{
   static void test(String s) {
      long start = System.currentTimeMillis();
      for (int i = 0; i < 10000000; i++) {
         s.hashCode();
      }
      System.out.format("Took %d ms.%n", System.currentTimeMillis() - start);
   }
   public static void main(String[] args) {
      String z = "Allocator redistricts; strict allocator redistricts strictly.";
      test(z);
      test(z.toUpperCase());
   }
}

Running this in ideone.com gives the following output:

Took 1470 ms.
Took 58 ms.

So my questions are:

Why doesn't String's hashCode() cache 0?
What is the probability that a Java string hashes to 0?
What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?
Is this the best-practice way of caching values? (i.e. cache all except one?)

For your amusement, each line here is a string that hash to 0:

pollinating sandboxes
amusement & hemophilias
schoolworks = perversive
electrolysissweeteners.net
constitutionalunstableness.net
grinnerslaphappier.org
BLEACHINGFEMININELY.NET
WWW.BUMRACEGOERS.ORG
WWW.RACCOONPRUDENTIALS.NET
Microcomputers: the unredeemed lollipop...
Incentively, my dear, I don't tessellate a derangement.
A person who never yodelled an apology, never preened vocalizing transsexuals.

LOL! +1 for a fine example of trolling done the übergeek way! — Mike Nakis, Jul 25 '13 at 17:00
[things have changed since this question was asked](https://stackoverflow.com/a/64082954/1059372) — Eugene, Sep 26 '20 at 22:12

Kevin Bourrillion · Accepted Answer · 2011-10-11T18:15:46.490

59

You're worrying about nothing. Here's a way to think about this issue.

Suppose you have an application that does nothing but sit around hashing Strings all year long. Let's say it takes a thousand strings, all in memory, calls hashCode() on them repeatedly in round-robin fashion, a million times through, then gets another thousand new strings and does it again.

And suppose that the likelihood of a string's hash code being zero were, in fact, much greater than 1/2^32. I'm sure it is somewhat greater than 1/2^32, but let's say it's a lot worse than that, like 1/2^16 (the square root! now that's a lot worse!).

In this situation, you have more to benefit from Oracle's engineers improving how these strings' hash codes are cached than anyone else alive. So you write to them and ask them to fix it. And they work their magic so that whenever s.hashCode() is zero, it returns instantaneously (even the first time! a 100% improvement!). And let's say that they do this without degrading the performance at all for any other case.

Hooray! Now your app is... let's see... 0.0015% faster!

What used to take an entire day now takes only 23 hours, 57 minutes and 48 seconds!

And remember, we set up the scenario to give every possible benefit of the doubt, often to a ludicrous degree.

Does this seem worth it to you?

EDIT: since posting this a couple hours ago, I've let one of my processors run wild looking for two-word phrases with zero hash codes. So far it's come up with: bequirtle zorillo, chronogrammic schtoff, contusive cloisterlike, creashaks organzine, drumwood boulderhead, electroanalytic exercisable, and favosely nonconstruable. This is out of about 2^35 possibilities, so with perfect distribution we'd expect to see only 8. Clearly by the time it's done we'll have a few times that many, but not outlandishly more. What's more significant is that I've now come up with a few interesting band names/album names! No fair stealing!

edited Oct 11 '11 at 18:15

answered Feb 22 '10 at 18:41

Kevin Bourrillion

40,336
12
74
87

2

This is a very practical argument. Out of curiosity, though, is this kind of caching mechanism common elsewhere too? That is, if attempting to cache all values necessitates an extra flag, then it's best practice to just sacrifice one value to be uncacheable? – polygenelubricants Feb 22 '10 at 20:40
2

I'm sure I've used this trick a time or two. Of course, the demands on the String class are quite extraordinary compared to most classes. Nicely appropriate username btw :) – Kevin Bourrillion Feb 22 '10 at 20:49
20

Yes, I've been quite obsessed with String's hashCode() recently as evidenced by my username. Joshua Bloch in the July 23, 2007 Google Tech Talk video said that he found "polygenelubricants" among (200,000)^2 word pairs in 10 minutes. I took advantage of the hash function properties to do it in O(N) in just a few seconds. The following string, for example, also hashes to MIN_VALUE: `"And so my fellow mismanagements: ask not what your newsdealer can sugarcoat for you -- ask what you can sugarcoat for your newsdealer."` – polygenelubricants Feb 22 '10 at 21:00
6

If the strings are coming from users, then the probability is close to 1. You know someone's going to try it. – Antimony Aug 22 '12 at 18:25
1

I guess it may be worth it to polygenelubricants as servers takes longer to answer queries that forces them to recalculate the hash:) Note to self: choose username wisely – jontejj Mar 20 '13 at 13:48
"give every possible benefit of the doubt" Is it every possible benefit of the doubt? The scenario measures throughput. What if your usecase cares about latency? What about if the string being hashed is attacker-controlled and hence they can inject "bequirtle zorillo" into this latency-sensitive application? I'm not saying this will mean optimising the zero case becomes worthwhile, but it's more of a worst-case scenario than the one presented. – Aug 15 '16 at 12:11

Jon Skeet · Answer 2 · 2010-02-22T19:09:22.183

24

It uses 0 to indicate "I haven't worked out the hashcode yet". The alternative would be to use a separate Boolean flag, which would take more memory. (Or to not cache the hashcode at all, of course.)

I don't expect many strings hash to 0; arguably it would make sense for the hashing routine to deliberately avoid 0 (e.g. translate a hash of 0 to 1, and cache that). That would increase collisions but avoid rehashing. It's too late to do that now though, as the String hashCode algorithm is explicitly documented.

As for whether this is a good idea in general: it's an certainly efficient caching mechanism, and might (see edit) be even better with a change to avoid rehashing values which end up with a hash of 0. Personally I would be interested to see the data which led Sun to believe this was worth doing in the first place - it's taking up an extra 4 bytes for every string ever created, however often or rarely it's hashed, and the only benefit is for strings which are hashed more than once.

EDIT: As KevinB points out in a comment elsewhere, the "avoid 0" suggestion above may well have a net cost because it helps a very rare case, but requires an extra comparison for every hash calculation.

edited Feb 22 '10 at 19:09

answered Feb 22 '10 at 11:22

Jon Skeet

1,421,763
867
9,128
9,194

I just added a best-practice tag and a 4th question to make this more of a design question. Should it be this way? Should a non-zero probability of saving O(n) work every time the method is called (and it will be called plenty since Strings and hashCode() are such fundamental parts of Java) justify additional O(1) storage space? Or is it actually best-practice to generally just cache all but one value rather than having a flag? – polygenelubricants Feb 22 '10 at 11:32
@polygenelubricants: I've added an extra paragraph. I'm not currently convinced it's a good idea - I doubt that strings are rehashed *that* often. – Jon Skeet Feb 22 '10 at 11:40
This is dependent on your application context although I suspect that the additional overhead in recomputing the hash code for Strings who hash to 0 is going to be negligable. – Adamski Feb 22 '10 at 11:42
I put strings into HashSets and HashMaps (as keys) a lot. Even if I'm not putting them directly into hashing collections, it's a common enough type for fields of other objects too. The common algorithm to compute the hash code for a compound type is to formulate it in terms of the hash codes of its relevant fields. So I actually do think that it's worthwhile to cache the hash code for Strings, which is why Java does it in the first place. In fact, most immutable types probably should cache their hash code, which is ideally computed lazily. I would have guessed that this is standard practice. – polygenelubricants Feb 22 '10 at 12:21
It still bothers me that unless you're willing to pay for the additional cost in space (which is a constant per object), one value is practically uncacheable. Maybe you can work out the probabilities and expectations etc and deem this to be acceptable, but then I'm not sure if 0 is the best value to choose to be the "lost" one. The way I understand it, commonly used hash functions tend to evaluate to even values more often than odd values and/or to zero more than any other values. – polygenelubricants Feb 22 '10 at 12:26
If you're putting a string in a hash map, that will hash it once and keep the hash. If you then try to find a key within that map, you'd only benefit if you're looking using the exact same key object. I typically look for *equal* (but non-identical) keys, so it wouldn't help. – Jon Skeet Feb 22 '10 at 12:59
"I don't expect many strings hash to 0". Averaged over all possible Strings, one would expect `1` in `2**32` to hash to zero. – Stephen C Feb 22 '10 at 13:27
1

@Stephen C: That's assuming a perfectly distributed hash. I don't know whether that's the case for the one used by String. – Jon Skeet Feb 22 '10 at 13:59
1

"I don't expect many strings hash to 0". Well, not unless the strings were deliberately chosen. – Tom Hawtin - tackline Feb 22 '10 at 14:33
1

"Well, not unless the strings were deliberately chosen. " Well, "" is probably the most common String in the Java world (who even KNOWS how many Strings are inited to "" and never changed, right?) and "".hashCode() is 0. I can't see many use cases for, say, using "" as a map key, but I'm sure it happens, so this is probably disproportionately expensive. That said, "".hashCode() basically just executes a loop from 0 to 0, so I don't think it'll exactly be slow... and even if it was, who cares (see Kevin's answer) – Cowan Apr 30 '10 at 06:50
You mentioned that "the alternative would be to use a separate Boolean flag, which would take more memory", but I think the robustness and performance of the caching mechanism is more important than the memory issue. I explained more in my answer: http://stackoverflow.com/questions/2310498/why-doesnt-strings-hashcode-cache-0/3742939#3742939 – MB. Sep 18 '10 at 18:35
@Cowan: Were I writing specs for a framework, I would specify that the hashcode method should never return 0 "slowly". In cases where the time required to prevent a hashcode from returning zero would be on the same order as the time to perform computations which might yield zero, having it return zero would be fine. On the other hand, the time required to compute the hashcode of a string could be many thousands of times longer than the time required to replace zero with something else. – supercat Jan 26 '13 at 23:09
@JonSkeet but then why not using a negative number as the flag ? the String hashcode function does not seem to return one. – Sergio Feb 04 '14 at 07:26
1

@Sergio: Yes it does. For example, `"aaaaaa".hashCode()` returns -1425372064. – Jon Skeet Feb 04 '14 at 08:08
@RealSkeptic: Given that you know more about what's changed, perhaps you'd like to suggest an edit? – Jon Skeet Oct 19 '15 at 15:51

score 19 · Answer 3 · answered Sep 18 '10 at 18:30

I think there's something important that the other answers so far are missing: the zero value exists so that the hashCode-caching mechanism works robustly in a multi-threaded environment.

If you had two variables, like cachedHashCode itself and an isHashCodeCalculated boolean to indicate whether cachedHashCode had been calculated, you'd need thread synchronization for things to work in a multithreaded environment. And synchronization would be bad for performance, especially since Strings are very commonly reused in multiple threads.

My understanding of the Java memory model is a little sketchy, but here's roughly what's going on:

When multiple threads access a variable (like the cached hashCode), there's no guarantee that each thread will see the latest value. If a variable starts on zero, then A updates it (sets it to a non-zero value), then thread B reads it shortly afterwards, thread B could still see the zero value.
There's another problem with accessing shared values from multiple threads (without synchronization) - you can end up trying to use an object that's only been partly initialized (constructing an object is not an atomic process). Multi-threaded reads and writes of 64-bit primitives like longs and doubles are not necessarily atomic either, so if two threads try to read and change the value of a long or a double, one thread can end up seeing something weird and partially set. Or something like that anyway. There are similar problems if you try to use two variables together, like cachedHashCode and isHashCodeCalculated - a thread can easily come along and see the latest version of one of those variables, but an older version of another.
The usual way to get around these multi-threading issues is to use synchronization. For example, you could put all access to the cached hashCode inside a synchronized block, or you could use the volatile keyword (although be careful with that because the semantics are a little confusing).
However, synchronization slows things down. Bad idea for something like a string hashCode. Strings are very often used as keys in HashMaps, so you need the hashCode method to perform well, including in multi-threaded environments.
Java primitives that are 32-bits or less, like int, are special. Unlike, say, a long (64-bit value), you can be sure that you will never read a partially initialized value of an int (32 bits). When you read an int without synchronization, you can't be sure that you'll get the latest set value, but you can be sure that the value you do get is a value that has explicitly been set at some point by your thread or another thread.

The hashCode caching mechanism in java.lang.String is set up to rely on point 5 above. You might understand it better by looking at the source of java.lang.String.hashCode(). Basically, with multiple threads calling hashCode at once, hashCode might end up being calculated multiple times (either if the calculated value is zero or if multiple threads call hashCode at once and both see a zero cached value), but you can be sure that hashCode() will always return the same value. So it's robust, and it's performant too (because there's no synchronization to act as a bottleneck in multi-threaded environments).

Like I said, my understanding of the Java memory model is a little sketchy, but I'm pretty sure I've got the gist of the above right. Ultimately it's a very clever idiom for caching the hashCode without the overhead of synchronization.

You wouldn't necessarily need synchronization - as you mention, there are things like volatile. While you do indeed need to be careful with volatile, I think it's safe to say that the authors of the String class are likely to know how to use it properly, or have appropriate people to consult. I do take your point... but I'm still not really convinced that it's worth caching at all, and the memory cost is still present for every string in the system :( — Jon Skeet, Sep 18 '10 at 19:08
As I understand it, volatile is a form of synchronization, just it comes with less overhead than the synchronized keyword. I found this link http://www.cs.umd.edu/~pugh/java/memoryModel/DoubleCheckedLocking.html which, part way through, explains the idiom used in String's hashcode. I rather like it myself - I think I'll start using it more in fact :) Though I do appreciate your point about the memory - that could be an issue for some things. BTW String.intern() is a reason why multi-threaded performance is important for Strings - they can get re-used internally by the JVM. — MB., Sep 18 '10 at 19:28
That may be a good reason for regarding zero as special, but it's not a good reason for having a hash function that returns a non-cacheable value. Would there have been any difficulty having included in the hash function something like: `if (computedHash != 0) return computedHash; else return «some other function»;`? Even if the other function simply took the ASCII values of the first character in the string, plus 991 times that of the last characters in the string, and added 1234567890, that'd not jinx the distribution much. — supercat, Oct 31 '12 at 22:26
`if (computedHash != 0) return computedHash; else return «some other function»;` is effectively what _is_ in the `hashCode` function, just with care applied around what happens if it's called by multiple threads. You can take a look at the source. Multi-threading aside, it just means that, if the computed hash code is zero (which is very unlikely anyway), the hash code will be re-calculated each time the function is called. — MB., Oct 18 '13 at 16:32
[I agree with the first point](https://stackoverflow.com/a/64082954/1059372). @supercat it took quite a while, but they fixed this. — Eugene, Sep 26 '20 at 22:21

Adamski · Answer 4 · 2010-02-22T11:47:23.377

8

0 isn't cached as the implementation interprets a cached value of 0 as "cached value not yet initialised". The alternative would have been to use a java.lang.Integer, whereby null implied that the value was not yet cached. However, this would have meant an additional storage overhead.

Regarding the probability of a String's hash code being computed as 0 I would say the probability is quite low and can happen in the following cases:

The String is empty (although recomputing this hash code each time is effectively O(1)).
An overflow occurs whereby the final computed hash code is 0 (e.g. Integer.MAX_VALUE + h(c1) + h(c2) + ... h(cn) == 0).
The String contains only Unicode character 0. Very unlikely as this is a control character with no meaning apart from in the "paper tape world" (!):

From Wikipedia:

Code 0 (ASCII code name NUL) is a special case. In paper tape, it is the case when there are no holes. It is convenient to treat this as a fill character without meaning otherwise.

edited Feb 22 '10 at 11:47

answered Feb 22 '10 at 11:23

Adamski

54,009
15
113
152

\u0000 is still alive if you interface with native code new File("C:\\CONFIG.SYS\u0000ignored").isFile() == true on my Windows machine. It's a source of all kinds of security problems. For most apps filter this character! – Thomas Jung Feb 22 '10 at 12:41
@Thomas Jung If you must look at a file path, normalise it first (and whitelist characters don't blacklist, of course). Even that isn't going to help you against symlinks. – Tom Hawtin - tackline Feb 22 '10 at 14:25
1

Note, if you have non-NUL characters the string need to be six or seven characters long before it can have a zero hash code. – Tom Hawtin - tackline Feb 22 '10 at 14:26

score 6 · Answer 5 · answered Jan 07 '12 at 08:04

6

This turns out to be a good question, related to a security vulnerability.

"When hashing a string, Java also caches the hash value in the hash attribute, but only if the result is different from zero. Thus, the target value zero is particularly interesting for an attacker as it prevents caching and forces re-hashing."

answered Jan 07 '12 at 08:04

cdunn2001

17,657
8
55
45

[no more](https://stackoverflow.com/a/64082954/1059372) – Eugene Sep 26 '20 at 22:14
@Eugene, Very nice follow-up! Thank you for all those details in [your answer](https://stackoverflow.com/a/64082954/263998). – cdunn2001 Jul 10 '21 at 17:30

score 5 · Answer 6 · answered Sep 26 '20 at 22:11

Ten years after and things have changed. I honestly can't believe this (but the geek in me is ultra-happy).

As you have noted there are chances where some String::hashCode for some Strings is zero and this was not cached (will get to that). A lot of people argued (including in this Q&A) why there was no addition of a field in java.lang.String, something like : hashAlreadyComputed and simply use that. The problem is obvious : extra-space for every single String instance. There is btw a reason java-9 introduced compact Strings, for the simple fact that many benchmarks have shown that this is a rather (over)used class, in the majority of the applications. Adding more space? The decision was : no. Especially since the smallest possible addition would have been 1 byte, not 1 bit (for 32 bit JMVs, the extra space would have been 8 bytes : 1 for the flag, 7 for alignment).

So, Compact Strings came along in java-9, and if you look carefully (or care) they did add a field in java.lang.String : coder. Didn't I just argue against that? It's not that easy. It seems that the importance of compact strings out-weighted the "extra space" argument. It is also important to say that extra space matters for 32 bits VM only (because there was no gap in alignment). In contrast, in jdk-8 the layout of java.lang.String is:

java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                           VALUE
  0    12          (object header)                           N/A
 12     4   char[] String.value                              N/A
 16     4      int String.hash                               N/A
 20     4          (loss due to the next object alignment)
 Instance size: 24 bytes
 Space losses: 0 bytes internal + 4 bytes external = 4 bytes total

Notice an important thing right there:

Space losses : ... 4 bytes total.

Because every java Object is aligned (to how much depends on the JVM and some start-up flags like UseCompressedOops for example), in String there is a gap of 4 bytes, un-used. So when adding coder, it simply took 1 byte without adding additional space. As such, after Compact Strings were added, the layout has changed:

java.lang.String object internals:
 OFFSET  SIZE     TYPE DESCRIPTION                           VALUE
  0    12          (object header)                           N/A
 12     4   byte[] String.value                              N/A
 16     4      int String.hash                               N/A
 20     1     byte String.coder                              N/A
 21     3          (loss due to the next object alignment)
 Instance size: 24 bytes
 Space losses: 0 bytes internal + 3 bytes external = 3 bytes total

coder eats 1 byte and the gap was shrank to 3 bytes. So the "damage" was already made in jdk-9. For 32 bits JVM there was an increase with 8 bytes : 1 coder + 7 gap and for 64 bit JVM - there was no increase, coder occupied some space from the gap.

And now, in jdk-13 they decided to leverage that gap, since it exists anyway. Let me just remind you that the probability to have a String with zero hashCode is 1 in 4 billion; still there are people that say : so what? let's fix this! Voilá: jdk-13 layout of java.lang.String:

java.lang.String object internals:
OFFSET  SIZE      TYPE DESCRIPTION                            VALUE
  0    12           (object header)                           N/A
 12     4    byte[] String.value                              N/A
 16     4       int String.hash                               N/A
 20     1      byte String.coder                              N/A
 21     1   boolean String.hashIsZero                         N/A
 22     2           (loss due to the next object alignment)
 Instance size: 24 bytes
 Space losses: 0 bytes internal + 2 bytes external = 2 bytes total

And here it is : boolean String.hashIsZero. And here it is in the code-base:

public int hashCode() {
    int h = hash;
    if (h == 0 && !hashIsZero) {
        h = isLatin1() ? StringLatin1.hashCode(value)
                       : StringUTF16.hashCode(value);
        if (h == 0) {
            hashIsZero = true;
        } else {
            hash = h;
        }
    }
    return h;
}

Wait! h == 0 and hashIsZero field? Shouldn't that be named something like : hashAlreadyComputed? Why isn't the implementation something along the lines of :

    @Override
    public int hashCode(){
        if(!hashCodeComputed){
            // or any other sane computation
            hash = 42;
            hashCodeComputed = true;
        }

        return hash;
    }

Even if I read the comment under the source code:

    // The hash or hashIsZero fields are subject to a benign data race,
    // making it crucial to ensure that any observable result of the
    // calculation in this method stays correct under any possible read of
    // these fields. Necessary restrictions to allow this to be correct
    // without explicit memory fences or similar concurrency primitives is
    // that we can ever only write to one of these two fields for a given
    // String instance, and that the computation is idempotent and derived
    // from immutable state

It only made sense after I read this. Rather tricky, but this does one write at a time, lots more details in the discussion above.

score 0 · Answer 7 · answered Feb 22 '10 at 14:20

0

Why doesn't String's hashCode() cache 0?

The value zero is reserved as meaning "the hash code is not cached".

What is the probability that a Java string hashes to 0?

According to the Javadoc, the formula for a String's hashcode is:

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string and n is the length of the string. (The hash of the empty String is defined to be zero as a special case.)

My intuition is that the hashcode function as above gives a uniform spread of String hash values across the range of int values. A uniform spread that would mean that the probability of a randomly generated String hashing to zero was 1 in 2^32.

What's the best way to avoid the performance penalty of recomputing the hash value every time for strings that hash to 0?

The best strategy is to ignore the issue. If you are repeatedly hashing the same String value, there is something rather strange about your algorithm.

Is this the best-practice way of caching values? (i.e. cache all except one?)

This is a space versus time trade-off. AFAIK, the alternatives are:

Add a cached flag to each String object, making every Java String take an extra word.
Use the top bit of the hash member as the cached flag. That way you can cache all hash values, but you only have half as many possible String hash values.
Don't cache hashcodes on Strings at all.

I think that the Java designers have made the right call for Strings, and I'm sure that they have done extensive profiling that confirms the soundness of their decision. However, it does not follow that this would always be the best way to deal with caching.

(Note that there are two "common" String values which hash to zero; the empty String, and the String consisting of just a NUL character. However, the cost of calculating the hashcodes for these values is small compared with the cost of calculating the hashcode for a typical String value.)

answered Feb 22 '10 at 14:20

Stephen C

698,415
94
811
1,216

I do not believe 1 in 2^32 is correct: For shorter strings the hash code will be in the range: [0, Integer.MAX_VALUE] and for any strings long enough to cause overflow the hash code will be in the range: [Integer.MIN_VALUE, Integer.MAX_VALUE]. Hence for randomly generated strings (and assuming a uniformly distributed hashing algorithm) the distribution is not completely uniform; there is a higher chance of a positive *or zero* hash code than a negative one. – Adamski Feb 22 '10 at 14:42
The `hashCode` algorithm causes integer overflow pretty quickly, *Adamski*. Taking a few random examples, 6 word characters appears to be enough - but I think your reasoning is sound, this does result in a skew towards positive hash values (which degrades as your Strings get longer) – oxbow_lakes Feb 22 '10 at 14:53
Randomly generated Strings have random lengths as well as random characters. – Stephen C Feb 22 '10 at 14:57
@Stephen: Random lengths is my exact point: For a completely uniform distribution of random length strings containing random characters there will be slightly more strings that hash to >= 0 because of shorter strings *not* causing overflow. – Adamski Feb 22 '10 at 15:06
You neglected the option I listed in my answer: adding an "if (hash == 0) hash = 1;" at the end of the algorithm. That way you don't lose half the normal hash values, just one fewer. – Jon Skeet Feb 22 '10 at 15:45
@Jon Skeet, your idea is motivated by a very common fallacy. A common well-intentioned "optimization" is to rob $1 from each of group A to give $100 to each of group B ... without stopping to realize that there are 10,000 times as many people in group A. It's a net loss. – Kevin Bourrillion Feb 22 '10 at 18:49
@Kevin: Because of the cost of the comparison and assigment? Yes, I guess that is a cost. A better solution would be to find an algorithm which naturally missed out just a few values, and pick one of those. – Jon Skeet Feb 22 '10 at 19:03
@Kevin: In fact I'd say this whole caching business falls foul of the same problem - *every* string is paying the storage penalty for optimising *some* strings being hashed multiple times. It would be really interesting to see what data the JDK engineers had before making this decision. – Jon Skeet Feb 22 '10 at 19:05
@KevinBourrillion: The fact that the string hash function can return a non-cacheable value means that programs which compare many long strings against each other may deterministically run *orders of magnitude* slower with certain inputs than with others. Making the hashcode function take a nanosecond longer to prevent it from having each execution have an independent random one in four billion chance of taking an extra millisecond would probably not be a good trade, but that's not the choice faced here. If a string fails to cache its hash code once, it will fail *every time*. – supercat Oct 31 '12 at 22:35

score 0 · Answer 8 · answered Apr 25 '15 at 09:16

Well folks, it keeps 0 because if it is zero length, it will end up as zero anyways.

And it doesn't take long to figure out that the len is zero and so must the hashcode be.

So, for your code-reviewz! Here it is in all it's Java 8 glory:

 public int hashCode() {
        int h = hash;
        if (h == 0 && value.length > 0) {
            char val[] = value;

            for (int i = 0; i < value.length; i++) {
                h = 31 * h + val[i];
            }
            hash = h;
        }
        return h;
    }

As you can see, this will always return a quick zero if the string is empty:

  if (h == 0 && value.length > 0) ...

score 0 · Answer 9 · answered Oct 28 '16 at 22:56

The "avoid 0" suggestion seems appropriate to recommend as best practice as it helps a genuine problem (seriously unexpected performance degradation in constructible cases that can be attacker supplied) for the meager cost of a branch operation prior to a write. There is some remaining 'unexpected performance degradation' that can be exercised if the only things going into a set hash to the special adjusted value. But this is at worst a 2x degradation rather than unbounded.

Of course, String's implementation can't be changed but there is no need to perpetuate the problem.

Why doesn't String's hashCode() cache 0?

9 Answers9

Linked