3

Only recently, I discovered that an empty String has hash code zero. This is surprising to me because null is normally assigned hash code zero, e.g., Objects.hashCode(Object) and ArrayList.hashCode().

Here is the JDK 11 source code for String.hashCode():

/** Cache the hash code for the string */
private int hash; // Default to 0

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}

Idea: An empty String could have hash code one because that would match Arrays.hashCode(Object[]) for empty arrays. Alternatively, any other hard-coded, non-zero value could be used, similar to serialVersionUID. The purpose would be to distinguish from null. If this idea is flawed (except for backwards compatibility concerns), please kindly explain why.

I found other questions/answers that approach the issue... but none answer exactly:

kevinarpe
  • 20,319
  • 26
  • 127
  • 154
  • There is not even any backwards compatibility concern, because `hashCode` is explicitly allowed to change between runs of the same application. This is to prevent hash collision attacks. – Thomas Dec 13 '20 at 12:55
  • 5
    @Thomas However, the behavior for `String#hashCode()` [is specified](https://docs.oracle.com/javase/8/docs/api/java/lang/String.html#hashCode--). – chrylis -cautiouslyoptimistic- Dec 13 '20 at 12:58
  • If you're asking why this was chosen you'd have to ask them. If you're asking if anyone relies on this behavior you'd have to check aske the Java code ever written (it seems unlikely since all having a zero hash means is that the string has a zero hash). – Dave Newton Dec 13 '20 at 13:01
  • hashCode can't return null, because an int can't be null. – NomadMaker Dec 13 '20 at 13:08
  • *"I wonder if anyone actually relies upon this behaviour?"* - That's not answerable. Not least because some people *might* depend on it without realizing. – Stephen C Dec 13 '20 at 13:18
  • Let's just say no one should _knowingly_ rely on it – Kevin Anderson Dec 13 '20 at 13:19
  • There is a large number of strings with hash code equal to zero. There must be, because there are more strings than hash codes. – a guest Dec 13 '20 at 13:57

2 Answers2

7

Why does an empty Java String have hash code zero?

The short answer is because that it was how it was specified way back in Java 1.2. (And the Java 1.2 spec probably matched the implementation in earlier Java versions.)

I can't think of a strong technical reason why the String.hashcode("") should be zero.

However, I disagree with your argument that String.hashCode("") should be non-zero because Objects.hashCode(null) is zero.

  1. The Objects class was added in Java 7. Likewise the Arrays.hashCode methods were added in Java 1.5. So if anything, it is Objects and Arrays that are incorrect here.

  2. There is no expectation in the hashCode() definition that any particular different pair of values should be different. At best changing the hashCode value for "" would be a small optimization. Note that String.equals(null) is handled efficiently via an instanceof test.

  3. It is unusual for a hash table to have both null and "" as keys in the same table. Indeed, I would go so far as to say that it is quite likely indicates a design or implementation flaw that you need to have entries for both null and "".

  4. It could be argued that null should not be supported as a Map key at all. I know that null can be used as a key in a HashMap or LinkedHashMap, or as a value of a HashSet. But it is not the case for ConcurrentHashMap or HashTable or TreeMap or TreeSet. Indeed, I have heard from sources who should know that:

    • the Java designers responsible for the collection types think it was a mistake to support null keys, and

    • that is one reason why ConcurrentHashMap doesn't support this.

Given that the use of null keys in an application is (arguably) misguided, a breaking optimization that offers a small improvement for null keys is equally misguided.

It could be argued that not much code actually depends on the specified details of the String.hashCode algorithm. But the problem is, neither we or the Java designers have a good way of quantifying how many old applications would actually break1.

But breaking as little as 0.001% of existing Java applications is still a lot of applications, and a lot of annoyed Oracle customers. That is sufficient to make your idea a non-starter ... for Java.


1 - The argument that it would be the application programmers fault because relying on the hashcode values is somehow "back practice" doesn't wash with me. The fact that the algorithm is specified in this case (for whatever reason) means that programmers should be able to rely on it.

Stephen C
  • 698,415
  • 94
  • 811
  • 1,216
  • 1
    Just a note: changing the way `String.hashCode()` is computed would break everything that does a switch statement over a string expression. – Thomas Kläger Dec 13 '20 at 16:42
  • Very helpful. I also looked at standard collections: empty `ArrayList` has hash code one, but empty `HashSet` and `HashMap` have hash code zero. Maybe I am overthinking all of this... – kevinarpe Dec 14 '20 at 06:21
  • @ThomasKläger Can you share more? – kevinarpe Dec 14 '20 at 06:22
  • 1
    @kevinarpe the code that [compiles a switch on string](http://hg.openjdk.java.net/jdk8/jdk8/langtools/file/1ff9d5118aae/src/share/classes/com/sun/tools/javac/comp/Lower.java#l3629) explains the algorithm: "[..] two chained switch statements: the first a synthesized statement switching on the argument string's hash value [..]" – Thomas Kläger Dec 14 '20 at 07:05
3

The hashCode() original purpose is This method is supported for the benefit of hash tables such as those provided by HashMap.

So this means that the actual value of the hashCode does not carry any meaning other than probable similarity, as the doc states : It is not required that if two objects are unequal according to the equals(java.lang.Object) method, then calling the hashCode method on each of the two objects must produce distinct integer results.

About the arbitrary value of 0 for null and empty strings, the way the string hashCode is computed leads to other possible 0 hashCode even for non-empty strings.

System.out.println("".hashCode());             // >> 0
System.out.println("\0".hashCode());           // >> 0
System.out.println("\u0000".hashCode());       // >> 0
System.out.println("\u0000\u0000".hashCode()); // >> 0
System.out.println("\0\0\0".hashCode());       // >> 0

So the 0 value of the empty string makes sense in that the computation is

int h = 0;
for (byte v : value) {
    h = 31 * h + (v & 0xff);
}
return h;

So even of the shortcut if (h == 0 && value.length > 0) was not used, it would still lead to 0, this is just an optimization path.

In a way, one could argue that the null hashCode should not be 0 but maybe something like -1. But since the hashCode does not and should not carry any meaning, it should not matter anyway.

Simon
  • 2,353
  • 1
  • 13
  • 28
  • This answer is flawed in thinking that the value of a hashcode is insignificant; it is not. In an ideal world, every possible hashcode should be evenly distributed across every possible value. Having an empty string, and null, and strings with nullbytes having the same hashcode is concerning, as it can predictably and consistently lead to HashMaps and other hash-based structures slowing down. – Aplet123 Dec 13 '20 at 13:07
  • 1
    Please modify : The official Java Doc is flawed.My answer reflects that. I am not the author of Java. Although I agree with the fundment of your comment. – Simon Dec 13 '20 at 13:09
  • 1
    @Aplet123 He didn't say "hashcode is insignificant"; he is saying that "hashCode does not and should not carry any meaning" and this is true. That hash functions should distribute evenly and be perfect in an ideal world is another topic. The question here is if `0` is a good value for `""` and the answer is: it doesn't matter. – akuzminykh Dec 13 '20 at 13:11
  • This is interesting. It is even stranger now! If you look at the source code for `Arrays.hashCode()`, *any* non-`null` array has hash code *at least* one. Even if the whole array is filled with zero or `null`, the hash code *start* from one. (Yes, I know you can construct non-empty containers that give hash code zero. See: https://stackoverflow.com/questions/18746394/can-a-non-empty-string-have-a-hashcode-of-zero) – kevinarpe Dec 13 '20 at 13:14
  • Are you assuming that "starts from 1" implies "never less than 1"? Not true if overflow. – a guest Dec 13 '20 at 14:29