Consistency of hashCode() on a Java string

Question

The hashCode value of a Java String is computed as (String.hashCode()):

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Are there any circumstances (say JVM version, vendor, etc.) under which the following expression will evaluate to false?

boolean expression = "This is a Java string".hashCode() == 586653468

Update #1: If you claim that the answer is "yes, there are such circumstances" - then please give a concrete example of when "This is a Java string".hashCode() != 586653468. Try to be as specific/concrete as possible.

Update #2: We all know that relying on the implementation details of hashCode() is bad in general. However, I'm talking specifically about String.hashCode() - so please keep the answer focused to String.hashCode(). Object.hashCode() is totally irrelevant in the context of this question.

Do you actually need this functionality ? Why do you need the precise value ? — Brian Agnew, Apr 24 '09 at 10:01
@Brian: I'm trying to understand the contract of String.hashCode(). — knorv, Apr 24 '09 at 10:02
@Knorv Its not ncessary to understand exactly how it works - its more important to understand the contract and its ulterior meaning. — mP., Apr 24 '09 at 10:09
@mP: Thanks for your input, but I guess that's up to me to decide. — knorv, Apr 24 '09 at 10:45
why did they give the first character the largest power? when you want to optimize it for speed in order to preserve extra calculations, you would store the power of the previous one, yet the previous one would be from the last character to the first one. this mean there would also be cache misses. isn't it more efficient to have an algorithm of: s[0] + s[1]*31 + s[2]*31^2 + ... + s[n-1]*31^[n-1] ? — android developer, Nov 18 '13 at 23:21

Jon Skeet · Accepted Answer · 2013-06-18T11:30:53.830

107

I can see that documentation as far back as Java 1.2.

While it's true that in general you shouldn't rely on a hash code implementation remaining the same, it's now documented behaviour for java.lang.String, so changing it would count as breaking existing contracts.

Wherever possible, you shouldn't rely on hash codes staying the same across versions etc - but in my mind java.lang.String is a special case simply because the algorithm has been specified... so long as you're willing to abandon compatibility with releases before the algorithm was specified, of course.

edited Jun 18 '13 at 11:30

answered Apr 24 '09 at 09:25

Jon Skeet

1,421,763
867
9,128
9,194

9

The documented behaviour of String has been specified since Java 1.2 In v1.1 of the API, the hash code computation is not specified for the String class. – Martin OConnor Apr 24 '09 at 11:34
In this case we better write our own hashing codes 'ight matey? – Felype Jun 18 '13 at 10:39
@Felype: I really don't know what you're trying to say here, I'm afraid. – Jon Skeet Jun 18 '13 at 11:30
@JonSkeet I mean, in this case we may perhaps write our own code to generate our own hash, to grant portability. Is it? – Felype Jun 19 '13 at 16:16
@Felype: It's not at all clear what kind of portability you're talking about, nor indeed what you mean by "in this case" - in what specific scenario? I suspect you should ask a new question. – Jon Skeet Jun 19 '13 at 16:20
Hard to believe java specifies a fixed algorithm for String#hash when it can lead to security concerns like this: http://stackoverflow.com/questions/8669946/application-vulnerability-due-to-non-random-hash-functions but I guess they won't change it now, to respect backward compatibility... – rogerdpack Nov 06 '13 at 20:02
1

most java code simply don't run in pre-1.4. it is safe to assume it is consistent. – J-16 SDiZ Nov 27 '14 at 10:50

score 20 · Answer 2 · answered Apr 24 '09 at 12:17

I found something about JDK 1.0 and 1.1 and >= 1.2:

In JDK 1.0.x and 1.1.x the hashCode function for long Strings worked by sampling every nth character. This pretty well guaranteed you would have many Strings hashing to the same value, thus slowing down Hashtable lookup. In JDK 1.2 the function has been improved to multiply the result so far by 31 then add the next character in sequence. This is a little slower, but is much better at avoiding collisions. Source: http://mindprod.com/jgloss/hashcode.html

Something different, because you seem to need a number: How about using CRC32 or MD5 instead of hashcode and you are good to go - no discussions and no worries at all...

Martin OConnor · Answer 3 · 2009-04-24T11:10:34.257

8

You should not rely on a hash code being equal to a specific value. Just that it will return consistent results within the same execution. The API docs say the following :

The general contract of hashCode is:

Whenever it is invoked on the same object more than once during an execution of a Java application, the hashCode method must consistently return the same integer, provided no information used in equals comparisons on the object is modified. This integer need not remain consistent from one execution of an application to another execution of the same application.

EDIT Since the javadoc for String.hashCode() specifies how a String's hash code is computed, any violation of this would violate the public API specification.

edited Apr 24 '09 at 11:10

answered Apr 24 '09 at 09:16

Martin OConnor

3,583
4
25
32

1

Your answer is valid, but does not address the specific question asked. – knorv Apr 24 '09 at 09:18
7

That's the *general* hash code contract - but the *specific* contract for String gives details of the algorithm, and effectively overrides this general contract IMO. – Jon Skeet Apr 24 '09 at 09:33

score 6 · Answer 4 · edited May 16 '23 at 05:20

6

As said above, in general you should not rely on the hash code of a class remaining the same. Note that even subsequent runs of the same application on the same VM may produce different hash values. AFAIK the Sun JVM's hash function calculates the same hash on every run, but that's not guaranteed.

Note that this is not theoretical. The hash function for java.lang.String was changed in JDK1.2 (the old hash had problems with hierarchical strings like URLs or file names, as it tended to produce the same hash for strings which only differed at the end).

java.lang.String is a special case, as the algorithm of its hashCode() is (now) documented, so you can probably rely on that. I'd still consider it bad practice. If you need a hash algorithm with special, documented properties, just write one :-).

edited May 16 '23 at 05:20

M. Justin

14,487
7
91
130

answered Apr 24 '09 at 09:23

sleske

81,358
34
189
227

5

But was the algorithm specified in the docs before JDK 1.2? If not, it's a different situation. The algorithm is now laid down in the docs, so changing it would be a breaking change to a public contract. – Jon Skeet Apr 24 '09 at 09:27
(I remember it as 1.1.) The original (poorer) algorithm was documented. Incorrectly. The documented algorithm actually threw an ArrayIndexOutOfBoundsException. – Tom Hawtin - tackline Apr 24 '09 at 10:00
@Jon Skeet: Ah, didn't know that the algorithm of String.hashCode() is documented. Of course that changes things. Updated my comment. – sleske Apr 27 '09 at 13:49

Brian Agnew · Answer 5 · 2009-04-24T09:24:34.167

3

Another (!) issue to worry about is the possible change of implementation between early/late versions of Java. I don't believe the implementation details are set in stone, and so potentially an upgrade to a future Java version could cause problems.

Bottom line is, I wouldn't rely on the implementation of hashCode().

Perhaps you can highlight what problem you're actually trying to solve by using this mechanism, and that will highlight a more suitable approach.

edited Apr 24 '09 at 09:24

answered Apr 24 '09 at 09:16

Brian Agnew

268,207
37
334
440

1

Thanks for your answer. Can you give a concrete examples of when "This is a Java string".hashCode() != 586653468? – knorv Apr 24 '09 at 09:20
1

No. Sorry. My point is that everything you test on may work the way you want. But that's still no guarantee. So if youre' working on a (say) short term project where you have control of the VM etc., then the above may work for you. But you can't rely on it in the wider world. – Brian Agnew Apr 24 '09 at 09:23
2

"an upgrade to a future Java version could cause problems". An upgrade to a future Java version could remove the hashCode method entirely. Or make it always return 0 for strings. That's incompatible changes for ya. The question is whether Sun^HOracle^HThe JCP would consider it a breaking change and therefore worth avoiding. Since the algorithm is in the contract, one hopes they would. – Steve Jessop Apr 24 '09 at 10:32
@SteveJessop well, since `switch` statements over strings compile to code relying on a particular fixed hash code, changes to `String`’s hash code algorithm would definitely break existing code… – Holger Aug 03 '18 at 17:41

score 3 · Answer 6 · edited Mar 01 '15 at 02:08

3

Just to answer your question and not to continue any discussions. The Apache Harmony JDK implementation seems to use a different algorithm, at least it looks totally different:

Sun JDK

public int hashCode() {
    int h = hash;
    if (h == 0) {
        int off = offset;
        char val[] = value;
        int len = count;

        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }
        hash = h;
    }
    return h;
}

Apache Harmony

public int hashCode() {
    if (hashCode == 0) {
        int hash = 0, multiplier = 1;
        for (int i = offset + count - 1; i >= offset; i--) {
            hash += value[i] * multiplier;
            int shifted = multiplier << 5;
            multiplier = shifted - multiplier;
        }
        hashCode = hash;
    }
    return hashCode;
}

Feel free to check it yourself...

edited Mar 01 '15 at 02:08

Cosmo Harrigan

895
1
8
22

answered Apr 24 '09 at 11:21

ReneS

3,535
2
26
35

23

I think they're just being cool and optimizing it. :) "(multiplier << 5) - multiplier" is just 31 * multiplier, after all ... – unwind Apr 24 '09 at 11:32
Ok, was too lazy to check that. Thanks! – ReneS Apr 24 '09 at 12:14
1

But to make it clear from my side... Never rely on the hashcode because the hashcode is something internal. – ReneS Apr 24 '09 at 12:16
1

what are the variables of "offset", "count" and "hashCode" mean? i suppose "hashcode" is used as a cached value, to avoid future calculations, and that "count" is the number of characters, but what is the "offset" ? suppose I wish to use this code so that it will be consistent, given a string, what should i do to it? – android developer Nov 18 '13 at 23:07
@androiddeveloper See here: http://stackoverflow.com/questions/38546623/what-are-variables-offset-hash-in-string-hashcode – Kartik Chugh Dec 13 '16 at 13:52
@KartikChughヅ Interesting. Thanks. So what should I do to get a consistent behavior? – android developer Dec 14 '16 at 07:42
@androiddeveloper What do you mean? You don't need to use this code, they are both the same, and are implemented in String.hashCode(). Since the docs have now specified the implementation, you're pretty much guaranteed to get consistent results for a long time. – Kartik Chugh Dec 14 '16 at 22:24
@KartikChughヅ If I use the built in API of Android for hashCode, will it return the same result for each Android version? – android developer Dec 15 '16 at 06:22
1

@androiddeveloper Now THAT'S an interesting question -- although I should have guessed it, based on your username. From the [Android docs](https://developer.android.com/reference/java/lang/String.html#hashCode()) it looks like the contract is the same: `s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]` Unless I am mistaken, this is because Android uses Sun's implementation of the String object with no changes. – Kartik Chugh Dec 16 '16 at 14:04
The harmony one processes the characters in reverse order, are the results the same for "forward" and "drawrof" ??? that would mean net and ten and stop and pots etc would have the same hash codes ? – peterk Oct 19 '17 at 03:04
@peterk, no the hash codes for `"net"` and `"ten"` are not the same. These are just different algorithms. The algorithm of Harmony is designed to iterate backwards and produce the same value as Sun’s algorithm when iterating forward. Sun’s algorithm uses a constant multiplier, but multiplies the hashcode of all previous characters with it, whereas Harmony’s algorithms uses a changing multiplier for the current character, but just sums the product to the hashcode of subsequent characters. – Holger Aug 03 '18 at 17:39

score 2 · Answer 7 · answered Apr 24 '09 at 15:27

2

If you're worried about changes and possibly incompatibly VMs, just copy the existing hashcode implementation into your own utility class, and use that to generate your hashcodes .

answered Apr 24 '09 at 15:27

Sam Barnum

10,559
3
54
60

I was going to say this. While the other answers do answer the question, writing a separate hashCode function is probably the appropriate solution to knorv's problem. – Nick Apr 06 '11 at 09:38

score 1 · Answer 8 · edited Jul 03 '19 at 06:12

The hashcode will be calculated based on the ASCII values of the characters in the String.

This is the implementation in the String Class is as follows

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        hash = h = isLatin1() ? StringLatin1.hashCode(value)
                              : StringUTF16.hashCode(value);
    }
    return h;
}

Collisions in hashcode are unavoidable. For example, the strings "Ea" and "FB" give the same hashcode as 2236

Consistency of hashCode() on a Java string

8 Answers8

Linked

Related