7

I have asked a similar question about the string.GetHashCode() method in .NET. From that, I learned that we cannot rely on the implicit hash code implementation of the built-in types if we are to use it across different machines. Therefore, I am assuming that the Java implementation of String.hashCode() is also unstable across different hardware configurations and may behave differently across VMs (don't forget the different VM implementations).

We are currently discussing a way to safely transform a string into a number in Java by hashing. The hash algorithm must be stable across the different nodes of a cluster and fast to evaluate, since it will be used at high frequency. My teammates are insisting on the native hashCode method, and I'll need some reasonable arguments to make them consider another approach. Currently, I can think only of the differences between machine configurations (x86 and x64), possibly different JVM vendors on some of the machines (hardly applicable in our case), and byte-order differences depending on the machine the algorithm is run on. Of course, character encoding probably has to be considered as well.

While all these things come to mind, I am not 100% sure that any of them is a strong enough reason, and I'd appreciate your expertise and experience in this area. It will help me build stronger arguments in favor of writing a custom hashing algorithm. I'd also appreciate advice on what not to do when implementing it.

Ivaylo Slavov
  • String hashCode is well defined and the same on any Java platform. – ZhongYu Mar 28 '13 at 22:51
  • http://stackoverflow.com/questions/785091/consistency-of-hashcode-on-a-java-string – zch Mar 28 '13 at 22:52
  • @zhong.j.yu you're assuming [JRockit](http://www.oracle.com/technetwork/middleware/jrockit/overview/index.html) and [IBM JVM](http://publib.boulder.ibm.com/infocenter/java7sdk/v7r0/index.jsp?topic=%2Fcom.ibm.java.lnx.70.doc%2Fuser%2Fjava_jvm.html) have the same implementation for `String#hashCode`. – Luiggi Mendoza Mar 28 '13 at 22:53
  • @zhong.j.yu, according to the source code of the `String` class, it looks stable enough. But as turned out to be the case in .NET, there are reasons not to rely on that. Having in mind the virtual machines, the hardware differences and the byte-order or encoding varieties, the Java case seems very similar to me. – Ivaylo Slavov Mar 28 '13 at 22:54
  • @LuiggiMendoza yes, all JVMs must return the same hash code. – ZhongYu Mar 28 '13 at 22:57
  • @zhong.j.yu they should, but there's always some vendor that doesn't follow the rules at all. – Luiggi Mendoza Mar 28 '13 at 22:58

2 Answers

14

The implementation of String.hashCode() is specified in the documentation, so it's guaranteed to be consistent:

The hash code for a String object is computed as

  s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the ith character of the string, n is the length of the string, and ^ indicates exponentiation. (The hash value of the empty string is zero.)

All of those operations are implemented platform-independently for Java -- the platform byte order is irrelevant, for example.
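As a quick illustration (my own sketch, not part of the quoted documentation), the polynomial can be evaluated by hand for a short string and compared with what the JVM returns:

public class HashFormulaCheck {
    public static void main(String[] args) {
        String s = "abc";
        // Documented formula: s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1],
        // evaluated with plain int arithmetic (overflow simply wraps around).
        int byFormula = 'a' * 31 * 31 + 'b' * 31 + 'c';  // 97*961 + 98*31 + 99 = 96354
        System.out.println(byFormula);     // 96354
        System.out.println(s.hashCode());  // 96354 on any conforming JVM
    }
}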

That said, ways of getting a String can be tricky, if you're getting it from a file or another source of bytes. In that case, you're fine so long as you explicitly specify a Charset. (Remember that Strings don't have different encodings per se; an encoding is a specification for conversions between a byte[] and a String.)
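For example (my own sketch, assuming Java 7+ for java.nio.charset.StandardCharsets), decoding the same bytes with two different charsets yields two different Strings, and therefore two different hash codes, so the Charset should always be specified explicitly:

import java.nio.charset.StandardCharsets;

public class CharsetMatters {
    public static void main(String[] args) {
        byte[] bytes = {(byte) 0xC3, (byte) 0xA9};  // "é" encoded as UTF-8

        String utf8   = new String(bytes, StandardCharsets.UTF_8);      // "é"
        String latin1 = new String(bytes, StandardCharsets.ISO_8859_1); // "Ã©"

        // Same bytes, different Strings -- and different hash codes.
        System.out.println(utf8.hashCode());
        System.out.println(latin1.hashCode());

        // new String(bytes) without a Charset uses the platform default,
        // which can differ from one cluster node to another.
    }
}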

Louis Wasserman
3

You can look at the source code, also shown below. From what I can see (after all of 10 seconds of analysis) this should be stable across machines and architectures. And Louis confirms this by citing the spec, which is even better if you believe specs. :-)

However, this could vary if a different JRE chooses to implement it differently, violating the spec.

public int hashCode() {
    int h = hash;            // cached value; 0 means "not computed yet"
    if (h == 0) {
        int off = offset;    // this JDK version shares a backing char[] between strings
        char val[] = value;
        int len = count;

        // h = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1], in int arithmetic
        for (int i = 0; i < len; i++) {
            h = 31*h + val[off++];
        }

        hash = h;            // cache for subsequent calls
    }

    return h;
}
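
If that worry matters in practice, one option (my own sketch, not part of the original answer) is a start-up check on every node that compares String.hashCode() against the documented formula for a handful of strings, including a long one, and fails fast on a non-conforming JVM:

public class HashCodeSanityCheck {

    // Reference implementation of the documented algorithm.
    static int referenceHash(String s) {
        int h = 0;
        for (int i = 0; i < s.length(); i++) {
            h = 31 * h + s.charAt(i);
        }
        return h;
    }

    public static void main(String[] args) {
        // Include a long string in case a JVM only hashes a prefix.
        StringBuilder longOne = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            longOne.append((char) ('a' + (i % 26)));
        }

        String[] samples = {"", "a", "abc", "cluster-key-42", longOne.toString()};
        for (String s : samples) {
            if (s.hashCode() != referenceHash(s)) {
                throw new IllegalStateException("Non-conforming String.hashCode() on this JVM");
            }
        }
        System.out.println("String.hashCode() matches the spec for all samples");
    }
}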
user949300
  • Thank you for your answer. I have looked at the source code myself and did not find anything that could be an issue. Still, something tells me this is not the only place where things can go wrong. Hopefully, different JVMs (different vendors) in the same cluster will not be a case for us. – Ivaylo Slavov Mar 28 '13 at 23:12
  • I would think that if a vendor is breaking the spec you could run a bunch of known Strings and compare to the official results. Be sure to run some _long_ ones. Back in Java's early days the hashCode method only considered the first 16 (maybe 32?) characters. I could see a vendor trying to win a benchmark by doing something similar. – user949300 Mar 28 '13 at 23:25
  • Good advice, thanks for sharing it. I believe for the current matter we will stick with Oracle's JVM, although that knowledge might prove quite useful one day. Thinking about it, such a "performance gain" may cost a lot of undesired and unpredictable behavior. I wonder if a JVM vendor out there could fall into that category. – Ivaylo Slavov Mar 28 '13 at 23:29