10

The JDK documentation for java.lang.String.hashCode() famously says:

The hash code for a String object is computed as

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

using int arithmetic, where s[i] is the *i*th character of the string, n is the length of the string, and ^ indicates exponentiation.

The standard implementation of this expression is:

int hash = 0;
for (int i = 0; i < length; i++)
{
    hash = 31*hash + value[i];
}
return hash;

Looking at this makes me feel like I was sleeping through my algorithms course. How does that mathematical expression translate into the code above?

Community
  • 1
  • 1
David Citron
  • 43,219
  • 21
  • 62
  • 72

5 Answers5

25

unroll the loop. Then you get:

int hash = 0;

hash = 31*hash + value[0];
hash = 31*hash + value[1];
hash = 31*hash + value[2];
hash = 31*hash + value[3];
...
return hash;

Now you can do some mathematical manipulation, plug in 0 for the initial hash value:

hash = 31*(31*(31*(31*0 + value[0]) + value[1]) + value[2]) + value[3])...

Simplify it some more:

hash = 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + 31^0*value[3]...

And that is essentially the original algorithm given.

CookieOfFortune
  • 13,836
  • 8
  • 42
  • 58
  • You may want to explain it in terms of static single assignment (SSA) form, which then removes the need to think about what value "hash" has at any given point in time. :-) – C. K. Young May 04 '09 at 22:30
  • Looks like the original algorithm says it should be: 31^3*value[0] + 31^2*value[1] + 31^1*value[2] + ... Or is it just my fried brain misfiring? – Adnan May 04 '09 at 23:14
13

I'm not sure if you missed where it says "^ indicates exponentiation" (not xor) in that documentation.

Each time through the loop, the previous value of hash is multipled by 31 again before being added to the next element of value.

One could prove these things are equal by induction, but I think an example might be more clear:

Say we're dealing with a 4-char string. Let's unroll the loop:

hash = 0;
hash = 31 * hash + value[0];
hash = 31 * hash + value[1];
hash = 31 * hash + value[2];
hash = 31 * hash + value[3];

Now combine these into one statement by substituting each value of hash into the following statement:

hash = 31 * (31 * (31 * (31 * 0 + value[0]) + value[1]) + value[2])
     + value[3];

31 * 0 is 0, so simplify:

hash = 31 * (31 * (31 * value[0] + value[1]) + value[2])
     + value[3];

Now multiply the two inner terms by that second 31:

hash = 31 * (31 * 31 * value[0] + 31 * value[1] + value[2])
     + value[3];

Now multiply the three inner terms by that first 31:

hash = 31 * 31 * 31 * value[0] + 31 * 31 * value[1] + 31 * value[2]
     + value[3];

and convert to exponents (not really Java anymore):

hash = 31^3 * value[0] + 31^2 * value[1] + 31^1 * value[2] + value[3];
Laurence Gonsalves
  • 137,896
  • 35
  • 246
  • 299
  • RE your first sentence: Did you see some evidence that the question or a particular answer was assuming xor? – David Citron May 05 '09 at 14:13
  • You'd expressed confusion about how the code and the documentation could be equivalent. Since the documentation was using "^" for exponentiation, but Java normally uses it to mean bitwise xor I wondered if that was the source of your confusion. (There were no other answers when I started writing my answer, BTW) – Laurence Gonsalves May 05 '09 at 20:12
  • Ahh, I see. No, I was aware that it was exponentiation, but unclear on how the implementation followed from the mathematical expression. Your answer clarifies that greatly--but knowing to write that code given only that expression is still a leap for me. To arrive at that code, it would seem that you'd have to write out a small example, realize that you can "multiply by 0 in a clever way" in the innermost nesting to complete the pattern, then form the loop. – David Citron May 05 '09 at 23:06
  • 3
    It would not surprise me at all if the code actually came first, and the documentation was written afterwards. – Laurence Gonsalves May 05 '09 at 23:32
10

Proof by induction:

T1(s) = 0 if |s| == 0, else s[|s|-1] + 31*T(s[0..|s|-1])
T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
P(n) = for all strings s s.t. |s| = n, T1(s) = T2(s)

Let s be an arbitrary string, and n=|s|
Base case: n = 0
    0 (additive identity, T2(s)) = 0 (T1(s))
    P(0)
Suppose n > 0
    T1(s) = s[n-1] + 31*T1(s[0:n-1])
    T2(s) = s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1] = s[n-1] + 31*(s[0]*31^(n-2) + s[1]*31^(n-3) + ... + s[n-2]) = s[n-1] + 31*T2(s[0:n-1])
    By the induction hypothesis, (P(n-1)), T1(s[0:n-1]) = T2(s[0:n-1]) so
        s[n-1] + 31*T1(s[0..n-1]) = s[n-1] + T2(s[0:n-1])
    P(n)

I think I have it, and a proof was requested.

Devin Jeanpierre
  • 92,913
  • 4
  • 55
  • 79
9

Take a look at the first few iterations and you'll see the pattern start to emerge:

hash0 = 0 + s0 = s0
hash1 = 31(hash0) + s1 = 31(s0) + s1
hash2 = 31(hash1) + s2 = 31(31(s0) + s1) + s2 = 312(s0) + 31(s1) + s2
...
Bobby Eickhoff
  • 2,585
  • 2
  • 23
  • 22
  • 1
    <3 Thanks for (more or less) writing out CookieOfFortune's answer in SSA form. Much appreciated! – C. K. Young May 04 '09 at 22:34
  • Would be even better if you could vertically align all the corresponding terms, and distribute the 31(...) in the third line. – Nikhil May 04 '09 at 22:58
  • @CookieOfFortune: There's an HTML tag for it. Look at the page source. I'd have used Unicode, though. – Nikhil May 04 '09 at 23:00
0

Isn't it useless at all to count the hashcode of the String out of all characters? Imagine filenames or classnames with their full path put into HashSet. Or someone who uses HashSets of String documents instead of Lists because "HashSet always beats Lists".

I would do something like:

int off = offset;
char val[] = value;
int len = count;

int step = len <= 10 ? 1 : len / 10;

for (int i = 0; i < len; i+=step) {
   h = 31*h + val[off+i];
}
hash = h

At the end hashcode is nothing more than a hint.

Community
  • 1
  • 1
David
  • 1
  • Ignoring half the characters in the string would mean that storing a sequence of "counting strings" into a hash table could easily cause 100 strings to map to each hash value. Ignoring more than half the characters would make things even worse. Ignoring any aspect of the string for hashing purposes risks a really huge penalty in exchange for a pretty small payoff. – supercat Jul 22 '13 at 20:20
  • That's essentially what the early designers of java though. Initially the string hash function took only a sample of characters when the string was longer than 15 characters. Eventually it had to be fixed because it turned out to yield very bad hash performance with certain strings (e.g. with set of URLs which often look similar): http://bugs.java.com/bugdatabase/view_bug.do?bug_id=4045622. The performance gains for not using the whole string can not offset the much worse hash performance. – David Ongaro May 18 '16 at 19:18
  • To clarify: the second type of performance if referring to the "hash table" performance, not to the raw speed of computing the hash. – David Ongaro May 18 '16 at 19:24