7

Possible Duplicate:
Why does Java's hashCode() in String use 31 as a multiplier?
Why use a prime number in hashCode?

From Effective Java Item 9: Always override hashCode when you override equals consider the following relevant code snippet that overrides the hashcode() defined in Object class.

public final class PhoneNumber {
  private final short areaCode;
  private final short prefix;
  private final short lineNumber;
  ......//Rest ignored on purpose
  .....

private volatile int hashCode; // (See Item 71)
@Override public int hashCode() {
   int result = hashCode;
   if (result == 0) {
    result = 17;
    result = 31 * result + areaCode;
    result = 31 * result + prefix;
    result = 31 * result + lineNumber;
    hashCode = result;
   }
 return result;
 }
}

I understand why a non zero initial value "17" is chosen . I also understand the that multiplication is done in each step so that the order of fields play an important role in computing hashcode(). But I do not understand the reason for choosing 31 as the value for multiplication . Can some one explain to me why ? This is what Bloch has to say about 31 but I do not really get it . I specifically fail to understand the line in italics below.

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional.

Community
  • 1
  • 1
Geek
  • 26,489
  • 43
  • 149
  • 227

3 Answers3

11

Shifting left just introduces a zero on the right and loses a bit on the left of the number's binary representation, so it's a clear information loss. Repeating this process gradually loses all information that was accumulated from earlier computation. That means that the more fields enter your hashcode calculation, the less effect on the final result the early fields have.

Marko Topolnik
  • 195,646
  • 29
  • 319
  • 436
  • Yep, prime and irrational numbers are quite useful in hashing. – Alexey Frunze Jul 18 '12 at 08:21
  • @Marko very nice explanation > I wish i could give you more than +1. Thanks . So if we lose bits while doing multiplication why is it at all useful ? – Geek Jul 18 '12 at 08:44
  • 1
    The objective of `hashCode` is basically randomizing it across all unequal instances. Overflows are no concern here. Multiplication is there to avoid collisions between two fields having commuted values. For example, if you have `int x, y`, you don't want the same hashCode when you swap x and y. Multiplication by an odd prime ensures that. – Marko Topolnik Jul 18 '12 at 08:49
  • 1
    @MarkoTopolnik yes I agree with the purpose of multiplication, otherwise all anagrams of a String would have same hashcode() . – Geek Jul 18 '12 at 09:20
3

he reason for using a prime is that it is more likely to produce a random pattern. If you use 9 for example you can get over lap with multiples of 3.

AFAIK 31 is used for Strings as there is less than 31 letters in the alphabet meaning all words of up to 6 letters have a unique hash code. If you use 61 (prime less than 64) for example, up to 5 letters would produce unique codes and if you use 13 (prime less than 16) you can get collisions with two letters words.

Peter Lawrey
  • 525,659
  • 79
  • 751
  • 1,130
  • I'm confused for *31* **all words of up to 6 letters have a unique hash code** and for *61* **up to 5 letters would produce unique codes** Why is the letter count smaller for the larger prime? That sounds backwards. Is it because *31* is being run through a dictionary first rather than just random combinations? If so sounds kinda like comparing apples to oranges in your answer. – Raystorm Sep 18 '14 at 19:49
  • 1
    @Raystorm The number of values you can store in 32-bits is limited so if you use 31 values or ~5 bits per letter, you can have 32/5 = 6 letters, and if you use 61 values or ~6 bits per letter you can have 32/6 = 5 letters. If you have more letters than this, the number overflows and there will be sequences of letters which have the same hashcode. – Peter Lawrey Sep 22 '14 at 17:58
  • 2
    @Peter. That was my thought as well. However I think that that would be true if the value range for each character was only the alphabet letters. But each character can have a range of thousands of values, thus in the end using a lot more bits. – Dimitrios Menounos Dec 07 '16 at 15:15
1

I'm going to describe the answer for a different number, but I suspect that the reasoning is similar. The contribution to the hash value of a character X is X*B^k, where B in your case is 31, and k depends on the position of X in the string and its length. This arithmetic is usually done modulo the word size. For this reason we want B^k to be different for different values of k.

Now, in "Handbook of Algorithms and Data Structures" by Gonnet and Baeza-Yates section 3.3.1 "Pratical hashing functions" they say "For this function the value B=131 is recommended, as B^i has a maximum cycle mod 2^k for 8 <= k <= 64." I wonder what cycle length 31 has mod 2^32? I believe that 31 will fit into a Sparc immediate operand, but 131 will not.

mcdowella
  • 19,301
  • 2
  • 19
  • 25