7

When looking at the source code of java.lang.String of openjdk-1.6, i saw that String.hashCode() uses 31 as prime number and computes

s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]

Now the reason for me to look at this was the question i had in mind whether comparing hashCodes in String.equals would make String.equals significantly faster. But looking at hashCode now, the following questions come to my mind:

  • Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
  • Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
  • How likely is a hash collision, given a fixed String length? where this question is heading is the original question how good comparing hashCodes and String length could already discern strings, to avoid comparing the actual contents.
  • a little off-topic, maybe: Is there a good reason String.equals does not compare hashCodes as additional shortcut?
  • a little more off-topic: assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents? I would guess not, since someway into String lengths, the space explodes into sizes where we will inevitably have collisions, but what about some restrictions - only a certain character set, a maximum string length... and how much do we need to restrict the string space to be able to have such a hash function?
kutschkem
  • 7,826
  • 3
  • 21
  • 56
  • Off-topic- You can't use `hashCode()` to check equality because, practically 2 un-eqaul String objects **may have** same `hashCode()` – sanbhat Jul 17 '13 at 08:43
  • @sanbhat OP means if using `hashCode` can be a first way to know if both `Strings` must really compare its contents. – Luiggi Mendoza Jul 17 '13 at 08:45
  • 3
    @sanbhat I think it's quite clear from the question that the OP is aware of this. The part of the question that is relevant asks why we don't use `hashcode` to shortcut `equals` i.e. if hashcodes are different they cannot be equal. – selig Jul 17 '13 at 08:46
  • 3
    hashCode() runs through all the chars in the string. Whereas equals, most of the time, can stop looping after 1 or 2 characters. I would guess that this is the reason hashCode() is not used by equals(). It could be used if both Strings already have a computed (and cached) hashCode, though, but my guess is that Sun engineers have already benchmarked it. – JB Nizet Jul 17 '13 at 08:47
  • 1
    @JBNizet hashCode is cached in String – kutschkem Jul 17 '13 at 08:49
  • 3
    Look at this related post for the explanation of why 31: http://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier – Mik378 Jul 17 '13 at 08:50
  • 1
    Yes, I'm aware of that. What I'm saying is that equals could check if both strings have already computed their hashcode, and compare them if that's the case. I'm not implying that the hashCode is not cached. – JB Nizet Jul 17 '13 at 08:51
  • @Mik378 Ok, but there are many primes that look like (2^n)-1, any of those could be used with the same reasoning. – kutschkem Jul 17 '13 at 08:53
  • 1
    @kutschkem The most interesting property of 31 is: the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i Not every (2^n)-1 has this awesome property. – Mik378 Jul 17 '13 at 08:58
  • 1
    @Mik378 When would they not have this property? Any multiplication with a power of two can be replaced by a shift!? – kutschkem Jul 17 '13 at 09:01
  • 1
    About the 'little more off-topic' question: let's assume you limit the character set to lowercase, uppercase, and 12 special characters (because 64 is so convenient). Then even a perfect hash function could only allow strings up to length 5 (64^5 = 2^30 possibilities), otherwise an integer hash code wouldn't be able to distinguish them all. – Vincent van der Weele Jul 17 '13 at 09:03
  • Could bring some answer: http://www.javamex.com/tutorials/collections/hash_function_technical_2.shtml The reason of 31 and not any other could be that there are 26 letters in english alphabet, thus covering the whole. There is no need to involve a bigger number than 31 for Strings' hashcode. The interest for shifting without a lot of loss (maybe none in certain case) is to blend bits each other between a current haschcode and the immediate previous calculated hashcode (explaining the subtraction), in order to more likely get a unique result. – Mik378 Jul 17 '13 at 09:11
  • @kutschkem I think they meant a *single* shift. – Dave Newton Jul 17 '13 at 09:50
  • @Mik378 31 doesn't even cover the upper- and lower-case characters, let alone the digits, let alone the punctuation, let alone the special characters. Not a convincing explanation. Not that one is really neeed. – user207421 Jul 17 '13 at 10:04
  • @EJP Indeed...you're right for the addition of upper cases etc.. – Mik378 Jul 17 '13 at 10:09

1 Answers1

9

Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?

Each character in a String can take 65536 values (2^16). The set of Strings of 1 or 2 characters is therefore larger than the number of int and any hashcode calculation methodology will produce collisions for strings that are 1 or 2 character long (which qualify as short strings I suppose).

If you restrict your character set, you can find hash function that reduce the number of collisions (see below).

Note that a good hash must also provide a good distribution of output. A comment buried in this code advocates using 33 and gives the following reasons (emphasis mine):

If one compares the chi^2 values [...] of the variants the number 33 not even has the best value. But the number 33 and a few other equally good numbers like 17, 31, 63, 127 and 129 have nevertheless a great advantage to the remaining numbers in the large set of possible multipliers: their multiply operation can be replaced by a faster operation based on just one shift plus either a single addition or subtraction operation. And because a hash function has to both distribute good and has to be very fast to compute, those few numbers should be preferred.

Now these formulae were designed a while ago. Even if it appeared now that they are not ideal, it would be impossible to change the implementation because it is documented in the contract of the String class.

Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?

Why does Java's hashCode() in String use 31 as a multiplier?

How likely is a hash collision, given a fixed String length?

Assuming each possible int value has the same probability of being the result of the hashcode function, the probability of collision is 1 in 2^32.

Is there a good reason String.equals does not compare hashCodes as additional shortcut?

Why does the equals method in String not use hash?

assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents?

Without any constraint on the string, there isn't. You could intern the strings then check for reference equality (==), but if many strings are involved, that can be inefficient.

how much do we need to restrict the string space to be able to have such a hash function?

If you only allow small cap letters (26 characters), you could design a hash function that generates unique hashes for any strings of length 0 to 6 characters (inclusive) (sum(i=0..6) (26^i) = 3.10^8).

Community
  • 1
  • 1
assylias
  • 321,522
  • 82
  • 660
  • 783
  • 1
    +1 I can't believe no one else has upvoted this quite comprehensive and considered answer. Good job (as always!) – Bohemian Jul 17 '13 at 09:56
  • @Bohemian That's very kind, thank you. – assylias Jul 17 '13 at 10:03
  • +1 very good answer, although i was kind of assuming that ascii characters are far more likely than, let's say, chinese characters. I still accept this as answer because it states the important reason "it is documented that way and shouldn't change", which seems valid, albeit not very satisfying. And honestly i think the accepted answer to http://stackoverflow.com/questions/299304/why-does-javas-hashcode-in-string-use-31-as-a-multiplier lacks very much... – kutschkem Jul 17 '13 at 12:16