When looking at the source code of java.lang.String of openjdk-1.6, i saw that String.hashCode() uses 31 as prime number and computes
s[0]*31^(n-1) + s[1]*31^(n-2) + ... + s[n-1]
Now the reason for me to look at this was the question i had in mind whether comparing hashCodes in String.equals would make String.equals significantly faster. But looking at hashCode now, the following questions come to my mind:
- Wouldn't a bigger prime help avoid collisions better, at least for short strings, seeing that for example "BC" has the same hash as "Ab" (since the ascii letters live in the region 65-122, wouldn't a prime higher than that work better)?
- Is it a conscious decision to use 31 as prime, or just some random one that is used because it is common?
- How likely is a hash collision, given a fixed String length? where this question is heading is the original question how good comparing hashCodes and String length could already discern strings, to avoid comparing the actual contents.
- a little off-topic, maybe: Is there a good reason String.equals does not compare hashCodes as additional shortcut?
- a little more off-topic: assuming we have two Strings with the same content, but different instances: is there any way to assert equality without actually comparing the contents? I would guess not, since someway into String lengths, the space explodes into sizes where we will inevitably have collisions, but what about some restrictions - only a certain character set, a maximum string length... and how much do we need to restrict the string space to be able to have such a hash function?