
I have been reading about hashcode functions for the past couple of hours and have accumulated a couple of questions regarding the use of prime numbers as multipliers in custom hashcode implementations. I would be grateful for some insight regarding the following questions:

  • In a comment to @mattb's answer here, @hstoerr advocates for the use of larger primes (such as 524287) instead of the common prime 31. My question is, given the following implementation of a hashcode function for a pair of elements:

    @Override
    public int hashCode() {
        final int prime = 31;
        int hash1 = (pg1 == null) ? 0 : pg1.hashCode();
        int hash2 = (pg2 == null) ? 0 : pg2.hashCode();
        return prime * (hash1 ^ hash2);
    }
    

doesn't this lead to an overflow of the returned int if the prime is a large number?

  • Assuming that the overflow is not a problem (the JVM doing an automatic cast), is it better to do a bitshift instead of a cast?

  • I imagine the performance of the hashcode function varies significantly based on the complexity of the hashcode. Does the size of the prime multiplier not affect the performance?

  • Is it better/smarter/faster to use multiple primes in a custom hashcode function instead of a single multiplier? If not, is there some other advantage? See the example below from @jinguy's answer to a relevant question:

    public int hashCode() {
        return a * 13 + b.hashCode() * 23 + (c? 31: 7);
    }
    

where a is an int, b is a String and c is a boolean.

  • How about something like long lhash = prime * (hash1 ^ hash2); and then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here on SO, but it wasn't really explained why it was a good idea to do it like that.
posdef
  • It's probably better to post multiple questions separately, instead of grouped together as above. This makes it easier to answer, and focused questions are more likely to be useful to others in future Google searches. Kudos for citing references on your research, by the way. – GargantuChet Aug 22 '12 at 15:54
  • 2
    @GargantuChet while I'd normally agree with your statement, I think these 4 question are actually pretty closely related to one another. Asking a new question for every detail of the subject will ultimately lead to overflooding of questions, I think. (Would be cool to see what the mods think about it, though) – posdef Aug 22 '12 at 15:56

2 Answers


Apologies in advance for the novel. Feel free to make suggestions or edit directly. --Chet

There is an overflow, but not an exception.

The danger doesn't come from losing accuracy, but from losing range. Let's use a ridiculous example, where "prime" is a large power of 2, and use 8-bit unsigned numbers for brevity. And assume that (hash1 ^ hash2) is 255:

        "prime": 1000 0000
(hash1 ^ hash2): 1111 1111

Showing the truncated digits in brackets, our result is:

        product: [0111 1111] 1000 0000

But multiplying by 128 is the same as shifting left by 7 places. So we know that whatever the value of (hash1 ^ hash2), the least-significant places of the product will have seven zeros. So if (hash1 ^ hash2) is odd (least significant bit = 1), then the result of multiplying by 128 will always be 128 (after truncating the higher digits). And if (hash1 ^ hash2) is even (LSB is 0), then the product will always be zero.

This extends to larger bit sizes. The general point is that if the lower bits of "prime" are zeros, you're doing a shift (or multiple shift + sum) operation that will give you zeros in the lower bits. And the range of the product of multiplication will suffer.
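The collapse is easy to reproduce. Here is a small sketch (my own, not from the original discussion) that multiplies every 8-bit value by 128 and truncates to 8 bits, mimicking the tables above:

```java
// Demonstrates that an even "prime" (here 128, a power of 2) forces the low
// bits of the truncated product to zero, collapsing the output range.
class EvenMultiplierDemo {
    // Multiply by 128 and keep only the low 8 bits, like the 8-bit example.
    static int mul128Truncated(int x) {
        return (x * 128) & 0xFF;
    }

    public static void main(String[] args) {
        java.util.Set<Integer> results = new java.util.TreeSet<>();
        for (int x = 0; x <= 255; x++) {
            results.add(mul128Truncated(x));
        }
        // Only two distinct outputs survive: 0 for even inputs, 128 for odd.
        System.out.println(results); // [0, 128]
    }
}
```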

But let's try making "prime" odd, so that the least significant bit will always be 1. Think about decomposing this into shift / add operations. The unshifted value of (hash1 ^ hash2) will always be one of the summands. The least significant bits that were shifted into guaranteed uselessness by an even "prime" multiplier will now be set based on, at minimum, the bits from the original (hash1 ^ hash2) value.

Now, let's consider a value of prime which is actually prime. If it's more than 2, then we know it's odd. So the lower bits haven't been shifted into uselessness. And by choosing a sufficiently large prime, you get better distribution across the range of output values than you'd get with a smaller prime.

Try some exercises with 16-bit multiplication using 8443 (0010 0000 1111 1011) and 59 (0000 0000 0011 1011). They're both prime, and the low six bits of 59 match the low six bits of 8443. For example, if hash1 and hash2 are both ASCII character values (0 .. 255), then all of the results of (hash1 ^ hash2) * 59 will be <= 15045. This means that roughly three-quarters of the range of hash values (0..65535) for a 16-bit number goes unused.

But (hash1 ^ hash2) * 8443 is all over the map. It overflows if (hash1 ^ hash2) is as low as 8. It uses all 16 bits even for very small input numbers. There's much less clustering of hash values across the overall range, even if the input numbers are in a relatively small range.
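To make the contrast concrete, here is a quick sketch (mine, using the same 16-bit truncation as the worked example):

```java
// Compares the spread of truncated 16-bit products for the small prime 59
// versus the larger prime 8443 over ASCII-sized inputs (0..255).
class PrimeSpreadDemo {
    // Product truncated to 16 bits, as in the examples above.
    static int mulTrunc16(int h, int prime) {
        return (h * prime) & 0xFFFF;
    }

    public static void main(String[] args) {
        int max59 = 0;
        for (int h = 0; h <= 255; h++) {
            max59 = Math.max(max59, mulTrunc16(h, 59));
        }
        System.out.println(max59);    // 15045 = 255 * 59; the top of the 16-bit range is never reached
        System.out.println(8 * 8443); // 67544, already past 65535: 8443 wraps around even for h = 8
    }
}
```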

Assuming that the overflow is not a problem (the JVM doing an automatic cast), is it better to do a bitshift instead of a cast?

Most likely not. The JVM should translate into an efficient implementation on the host processor anyway. Integer multiplication should be implemented in hardware. And if not, the JVM is responsible for translating the operation into something reasonable for the CPU. It's very likely that the case of integer multiplication is highly optimized already. If integer multiplication is done more quickly on a given CPU as shift-and-add, the JVM should implement it that way. But it's less likely that the folks writing the JVM would care to watch for cases where multiple shift-and-add operations could have been combined into a single integer multiply.
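For what it's worth, the multiply-versus-shift equivalence is easy to verify by hand; this sketch (mine) shows the classic rewrite a compiler or JIT might apply for the multiplier 31:

```java
// 31 * x == (x << 5) - x for every int, overflow behavior included, which is
// why a JIT is free to pick whichever form is faster on the target CPU.
class ShiftEquivalence {
    static int timesThirtyOne(int x) {
        return (x << 5) - x; // 32*x - x
    }

    public static void main(String[] args) {
        for (int x : new int[] {0, 1, -7, 123456789, Integer.MAX_VALUE}) {
            if (timesThirtyOne(x) != 31 * x) throw new AssertionError();
        }
        System.out.println("equivalent"); // prints "equivalent"
    }
}
```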

I imagine the performance of the hashcode function varies significantly based on the complexity of the hashcode. Does the size of the prime multiplier not affect the performance?

No. The operations are the same when done in hardware regardless of the size, number of bits set, etc. It's probably a couple of clock cycles. It would vary depending on the specific CPU, but should be a constant-time operation regardless of the input values.

Is it better/smarter/faster to use multiple primes in a custom hashcode function instead of a single multiplier? If not, is there some other advantage?

Only if it reduces the possibility of collisions, and this depends on the numbers you're using. If your hash code depends on A and B and they're in the same range, you might consider using different primes or shifting one of the input values to reduce overlap between the bits. Since you're depending on their individual hash codes, and not their values directly, it's reasonable to assume that their hash codes provide good distribution, etc.

One factor that comes to mind is whether you want the hash code for (x, y) to be different from (y, x). If your hash function treats A and B in the same way, then hash(x, y) = hash(y, x). If that's what you want, then by all means use the same multiplier. If not, using a different multiplier would make sense.
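A minimal sketch of the difference (my own; 17 is just an arbitrary second prime):

```java
// XOR-based combination is symmetric in its arguments; distinct multipliers
// per argument make the order matter.
class PairHashDemo {
    static int symmetric(int h1, int h2) {
        return 31 * (h1 ^ h2); // hash(x, y) == hash(y, x)
    }

    static int ordered(int h1, int h2) {
        return 31 * h1 + 17 * h2; // order-sensitive
    }

    public static void main(String[] args) {
        System.out.println(symmetric(5, 9) == symmetric(9, 5)); // true
        System.out.println(ordered(5, 9) == ordered(9, 5));     // false
    }
}
```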

How about something like long lhash = prime * (hash1 ^ hash2); and then using (int)((lhash >> 32) ^ lhash)? That's something I saw on another question here on SO, but it wasn't really explained why it was a good idea to do it like that.

Interesting question. In Java, longs are 64-bit and ints are 32-bit. So this generates a hash using twice as many bits as desired, and then derives the result from the high and low bits combined.

If multiplying a number n by a prime p, and the lowermost k bits of n are all zeros, then the lowermost k bits of the product n * p will also be all zeros. This is fairly easy to see -- if you're multiplying, say, n = 0011 0000 and p = 0011 1011, then the product can be expressed as the sum of two shift operations. Or,

00110000 * p = 00100000 * p + 00010000 * p
             = p << 5 + p << 4

Taking p = 59 and using unsigned 8-bit ints and 16-bit longs, here are some examples.

 64: 0011 1011 * 0100 0000 = [ 0000 1110 ] 1100 0000 (192)
128: 0011 1011 * 1000 0000 = [ 0001 1101 ] 1000 0000 (128)
192: 0011 1011 * 1100 0000 = [ 0010 1100 ] 0100 0000 (64)

By just dropping the high bits of the result, the range of the resulting hash value is limited when the low bits of the non-prime multiplicand are all zeros. Whether that's an issue in a specific context is, well, context-specific. But for a general hash function it's a good idea to avoid limiting the range of output values even when there are patterns in the input numbers. And in security applications, it's even more critical to avoid anything that would let someone make inferences about the original value based on patterns in the output. Just taking the low bits reveals the exact values of some of the original bits. If we make the assumption that the operation involved multiplying an input number with a large prime, then we know that the original number had as many zeros at the right as the hash output (because the prime's rightmost bit was 1).

By XORing the high bits with the low bits, there's less consistency in the output. And more importantly, it's much harder to make guesses about the input values based on this information. Based on how XOR works, it could mean the original low bit was 0 and the high bit was 1, or the original low bit was 1 and the high bit was 0.

 64: 0011 1011 * 0100 0000 = 0000 1110 1100 0000 => 1100 1110 (206)
128: 0011 1011 * 1000 0000 = 0001 1101 1000 0000 => 1001 1101 (157)
192: 0011 1011 * 1100 0000 = 0010 1100 0100 0000 => 0110 1100 (108)
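Putting it together, the question's hashCode could be reworked along these lines (a sketch under my own assumptions: 524287 is the larger prime suggested in the comments, and the unsigned shift `>>>` is used for the fold, mirroring what `Long.hashCode()` does):

```java
// Pair hash that multiplies in 64 bits and folds the high half back into
// the low half, so overflow bits are mixed in rather than discarded.
class Pair {
    private final Object pg1, pg2; // field names taken from the question

    Pair(Object pg1, Object pg2) {
        this.pg1 = pg1;
        this.pg2 = pg2;
    }

    @Override
    public int hashCode() {
        final long prime = 524287;
        long hash1 = (pg1 == null) ? 0 : pg1.hashCode();
        long hash2 = (pg2 == null) ? 0 : pg2.hashCode();
        long lhash = prime * (hash1 ^ hash2);
        return (int) ((lhash >>> 32) ^ lhash);
    }
}
```

Note that XORing the two field hashes keeps this symmetric: `new Pair(a, b)` and `new Pair(b, a)` still hash the same, which matches what the question's author asked for in the comments.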
GargantuChet
  • +1: really nice and detailed answer, in particular the examples. Thanks for taking your time. as for the bitshift question how about something like `long lhash = prime * (hash1 ^ hash2);` then using `(int)((lhash >> 32) ^ lhash)`? That's something I saw on another question here SO, but it wasn't really explained why it was a good idea to do it like that... – posdef Aug 22 '12 at 22:26
  • @posdef, this was an interesting thought exercise and I'm wondering whether there are any gaps in my response. I'm not asking you to accept my answer, just wondering if there's anything I've overlooked since no answers have been accepted yet. – GargantuChet Aug 28 '12 at 20:26
  • to be honest I am not sure, I was preparing for a conference when I asked this question and now I am actually at that conference. So I wanted to go through your answer once more in detail before I respond. It would be a pity not to give it proper thought, seeing that you have put so much time and effort into writing up such a detailed answer. :) – posdef Aug 29 '12 at 08:39
  • @posdef No worries, as mentioned the thought process in itself was reward enough. I appreciate the attention to detail, as should anyone who happens to come across this post in the future. Enjoy the conference! – GargantuChet Aug 29 '12 at 16:33
  • so now that I have managed to go through this answer, and particularly the edit about the last part, I have to say it has been very informative, somewhat enlightening even. :) So to sum it up, the last option of using more bits (from the `long` value) and XORing them together is a good way to go in distributing the values? – posdef Sep 17 '12 at 14:02
  • I'd agree with that -- using a long and then coalescing into 32 bits seems like a great way to spread the output values, even when there are patterns in the input that would normally reduce variability when using ints alone. – GargantuChet Sep 17 '12 at 19:28
  • @GargantuChet: I would think using a `long` and coalescing into 32 bits should be a normal approach when combining multiple hash values into a larger hash. The performance cost will be slight, but such an approach can make *systemic* collisions all but impossible. If the outer container uses 32-bit math when combining inner-item hash codes, there's a greater risk that patterns in the data might cause some parts of the hash code computation to cancel out other parts. – supercat Feb 21 '15 at 22:13
  • Overflow is not a problem. Hashes are constrained to a narrow value set anyway.

  • The first hash function you posted isn't very good. Doing `return (prime * hash1) ^ hash2;` instead would reduce the number of collisions in most cases.

  • Multiplying by a single-word int is generally very fast, and the difference between multiplying by different numbers is negligible. Plus the execution time is dwarfed by everything else in the function anyway.

  • Using different prime multipliers for each part may reduce the risk of collisions.
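A hedged sketch of those points combined (my own; the multipliers are arbitrary odd primes, and `combine` is a hypothetical helper, not from the answer):

```java
// Uses a distinct prime per field and mixes via (prime * h) ^ next rather
// than prime * (h1 ^ h2), following the second bullet above.
class MultiPrimeExample {
    static int combine(int a, String b, boolean c) {
        int h = 13 * a;
        h = (23 * h) ^ b.hashCode();
        h = 31 * h + (c ? 31 : 7);
        return h;
    }

    public static void main(String[] args) {
        System.out.println(combine(1, "x", true) != combine(1, "x", false)); // true
    }
}
```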

Antimony
  • This particular hash function is for an object holding a pair of other objects; I want the hash to be the same since {a,b} is the same pair as {b,a}. Regarding the last statement, by saying that it *may* reduce the risk of collision, do you mean it's believed so but there is no proof/theory that it would actually reduce the risk of collision? Or did I misunderstand? – posdef Aug 22 '12 at 15:58
  • 1
    @posdef Its impossible to prove because there are always pathological case where it won't be true. – Peter Lawrey Aug 22 '12 at 16:38
  • 2
    Given that humans are involved you can say that pathological case will occur more often than you might expect. – Peter Lawrey Aug 22 '12 at 16:40
  • @posdef It will reduce the risk of collision under most cases, but there are always cases where it won't. It's like asking if driving on the right side of the road will make you less likely to get hit. There's always the possibility of people accidentally or maliciously driving on the wrong side of the road to collide with you. – Antimony Aug 22 '12 at 16:41
  • 1
    @PeterLawrey Thanks for putting a smile on my face; sitting at the office late it doesn't happen often that I get a smile out of the blue. :) – posdef Aug 22 '12 at 16:43
  • 1
    An example, String.hashCode() will compute the hashCode once provided its has not been computed before. i.e. if the hashCode is 0, it computes the hashCode. But what if the hashCode is 0, it computes it repeatedly. Well empty string is 0, and the chances of any other being 0 is 4 billion to one. But if you are a hacker you can construct a wide variety of strings, all with a hashCode of 0 which can help you in your DOS attack. Some of them look like normal words. http://stackoverflow.com/questions/2310498/why-doesnt-strings-hashcode-cache-0 – Peter Lawrey Aug 22 '12 at 16:47
  • `"A person who never yodelled an apology, never preened vocalizing transsexuals.".hashCode()` is 0, lol. – Peter Lawrey Aug 22 '12 at 16:49
  • @posdef: Even if you want a symmetric hash function, taking the xor of two other hash functions is apt to be a bad idea if the fact that one hash yields 1234 means the other is more likely to also yield 1234 than any other value. I'm not sure why xor is so popular for hashing, given that "+" works just as well in environments that can do "wrapping" arithmetic. – supercat Jan 13 '14 at 18:45
  • Xor and addition have similar properties, assuming modulo arithmetic. – Antimony Jan 13 '14 at 20:57
  • 1
    @Antimony: Some useful properties are similar, but addition is much better in cases where things where two values match are going to be more common than things where they don't. Computing X+X for all `int` values will yield one collision for each. Computing X^X for all `int` values will yield 4,294,967,295 collisions for each. – supercat Feb 21 '15 at 23:41