Hashcode calculation why multiply and ignore overflow bits?

Question

This question is not about why one multiplies, that is fairly obvious - its about distribution.

But rather this is more about one property of multiplication that becomes more important the more factors are included in a hashcode calculation formula.

A simple calculation obviously may overflow but that is of little importance.

a * 31 + b

The real problem is demonstrated when many items are in the formula.

((a * 31) + b) * 31 ... 6n.

Once more than 5 or 6 terms are include the value of the first term is lost as its bits have overflow by the time the hashcode value is up to including the 5+ term. Using this system only the last 5 or so terms are really significant contributors to the final value.

31 ^ 7 > Integer.MAX_VALUE

So why dont most calculations roll the bits that overflow back around and xor w/ the lower bits of the result. I appreciate this requires some bit fiddling and calculations must be done using longs(64 bits) so the top 32 bits can be XOR'd with the integer result but at the very least no bits would be lost.

Is there any particular reason why overflow is ignored ? Its not that costly to use a long as described previously.

EDIT

100000*31^7=            2751261411100000       0x9C641F717C560
6553600000*31^7 180306667837849600000    0xC641F717C5600000

Note that the latter value is exactly 65536 times larger than the previous which also means that its answer is 16 bits larger. Notice that the integer value of 0xC641F717C5600000 is 0xC5600000 the actual significant values are lost from the 16 bit value.

*SAMPLE A*
65536*4096*27512614111  

=7385361114638319616
=0x667E12CDF0000000
   12345678
=0xF0000000

*SAMPLE B*
9*65536*4096*27512614111

=66468250031744876544
=0x9A6EA93D70000000
   12345678
=0x70000000

Notice that the top most bit of SAMPLE B which is exactly 9x SAMPLE A makes almost absolute no difference in the final 32 bit value - if i changed 9x to 17x then the lower bits would be identical. However if the top most bits were not "lost" due to overflow and xord with the lower 32 bits then the value would be different.

score 3 · Answer 1 · answered Feb 10 '11 at 00:57

3

That is the benefit from multiplying by an odd number; the earlier numbers never fall off the end of the integer completely. For an element to be lost, 31^n would need to be a power of 2, and that cannot happen. In your case, with 31^7, for example, you get 0x67E12CDF for a 32-bit number; thus, the input element multiplied by that value will still contribute to the result, despite overflow.

answered Feb 10 '11 at 00:57

Jeremiah Willcock

30,161
7
76
78

Yes but over time only the very low bits are actually present in the hashcode. – mP. Feb 10 '11 at 02:46
@mP: What do you mean? All of the input elements affect the final hash code when you use an odd multiplier. – Jeremiah Willcock Feb 10 '11 at 02:47
@Jeremiah i have replied in my original q w/ some maths and examples of my pt. – mP. Feb 10 '11 at 03:41
@mP: I don't see any new edits in your question. You might need to try the code out to see how it behaves. Just write a function that does what you show in your question, and test it on strings that have different beginnings but the same ending. If you use something like 31 as the constant, the hash codes will change no matter what part of the string you change. – Jeremiah Willcock Feb 10 '11 at 03:43
@JW check for the bolded EDIT, you will see the figures demonstrate how the high bits are eventually lost due to multiplication. I could try and contrive an example in code but the math above should be enough. – mP. Feb 10 '11 at 08:18
@mP: Are those bits "eventually lost" or lost right away? It is true that if you use large numbers as input to the hash code, they will not have much effect, but that will not change as you add more numbers afterwards (try your same example with various powers of 31, and it will behave similarly). If you are hashing strings, though, the issue you show will not appear. – Jeremiah Willcock Feb 10 '11 at 16:33
@JW I guess "eventually" or perhaps "earlier" without actually giving exact definitions of those terms. – mP. Feb 11 '11 at 02:53

score 2 · Answer 2 · answered Feb 10 '11 at 00:56

2

Is there any particular reason why overflow is ignored ? Its not that costly to use a long as described previously.

But there's almost surely no gain from it. This methodology typically produces a good distribution of values to begin with.

answered Feb 10 '11 at 00:56

jason

236,483
35
423
525

2

Not only that, but a long would run into the same problem, just would take a little `long`er. (sorry, that was a bad one...) – corsiKa Feb 10 '11 at 08:22
The entire reason for prime numbers as the multiplying factor is because odds mean that values are shifted to the left and eventually all bits get lost. However primes still have the same prob they are jsut a little better and take longer for bits to disappear. – mP. Feb 11 '11 at 12:03

score 0 · Answer 3 · answered Feb 10 '11 at 14:01

0

I don't see the point in the examples. They seem, to me, unrelated to the way you calculate hash codes : a * 31 + b.

You could, perhaps, find some a's and b's, that would give the same hashcode, (but where the high bits are different). Then it would make sense to xor the high bits back in to the hashcode.

Or, a different example would be for ((a * 31) + b )*31 + ... + z. Then find some a,b,...,z, where the hashcode does not depend on a anymore. So that a won't be a significant contributor.

Of course, if you change 31 by 65536, it is pretty easy to find those a,...,z. Any value will do, all of a bits would simply fall off, a gets shifted to left and the cut off. But, can you do so for 31? Or similar, you could xor the high bits back in. But, why? Can you find a case where it helps?

The problem with 65536 is, that in binary it looks like this 10000000000000000. So, when you multiply a number by it, in binary it'll have those 16 zero's again. For 31, 11111 in binary, that won't happen.

Oh, I din't mean those examples don't exists, because they do (it's just a hash after all). But, you won't find many or similar examples.

answered Feb 10 '11 at 14:01

Ishtar

11,542
1
25
31

The first part was trying quite poorly to demonstrate how bits overflow and disappear from multiplication. Your comment about 65536 is exactly right.The above calculations show that the "hi" order bits become lost rather quickly, thus if the first term has a hashcode of 0x10001 or 0x30001, 0x70001 or 0xffff0001 are quickly lost. – mP. Feb 11 '11 at 02:52
My comments were trying to point out that the act of multiplication introducing 0 bits which could be replaced with some appropriate 1s if the overflow was not ignored. – mP. Feb 11 '11 at 02:54
@mP - You are right about the multiplication. But your question is about hashcode distribution, right? Good distribution and losing high bits are unrelated, **if** you use `31` and not `65536`. – Ishtar Feb 11 '11 at 12:51
using 65536 which is effectively a shift of 16 only shows the prob im highlighting best. Mult by 31 n times will also shift low bits until they overflow. – mP. Jun 04 '12 at 03:11

Hashcode calculation why multiply and ignore overflow bits?

3 Answers3