Looking at "What is the best algorithm for an overridden System.Object.GetHashCode?", I was struck that many of the answers suggesting hash codes of the form hash = hash*(prime) + item.GetHashCode() seed the value of hash with another prime rather than 0.

I understand the reason for the prime in the calculation portion: coprime numbers are useful in many ways.

What I don't understand is why the hash is initialised to a non-zero number in the first place.

Looking at the precise example:

int hash = 17;
hash = hash * 23 + field1.GetHashCode();
hash = hash * 23 + field2.GetHashCode();
hash = hash * 23 + field3.GetHashCode();
return hash;

As shorthand, let f1 stand for field1.GetHashCode() (and so on for the other fields) and let i be the initial hash value. This gives:

int hash = i;
hash = i * 23 + f1;
hash = (i * 23 + f1)* 23 + f2;
hash = ((i * 23 + f1)* 23 + f2)* 23 + f3;

Expanding the brackets in that last row:

hash = (i*23*23 + f1*23 + f2)* 23 + f3;
hash = i*23*23*23 + f1*23*23 + f2*23 + f3;

So, as we can see, the only effect of the initial hash value is to increase the final hash value by a constant i*23*23*23, which generalises to i*23^(number of fields).
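As a quick check of the expansion above, here is a minimal sketch that compares the seeded and unseeded results for a few arbitrary inputs. The `SeedOffsetDemo` class and its `Combine` helper are hypothetical names introduced only for this illustration, and the sample values are made up. The arithmetic is wrapped in `unchecked` so that overflow wraps around, which means the comparison holds modulo 2^32 as well.

```csharp
using System;

public static class SeedOffsetDemo
{
    // Same shape as the example above, with the seed passed in.
    // f1..f3 stand in for fieldN.GetHashCode().
    public static int Combine(int seed, int f1, int f2, int f3)
    {
        unchecked // let int arithmetic wrap instead of throwing in checked builds
        {
            int hash = seed;
            hash = hash * 23 + f1;
            hash = hash * 23 + f2;
            hash = hash * 23 + f3;
            return hash;
        }
    }

    public static void Main()
    {
        unchecked
        {
            int offset = 17 * 23 * 23 * 23; // i * 23^3 for i = 17

            // Arbitrary sample values; the last row forces intermediate overflow.
            int[][] samples =
            {
                new[] { 0, 0, 0 },
                new[] { 1, 2, 3 },
                new[] { int.MaxValue, int.MinValue, -1 },
            };

            foreach (int[] s in samples)
            {
                int seeded = Combine(17, s[0], s[1], s[2]);
                int unseeded = Combine(0, s[0], s[1], s[2]);
                // The two results differ by the same constant for every input.
                Console.WriteLine(seeded - unseeded == offset);
            }
        }
    }
}
```

Each line printed is `True`, including the row that overflows, because wrapping addition and multiplication on `int` behave as arithmetic modulo 2^32.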

So how does this help? If f1, f2 and f3 are all 0, is it a problem for the final hash to be 0? Is it better for it to be something non-zero? My only thought is that implementations of things like dictionaries or hash sets that use the hash value prefer non-zero values for some reason, but I can't think what that reason might be. The other possibility, of course, is that these things are a little bit arcane, so people use a tried and tested pattern and the initial value gets propagated even though there is no reason for it.

I tried looking up some Microsoft hash codes, but the ones I found either used external code to calculate them (object, string) or were slightly special (the implementation of GetHashCode on anonymous objects seeds the hash code from the property names of the anonymous object, which is different because it isn't a constant initial value).

So, in summary: why the initial constant value in hash code implementations?

Edit: Why use a prime number in hashCode? was suggested as a duplicate and the site wants me to edit my question to explain why it is not a duplicate... I have acknowledged that primes are used as the multiplier in the calculations and I understand why that is. This question is explicitly about the use as an initial seed in the hash code algorithm. The suggested duplicate doesn't explicitly say what the prime is used for but the answers all address the use of it as a multiplying factor which is not relevant to this question.

Chris
  • I should note as a post script that while this question is tagged c# it may well apply more generally. The only reason I have put that language tag on is that I don't know if other languages do suggest the same thing so thought I'd keep it to the specific example I was looking at. If anybody with experience in other languages wants to broaden this out then feel free. – Chris Aug 18 '17 at 10:41
  • Possible duplicate of [Why use a prime number in hashCode?](https://stackoverflow.com/questions/3613102/why-use-a-prime-number-in-hashcode) – mjwills Aug 18 '17 at 11:45

1 Answer

This question has some good answers on the Computer Science SE. In short: the initial constant is adapted from hashes that could take a variable number of inputs, and you’re right that it doesn’t matter in that example.
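One way to see why a seed can matter once the number of inputs varies is the sketch below. The `VariableLengthDemo` class and its `Combine` helper are hypothetical, written only for illustration, and the multiplier 23 and seed 17 are simply carried over from the question. With a seed of 0, sequences of different lengths whose items all hash to 0 collapse to the same value; with a non-zero seed, the length feeds into the result.

```csharp
using System;
using System.Collections.Generic;

public static class VariableLengthDemo
{
    // Hypothetical helper: folds any number of item hash codes into one value.
    public static int Combine(int seed, IEnumerable<int> itemHashes)
    {
        unchecked // wrap on overflow rather than throw
        {
            int hash = seed;
            foreach (int h in itemHashes)
            {
                hash = hash * 23 + h;
            }
            return hash;
        }
    }

    public static void Main()
    {
        int[] one = { 0 };
        int[] three = { 0, 0, 0 };

        // Seed 0: both sequences produce 0, whatever their length.
        Console.WriteLine(Combine(0, one) == Combine(0, three));   // True

        // Seed 17: the results are 17 * 23 and 17 * 23^3, which differ.
        Console.WriteLine(Combine(17, one) == Combine(17, three)); // False
    }
}
```

For a fixed number of fields, as in the question, every input goes through the same number of multiplications, so the seed contributes the same constant to all of them and this distinction disappears.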

Ry-
  • Ah, yes. It does make more sense if you have a variable number of items. I think all the examples I have seen have been simple enough that they have just used static numbers of items like the above. – Chris Aug 18 '17 at 10:50
  • That post seems to suffer from the same problem as the OP's analysis: it assumes that the intermediate sums do not overflow. Modular math is tricky, and the theory of LCGs could apply here. 17 is the seed of the random number generator, but it is one whose periodicity cannot be guaranteed. So you'd have to pick a seed that maximizes the odds that a longer cycle is picked. Hmm. – Hans Passant Aug 18 '17 at 13:00
  • @HansPassant: Hmm… does that matter? Even with overflow mod anything, the constant will apply a fixed difference of 23**3 * 17. – Ry- Aug 18 '17 at 13:16