uint32_t h(const char *kbuf, int ksiz){
uint32_t hash = 751;
const char *rp = kbuf + ksiz;
while(ksiz--){
hash = (hash * 31) ^ *(uint8_t *)--rp;
}
return hash;
}
Why should the hash be calcuted this way,what's the rationale?
uint32_t h(const char *kbuf, int ksiz){
uint32_t hash = 751;
const char *rp = kbuf + ksiz;
while(ksiz--){
hash = (hash * 31) ^ *(uint8_t *)--rp;
}
return hash;
}
Why should the hash be calcuted this way,what's the rationale?
Your hash algorithm follows the same idea which lead to the modified Bernstein hash and Fowler/Noll/Vo (see eg this overview of existing hash algorithms).
XORing bytes is a classical hash algorithm. However, the resulting distribution of hash values is far from optimal, thus, an additional mixing step (in this case multiplication by 31
) was added.
Using 31
as multiplier is explained by Josh Bloch in Effective Java:
The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
The choice of 33
as multiplier in the original Bernstein hash probably follows similar reasoning; however, if hashing performance is not your main concern, you're probably better off by choosing a multiplier which results in a better distribution. If you don't want to experiment yourself, Fowler/Noll/Vo is probably a good choice.
This implementation attempts to make the input-to-output mapping uniform so that collisions are uniform too. In other words, it tries to avoid situations such that for some hash values there are way to many collisions than for others. It does that by mixing in values of a primitive pseudo-random generator into the resulting hash values. Or you could think of it the other way around, as a PRNG mixing in the input data. Prime numbers (751 and 31) help to achieve uniformity. Of course, there are no guarantees as with carefully chosen inputs you can defy these attempts.
For more details see these articles:
Hash Function - Uniformity
Linear congruential generator