uint32_t jenkins_one_at_a_time_hash(const uint8_t* key, size_t length) {
  size_t i = 0;
  uint32_t hash = 0;
  while (i != length) {
    hash += key[i++];
    hash += hash << 10;
    hash ^= hash >> 6;
  }
  hash += hash << 3;
  hash ^= hash >> 11;
  hash += hash << 15;
  return hash;
}

Source: https://en.wikipedia.org/wiki/Jenkins_hash_function

When adding hash << 10 to hash, the addition might overflow if the string is long. When hashing a string with a polynomial hash function, we take the value mod p (where p is a large prime within the range of uint32_t) in each iteration, and return hash % hash_table_length. But in this algorithm, why don't we compute hash = (hash + (hash << 10)) % p to prevent overflow? And how do we ensure that the returned hash value stays within the hash table length?

templatetypedef

1 Answer


The one-at-a-time hash is designed to take as input a string of bytes and then output a (roughly) uniformly-distributed 32-bit value. Importantly, this means that if you want to use the hash function to distribute keys into a hash table, you’ll need to perform some secondary math to convert the 32-bit hash to a table index because the hash function itself isn’t designed to restrict its output to a smaller number of slots. So, for example, you could mod the result by the number of table slots if you wanted to use it to distribute keys.

The reasoning in the above paragraph essentially boils down to “as initially written, the hash function doesn’t make any attempt to fit into a small number of slots.” But there’s a deeper question here too: why doesn’t the hash function make any provisions for dealing with integer overflow, and why doesn’t it mod its internal values by a prime at each step? That has to do with the design principles guiding the hash function. The one-at-a-time hash is described in an article about building families of hash functions with certain desirable properties using invertible transformations. The idea is to build a hash of a long byte stream by incrementally applying invertible transforms that combine together the existing hash value and new bytes. Doing so has the useful property that if you’ve hashed byte streams X and Y and didn’t get a collision, then there’s no chance of getting a collision if you append the same byte sequence to both X and Y. The specific operations used in the hash are indeed invertible, but the justifications as to why are subtle and rely on the fact that adding in shifted copies of the hash and allowing the computation to overflow corresponds to multiplication modulo 2^32 by invertible values. In that sense, there are intentionally no mods here because the implicit modulo computed by an overflow performs the “correct” modulo to make all the steps invertible.

This contrasts with polynomial hashing modulo some prime. You’re correct that when using the polynomial hashing strategy you need to be mindful of integer overflow because you specifically don’t want these implicit mods to kick in - after all, you’re modding by a different number! But that’s particular to that specific hashing strategy. There are lots of different ways to build hash functions, each of which uses a different approach.
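For comparison, here is a minimal sketch of the polynomial strategy (the base and prime are arbitrary illustrative choices): it widens the intermediate to 64 bits and reduces mod p at every step precisely so that the implicit mod-2^32 wraparound never kicks in.

```c
#include <stdint.h>
#include <stddef.h>

uint32_t polynomial_hash(const uint8_t* key, size_t length) {
  const uint64_t base = 31;          /* hypothetical small base */
  const uint64_t p = 2147483647ULL;  /* 2^31 - 1, a prime within uint32_t range */
  uint64_t hash = 0;
  for (size_t i = 0; i < length; i++) {
    /* hash < p, so hash * base + key[i] < 2^36: no 64-bit overflow,
     * and the explicit mod keeps the arithmetic in the intended field */
    hash = (hash * base + key[i]) % p;
  }
  return (uint32_t)hash;             /* caller still mods by the table size */
}
```

Here an unintended wraparound would silently replace “mod p” with “mod 2^64,” breaking the math, which is why this strategy must guard against overflow while the one-at-a-time hash deliberately embraces it.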

templatetypedef
  • So, according to what I understood, polynomial hashing is a different strategy that specifically requires its internal values to be reduced mod a prime to provide a good distribution. Jenkins's one-at-a-time is another strategy, which doesn't require any internal mod operations (overflow may occur), but at the end I can always mod the final hash by the number of table slots and end up with a good distribution. Am I right? –  Jan 17 '22 at 05:13
  • Yep! Sounds good. – templatetypedef Jan 17 '22 at 17:57