0

In a question about a very simple hashing algorithm called djb2, the author wants to know why the number 33 is chosen in the algorithm (see below code in C).

unsigned long;
hash(unsigned char *str)
{
    unsigned long hash = 5381;
    int c;

    while (c = *str++) //just the character
        hash = ((hash << 5) + hash) + c; /* hash * 33 + c */

    return hash;
}

In the top answer, point 2 talks about the hashing accumulator and how it makes two copies of itself, and then it says something about the spreading.

Can someone explain what is meant by "copying itself" and the "spread" of answer 2?

GoldenRetriever
  • 155
  • 3
  • 11

1 Answers1

1

The step 2 being references is this:

  1. As you can see from the shift and add implementation, using 33 makes two copies of most of the input bits in the hash accumulator, and then spreads those bits relatively far apart. This helps produce good avalanching. Using a larger shift would duplicate fewer bits, using a smaller shift would keep bit interactions more local and make it take longer for the interactions to spread.

33 is 32+1. That means, thanks to multiplication being distributive, that hash * 33 = (hash * 32) + (hash * 1) - or in other words, make two copies of hash, shift one of them left by 5 bits, then add them together, which is what (hash << 5) + hash expresses in a more direct way.

harold
  • 61,398
  • 6
  • 86
  • 164
  • Does this then assume that we are dealing with a 64 bit machine? If so, we can do more operations per word resulting in less I/O to disk? I mean, why is the answer 2 important? – GoldenRetriever May 14 '21 at 08:07
  • @GoldenRetriever 64bit is not necessary for this, and djb2 was not designed with it in mind (it couldn't have been, it dates back to 1991). We can certainly compute a better hash using more arithmetic operations.. djb2 is not a very high quality hash, especially compared to modern 64bit hashes – harold May 14 '21 at 08:12
  • 1. Couldn't we use another multiplicative than 33 and still gain two copies? 2. Why is it that when using a larger shift it would duplicate fewer bits and vice versa? – GoldenRetriever May 14 '21 at 08:25
  • Any sum of two powers of two would make two copies (and you could use more copies for more mixing). Larger shifts throw away more bits from the top of the hash, and don't mix the lowest bits of the hash. Large shifts are actually OK to include, but having *only* large shifts would work very poorly. Consider `0x80000001` as multiplier, it would be very bad. – harold May 14 '21 at 08:28
  • OK. Why is it that a bit shift of 5 gives 33 and not 32? 2^5 is 32 and not 33? – GoldenRetriever May 15 '21 at 11:24
  • 1
    A bit shift of 5 is a multiplication by 32, the multiplication is by 33 here because `hash` itself is also added – harold May 15 '21 at 11:25
  • ah okay. So if we have an unsigned long (32 bits), then as we do bit shifts it will eventually overflow, right? leaving us with the lower bits being representative and in the top of the binary representation. Am I clear? – GoldenRetriever May 15 '21 at 11:44