63

Can anyone tell me why the number 5381 is used in the DJB hash function?

The DJB hash function is defined as:

  • h 0 = 5381

  • h i = 33h i - 1 + s i

Here's a C implementation:

unsigned int DJBHash(char* str, unsigned int len)
{
   unsigned int hash = 5381;
   unsigned int i    = 0;

   for(i = 0; i < len; str++, i++)
   {   
      hash = ((hash << 5) + hash) + (*str);
   }   

   return hash;
}
Marco Bonelli
  • 63,369
  • 21
  • 118
  • 128
viji
  • 2,706
  • 5
  • 28
  • 34
  • It is a large-ish prime number, which are used as multiplers in most hash algorithms to spread out the values. – user207421 May 02 '20 at 10:39

3 Answers3

83

I stumbled across a comment that sheds some light on what DJB is up to:

/*
* DJBX33A (Daniel J. Bernstein, Times 33 with Addition)
*
* This is Daniel J. Bernstein's popular `times 33' hash function as
* posted by him years ago on comp.lang.c. It basically uses a function
* like ``hash(i) = hash(i-1) * 33 + str[i]''. This is one of the best
* known hash functions for strings. Because it is both computed very
* fast and distributes very well.
*
* The magic of number 33, i.e. why it works better than many other
* constants, prime or not, has never been adequately explained by
* anyone. So I try an explanation: if one experimentally tests all
* multipliers between 1 and 256 (as RSE did now) one detects that even
* numbers are not useable at all. The remaining 128 odd numbers
* (except for the number 1) work more or less all equally well. They
* all distribute in an acceptable way and this way fill a hash table
* with an average percent of approx. 86%.
*
* If one compares the Chi^2 values of the variants, the number 33 not
* even has the best value. But the number 33 and a few other equally
* good numbers like 17, 31, 63, 127 and 129 have nevertheless a great
* advantage to the remaining numbers in the large set of possible
* multipliers: their multiply operation can be replaced by a faster
* operation based on just one shift plus either a single addition
* or subtraction operation. And because a hash function has to both
* distribute good _and_ has to be very fast to compute, those few
* numbers should be preferred and seems to be the reason why Daniel J.
* Bernstein also preferred it.
*
*
* -- Ralf S. Engelschall <rse@engelschall.com>
*/

That's a slightly different hash function than the one you're looking at, though it does use the 5381 magic number. The code below that comment at the link target has been unrolled.

Then I found this:

Magic Constant 5381:

  1. odd number

  2. prime number

  3. deficient number

  4. 001/010/100/000/101 b

There is also this answer to Can anybody explain the logic behind djb2 hash function? It references a post by DJB himself to a mailing list that mentions 5381 (excerpt from that answer excerpted here):

[...] practically any good multiplier works. I think you're worrying about the fact that 31c + d doesn't cover any reasonable range of hash values if c and d are between 0 and 255. That's why, when I discovered the 33 hash function and started using it in my compressors, I started with a hash value of 5381. I think you'll find that this does just as well as a 261 multiplier.

Craig McQueen
  • 41,871
  • 30
  • 130
  • 181
Mark Johnson
  • 14,224
  • 4
  • 28
  • 34
  • Thanks - The last comment is what hit the nail on the head for 5381. – PlanetUnknown Mar 17 '15 at 20:12
  • They're not "slightly different". `(x << 5) + x` is bitwise multiplication. It's equivalent to `x * 33`! On some systems using the bitwise method is faster, or the only way to do multiplication. – bryc Sep 03 '18 at 20:32
38

5381 is just a number that, in testing, resulted in fewer collisions and better avalanching. You'll find "magic constants" in just about every hash algo.

Mahmoud Al-Qudsi
  • 28,357
  • 12
  • 85
  • 125
  • 3
    Those swapped URLs had me laughing. – High Performance Mark May 22 '12 at 09:16
  • @High I'm glad you're in a good humor :) Fortunately, swapping URLs is super-easy as I just had to switch the numbers around. – Mahmoud Al-Qudsi May 22 '12 at 15:12
  • I could not understand the above humor. – Suraj Jain Sep 01 '17 at 18:02
  • 1
    The question was how it was making fewer collisions? I was also laughing aloud. Moreover, the person asking the question accepted the answer without any proof!!!! – AnotherDeveloper Jan 10 '18 at 13:21
  • 2
    djb2 (like fnv1a) actually has [bad avalanche/distribution](https://i.stack.imgur.com/vm0Id.png). They fail even non-strict avalanche criterion, which takes less computing power to calculate. But they _do_ have decent collision rates. :) Often the collision rate is tied to its avalanching behaviour, which would mean djb2 isn't as good as other choices. The closer to all bits being pesuedo-random, the less likely any two values will match. – bryc Sep 03 '18 at 20:24
34

I found a very interesting property of this number may be that can be a reason for that.

5381 is 709th prime.
709 is 127th prime.
127 is 31st prime.
31 is 11th prime.
11 is 5th prime.
5 is 3rd prime.
3 is 2nd prime.
2 is 1st prime.

5381 is the first number for which this happens for 8 times. 5381st prime may exceed the limit of signed int so it is a good point to stop the chain.

Dereleased
  • 9,939
  • 3
  • 35
  • 51
Deep Joshi
  • 482
  • 5
  • 13