11

Is this a correct implementation of the Knuth multiplicative hash.

int hash(int v)
{
    v *= 2654435761;
    return v >> 32;
}

Does overflow in the multiplication affects the algorithm?

How to improve the performance of this method?

José
  • 3,041
  • 8
  • 37
  • 58
  • 1
    You almost certainly want to use `unsigned int` (or `unsigned long long`, since it appears to depend on a size >32 bits) instead of plain `int`. – Jerry Coffin Aug 08 '12 at 18:57
  • Yes, overflow will definitely prevent this from working. In fact if your `int` is typical this code will always return 0 or -1. – Mark Ransom Aug 08 '12 at 18:58
  • Your bit shift is (if int is 32bit) to far. How many bits should your hash have? Substract them from 32 – Fox32 Aug 08 '12 at 18:58
  • @Fox32 ok, so if i want 32bit hash, i don't need to shift bits – José Aug 08 '12 at 19:02
  • @Fox32 is drawing the wrong conclusion, yes you do have to shift bits. You just need to start with an integer type that is larger than 32 bits, such as `uint64_t`. – Mark Ransom Aug 08 '12 at 19:05
  • 1
    Assuming typical circumstances, the overflow is actually OK (not portable, though). The shift is weird (tries to throw away all bits), but accidentally and non-portably typically works out OK (shift amount is taken modulo 32 for 32bit arguments on x86). – harold Aug 08 '12 at 19:05
  • @MarkRansom the Knuth multiplicative hash I know takes the multiplication modulo 2^32, ie the lower half, not the upper half. – harold Aug 08 '12 at 19:06
  • @harold I don't have a copy of Knuth on hand to check and Google wasn't being very helpful either. I have no reason to doubt you, but that would just be another knock against this implementation. You might consider leaving that as an answer. – Mark Ransom Aug 08 '12 at 19:24
  • @MarkRansom I'd better look it up and be sure first.. – harold Aug 08 '12 at 19:38

4 Answers4

29

Knuth multiplicative hash is used to compute an hash value in {0, 1, 2, ..., 2^p - 1} from an integer k.

Suppose that p is in between 0 and 32, the algorithm goes like this:

  • Compute alpha as the closest integer to 2^32 (-1 + sqrt(5)) / 2. We get alpha = 2 654 435 769.

  • Compute k * alpha and reduce the result modulo 2^32:

    k * alpha = n0 * 2^32 + n1 with 0 <= n1 < 2^32

  • Keep the highest p bits of n1:

    n1 = m1 * 2^(32-p) + m2 with 0 <= m2 < 2^(32 - p)

So, a correct implementation of Knuth multiplicative algorithm in C++ is:

std::uint32_t knuth(int x, int p) {
    assert(p >= 0 && p <= 32);

    const std::uint32_t knuth = 2654435769;
    const std::uint32_t y = x;
    return (y * knuth) >> (32 - p);
}

Forgetting to shift the result by (32 - p) is a major mistake. As you would lost all the good properties of the hash. It would transform an even sequence into an even sequence which would be very bad as all the odd slots would stay unoccupied. That's like taking a good wine and mixing it with Coke. By the way, the web is full of people misquoting Knuth and using a multiplication by 2 654 435 761 without taking the higher bits. I just opened the Knuth and he never said such a thing. It looks like some guy who decided he was "smart" decided to take a prime number close to 2 654 435 769.

Bare in mind that most hash tables implementations don't allow this kind of signature in their interface, as they only allow

uint32_t hash(int x);

and reduce hash(x) modulo 2^p to compute the hash value for x. Those hash tables cannot accept the Knuth multiplicative hash. This might be a reason why so many people completely ruined the algorithm by forgetting to take the higher p bits. So you can't use the Knuth multiplicative hash with std::unordered_map or std::unordered_set. But I think that those hash tables use a prime number as a size, so the Knuth multiplicative hash is not useful in this case. Using hash(x) = x would be a good fit for those tables.

Source: "Introduction to Algorithms, third edition", Cormen et al., 13.3.2 p:263

Source: "The Art of Computer Programming, Volume 3, Sorting and Searching", D.E. Knuth, 6.4 p:516

InsideLoop
  • 6,063
  • 2
  • 28
  • 55
  • Isn't the main reason people "forgot to take the higher p bits" actually just that they were using 4-bit unsigned integers, for which `p = 32`, so `32 - p` is 0? – Andy Dec 21 '18 at 20:40
  • Also note that if using the full 32 bits, the hash will transform even numbers to even numbers. – Andy Dec 26 '18 at 20:09
16

Ok, I looked it up in TAOCP volume 3 (2nd edition), section 6.4, page 516.

This implementation is not correct, though as I mentioned in the comments it may give the correct result anyway.

A correct way (I think - feel free to read the relevant chapter of TAOCP and verify this) is something like this: (important: yes, you must shift the result right to reduce it, not use bitwise AND. However, that is not the responsibility of this function - range reduction is not properly part of hashing itself)

uint32_t hash(uint32_t v)
{
    return v * UINT32_C(2654435761);
    // do not comment about the lack of right shift. I'm not ignoring it. read on.
}

Note the uint32_t's (as opposed to int's) - they make sure the multiplication overflows modulo 2^32, as it is supposed to do if you choose 32 as the word size. There is also no right shift by k here, because there is no reason to give responsibility for range-reduction to the basic hashing function and it is actually more useful to get the full result. The constant 2654435761 is from the question, the actual suggested constant is 2654435769, but that's a small difference that as far as I know does not affect the quality of the hash.

Other valid implementations shift the result right by some amount (not the full word size though, that doesn't make sense and C++ doesn't like it), depending on how many bits of hash you need. Or they may use an other constant (subject to certain conditions) or an other word size. Reducing the hash modulo something is not a valid implementation, but a common mistake, likely it is a de-facto standard way to do range-reduction on a hash. The bottom bits of a multiplicative hash are the worst-quality bits (they depend on less of the input), you only want to use them if you really need more bits, while reducing the hash modulo a power of two would return only the worst bits. Indeed that is equivalent to throwing away most of the input bits too. Reducing modulo a non-power-of-two is not so bad since it does mix in the higher bits, but it's not how the multiplicative hash was defined.

So to be clear, yes there is a right shift, but that is range reduction not hashing and can only be the responsibility of the hash table, since it depends on its internal size.

The type should be unsigned, otherwise the overflow is unspecified (thus possibly wrong, not just on non-2's-complement architectures but also on overly clever compilers) and the optional right shift would be a signed shift (wrong).

On the page I mention at the top, there is this formula:

knuth formula

Here we have A = 2654435761 (or 2654435769), w = 232 and M = 232. Calculating AK/w gives a fixed-point result with the format Q32.32, the mod 1 step takes only the 32 fraction bits. But that's just the same thing as doing a modular multiplication and then saying that the result is the fraction bits. Of course when multiplied by M, all the fraction bits become integer bits because of how M was chosen, and so it simplifies to just a plain old modular multiplication. When M is a lower power of two, that just right-shifts the result, as mentioned.

Community
  • 1
  • 1
harold
  • 61,398
  • 6
  • 86
  • 164
  • 2
    ["CS 3110 Lecture 21: Hash functions: Multiplicative hashing"](http://www.cs.cornell.edu/courses/cs3110/2008fa/lectures/lec21.html) claims that "The division by 2^q is crucial. The common mistake when doing multiplicative hashing is to forget to do it". – David Cary Oct 29 '15 at 15:09
  • 1
    @DavidCary this is fine, what I've done here is choosing M=2^32, but when not all 32 bits are desired it's the bottom bits that must be discarded. – harold Oct 29 '15 at 16:44
  • 1
    @harold This Wikipedia entry https://en.wikipedia.org/wiki/Hash_table#Choosing_a_good_hash_function claims that multiplicative hashing has poor clustering and hence is not suitable for open addressing scheme (https://en.wikipedia.org/wiki/Open_addressing). How true is it? If it is the case, is there an alternative? Thanks very much – bytefire Jan 11 '16 at 14:42
  • 2
    @bytefire in my experience it's not bad. Note that the source of that claim used the bottom bits, which is the worst thing to do, so it's not surprising they got a bad result. The bottom bits of a product do not depend on any higher bits from the input, so it's equivalent to throwing most of the key away before hashing. – harold Jan 11 '16 at 15:12
  • 1
    @harold, I second David Cary's comment. Your function is a 'modular hashing', rather than 'multiplicative hashing', it is pretty much equivalent to `h(k) = k mod n` with n being a power of 2, which is not optimal. You have to use high bits of the multiplication result as you mentioned by yourself. – maximk Dec 05 '16 at 07:35
  • 1
    @maximk I *am* using the high bits. Those high bits are not the upper half of the multiplication (see the `mod 1` in the original formula), but rather the upper bits of the low half, and *those bits are all in there*. The result of this has to be shifted right rather than reduced by modulo, but I can't already do that because that would mean I've already chosen the final size. E: and this function is pretty clearly not equivalent to `k mod n`. There is no modulo by `n` anywhere, `n` isn't even an input. – harold Dec 05 '16 at 12:16
  • 1
    @harold I think would help to add more clarification to your answer because your point about the bottom bits being worst really confused me too, since the `mod 1` is throwing away the top 32 bits of the multiplication. – Andy Dec 26 '18 at 19:40
  • 1
    @Andy ok but could you perhaps give a more specific suggestion for improvement? Honestly I've been trying for years to improve this answer and it just doesn't work, it keeps being misinterpreted - I've requested deletion too but the mods haven't been helpful with that so far – harold Dec 26 '18 at 20:11
  • 2
    @harold fair enough. I think talking about concrete numbers of bits would make it clear that you're talking about the already-overflowed result of the multiplication: "if you need to reduce the resulting 32-bit hash value to a smaller number, say 24 bits, because something in your system requires a 24-bit value, then you should use the top 24 bits (i.e. shift right by 8) rather than using the bottom 24 bits, since the top bits depend on more of the input." – Andy Dec 26 '18 at 21:10
2

Might be late, but heres a Java Implementation of Knuth's Method :

For a hashtable of Size N :

public long hash(int key) {
    long l = 2654435769L;
    return (key * l >> 32) % N ;
}
because_im_batman
  • 975
  • 10
  • 26
0

If the input argument is a pointer then I use this

#include <inttypes.h>

uint32_t knuth_mul_hash(void* k) {
  ptrdiff_t v = (ptrdiff_t)k * UINT32_C(2654435761);
  v >>= ((sizeof(ptrdiff_t) - sizeof(uint32_t)) * 8); // Right-shift v by the size difference between a pointer and a 32-bit integer (0 for x86, 32 for x64)
  return (uint32_t)(v & UINT32_MAX);
}

I usually use this as the default fallback hashing function in hashmap implementations, dictionaries, sets, etc...

couven92
  • 28
  • 6