I've read in other posts that this seems to be the best way to combine hash-values. Could somebody please break this down and explain why this is the best way to do it?

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

Edit: The other question only asks about the magic number, but I'd like to know about the whole function, not only that part.
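For context, here is a minimal sketch of how such a hash_combine is typically used; the Point type and hash_point helper are illustrative, not part of the question:

```cpp
#include <cstddef>
#include <functional>

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Illustrative only: fold each member of an aggregate into one hash value.
struct Point { int x, y; };

std::size_t hash_point(const Point& p)
{
    std::size_t seed = 0;
    hash_combine(seed, p.x); // seed now depends on x
    hash_combine(seed, p.y); // ...and then on y, in this order
    return seed;
}
```

Because the combiner is order-sensitive, hash_point for {1, 2} and {2, 1} produces different values.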

Maarten Bodewes
keyboard
  • Possible duplicate of [Magic number in boost::hash_combine](http://stackoverflow.com/questions/4948780/magic-number-in-boosthash-combine) – sbabbi Mar 14 '16 at 11:21
  • So: *So including this number "randomly" changes each bit of the seed; as you say, this means that consecutive values will be far apart. Including the shifted versions of the old seed makes sure that, even if hash_value() has a fairly small range of values, differences will soon be spread across all the bits.*; from the accepted answer does not work for you? – NathanOliver Mar 14 '16 at 12:12
  • Loaded question. It's not the best way. It's a generically usable one. – sehe Mar 14 '16 at 15:17
  • Alternative way to hash aggregates of types: Video: https://www.youtube.com/watch?v=Njjp_MJsgt8&feature=youtu.be Paper: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html Implementation: https://github.com/HowardHinnant/hash_append – Howard Hinnant Mar 14 '16 at 19:11
  • Another approach is to overload a "series" template that iterates over all the data in a UDT ( https://github.com/ywkaras/trafficserver/blob/fnv1a/lib/ts/Series.h ). Once the series template is defined for a type, a hash template function will work with the type ( https://github.com/ywkaras/trafficserver/blob/fnv1a/lib/ts/fnv1aHash.h ). (Here is a quick unit test of this code using catch.hpp https://github.com/ywkaras/trafficserver/blob/fnv1a/lib/ts/unit-tests/test_fnv1aHash.cc .) – WaltK Apr 10 '18 at 14:21
  • `boost::hash_combine` should be deprecated in favour of http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n3980.html – Maxim Egorushkin Jan 05 '21 at 10:29

2 Answers

It being the "best" is argumentative.

It being "good", or even "very good", at least superficially, is easy.

seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);

We'll presume seed is a previous result of hasher or this algorithm.

^= is XOR-assignment: every bit of the operand on the left and the operand on the right influences the bits of the result.

hasher(v) is presumed to be a decent hash on v. But the rest is defence in case it isn't a decent hash.

0x9e3779b9 is a 32 bit value (arguably it should be extended to 64 bits if size_t is 64 bits) that contains a roughly even mix of 0s and 1s. It is basically a random-looking series of 0s and 1s, obtained by approximating a particular irrational constant (the reciprocal of the golden ratio) as a base-2 fixed point value. This helps ensure that if the hasher returns bad values, we still get a smear of 1s and 0s in our output.

(seed<<6) + (seed>>2) is a bit shuffle of the incoming seed.

Imagine the 0x constant was missing. Imagine the hasher returns the constant 0x01000 for almost every v passed in. Now, each bit of the seed is spread out over the next iteration of the hash, during which it is again spread out.

Under seed ^= (seed<<6) + (seed>>2), the value 0x00001000 becomes 0x00041400 after one iteration, then 0x01021100 after another. As you repeat the operation, any set bits are "smeared out" over the output bits. Eventually the right- and left-shifted copies collide, and the resulting carries move set bits from "even locations" to "odd locations".

The number of bits that depend on an input seed grows relatively fast and in complex ways as the combine operation recurses on the seed. Adding causes carries, which smear things even more. The 0x constant adds a bunch of pseudo-random bits that make boring hash values occupy more than a few bits of the hash space after being combined.

It is asymmetric thanks to the addition (combining the hashes of "dog" and "god" gives different results), it handles boring hash values (such as one mapping characters to their ASCII values, which only twiddles a handful of bits), and it is reasonably fast.
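A sketch of that asymmetry (the hash_combine copy and the hash_string helper here are illustrative):

```cpp
#include <cstddef>
#include <functional>
#include <string>

template <class T>
inline void hash_combine(std::size_t& seed, const T& v)
{
    seed ^= std::hash<T>{}(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2);
}

// Fold the characters of a string through the combiner, in order.
std::size_t hash_string(const std::string& s)
{
    std::size_t seed = 0;
    for (char c : s)
        hash_combine(seed, c); // the order of the characters matters
    return seed;
}
```

hash_string("dog") and hash_string("god") come out different; a commutative combiner such as a plain XOR of the character hashes would map both strings to the same value.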

Slower hash combines that are cryptographically strong can be better in other situations. I, naively, would presume that making the shifts a combination of even and odd shifts might be a good idea (but maybe addition, which can carry bits from even positions into odd ones, makes that less of a problem: after 3 iterations, incoming lone seed bits will collide, add, and cause a carry).

The downside to this kind of analysis is that it only takes one mistake to make a hash function really bad. Pointing out all the good things doesn't help that much. So another thing that makes it good now is that it is reasonably famous and in an open-source repository, and I haven't heard anyone point out why it is bad.

Yakk - Adam Nevraumont

It's not the best; surprisingly to me, it's not even particularly good. The main problem is the bad distribution, which is not really the fault of boost::hash_combine in itself, but of its use in conjunction with a badly distributing hash like std::hash, which is most commonly implemented as the identity function.

[Figure: boost::hash_combine entropy matrix] The effect of a single-bit change in one of two random 32-bit numbers on the result of boost::hash_combine. On the x-axis are the input bits (two times 32: first the new hash, then the old seed); on the y-axis are the output bits. The colors indicate the degree of dependence.

To demonstrate how bad things can become, these are the collisions for points (x,y) on a 32x32 grid when using hash_combine as intended, together with std::hash:

# hash_combine(hash_combine(0,x₀),y₀)=hash_combine(hash_combine(0,x₁),y₁)
# hash      x₀   y₀  x₁  y₁
3449074105  6   30   8  15
3449074104  6   31   8  16
3449074107  6   28   8  17
3449074106  6   29   8  18
3449074109  6   26   8  19
3449074108  6   27   8  20
3449074111  6   24   8  21
3449074110  6   25   8  22

For a well distributed hash there should be none, statistically. One could make a hash_combine that cascades more (for example by using multiple, more spread out xor-shifts) and preserves the entropy better (for example by using bit-rotations instead of bit-shifts). But what you should really do is use a good hash function in the first place; after that, a simple xor is sufficient to combine the seed and the hash, provided the hash encodes the position in the sequence. For ease of implementation the following hash does not encode the position. To make hash_combine non-commutative, any non-commutative and bijective operation is sufficient. I chose an asymmetric binary rotation because it is cheap.

#include <cstddef>     // std::size_t
#include <cstdint>     // uint32_t, uint64_t
#include <functional>  // std::hash
#include <limits>      // std::numeric_limits
#include <type_traits> // std::enable_if, std::is_unsigned

template<typename T>
T xorshift(const T& n,int i){
  return n^(n>>i);
}

// a hash function, named differently so as not to confuse it with std::hash
uint32_t distribute(const uint32_t& n){
  uint32_t p = 0x55555555ul; // pattern of alternating 0s and 1s
  uint32_t c = 3423571495ul; // random odd integer constant
  return c*xorshift(p*xorshift(n,16),16);
}

// the 64-bit variant, named differently so as not to confuse it with std::hash
uint64_t distribute(const uint64_t& n){
  uint64_t p = 0x5555555555555555ull; // pattern of alternating 0s and 1s
  uint64_t c = 17316035218449499591ull; // random odd integer constant
  return c*xorshift(p*xorshift(n,32),32);
}

// if C++20 std::rotl is not available:
template <typename T, typename S>
constexpr typename std::enable_if<std::is_unsigned<T>::value, T>::type
rotl(const T n, const S i){
  const T m = (std::numeric_limits<T>::digits-1);
  const T c = i&m;
  return (n<<c)|(n>>((T(0)-c)&m)); // this is usually recognized by the compiler to mean rotation; C++20 now gives us std::rotl directly
}

// call this with the old seed and the new key to be hashed; it returns the combined value, i.e. the new seed, respectively the final hash
template <class T>
inline std::size_t hash_combine(std::size_t seed, const T& v)
{
    return rotl(seed, std::numeric_limits<std::size_t>::digits/3) ^ distribute(std::hash<T>{}(v));
}

The seed is rotated once before combining it to make the order in which the hash was computed relevant.

The hash_combine from boost needs two fewer operations and, more importantly, no multiplications; in fact it is about 5x faster. But at about 2 cycles per hash on my machine, the proposed solution is still very fast and pays off quickly when used for a hash table. There are 118 collisions on a 1024x1024 grid (vs. 982017 for boost's hash_combine + std::hash), about as many as expected for a well distributed hash function, and that is all we can ask for.

Even when used in conjunction with a good hash function, boost::hash_combine is not ideal: if at some point all the entropy is in the seed, some of it will get lost. There are only 2948667289 distinct results of boost::hash_combine(x,0), but there should be 4294967296.

In conclusion, they tried to create a hash function that does both, combining and cascading, and does it fast, but ended up with something that does both just well enough not to be recognised as bad immediately. But fast it is.

Wolfgang Brehm
  • Good answer. This made me doubt why everyone is using this function instead of something better so I spent a few hours in a hole researching this. Actually std::hash does not list good bit avalanching as a property that implementations need to have, so while you are correct that this is a poor hash function in general, it actually completely satisfies the requirements set up by std::hash. As an example, the std::hash implementation of size_t is usually just the identity function – which is perfectly fine. – Hannes Landeholm May 11 '20 at 00:58
  • @HannesLandeholm hash_combine combines two hashes; std::hash just hashes a single object. You want the least collisions in general, so when compressing you need to make sure that every input bit statistically changes every output bit. In general you can't know which input bits are more important, so this maximises the probability of getting a different compressed output when a few bits change in the input. When input and output have the same size however, the identity has the least collisions and you might only want to improve the distribution, for hash tables for example. – Wolfgang Brehm May 11 '20 at 09:31
  • @HannesLandeholm It entirely depends on the requirements but they would have to be very specific to make hash_combine the optimal solution. std::hash however is good in many circumstances. And if nothing else always choose the solution with the least computations to achieve the goal. – Wolfgang Brehm May 11 '20 at 09:38
  • The canonical use of boost hash_combine is to implement a std::hash for some custom type, usually when you have a struct or array of many things that can by themselves be std::hash-ed and need to be combined. Therefore boost::hash_combine can be said to be designed for and intended to fulfill std::hash requirements. It's so common that cppreference.com even directly references boost::hash_combine. – Hannes Landeholm May 11 '20 at 15:09
  • It would be interesting if you can provide a practical example of an input sequence that generates collisions and is also organic, a sequence that you could encounter in the real world, which is not just crafted to generate collisions with boost::hash_combine. (std::hash implementations are not required to be collision resistant). I tried various examples, like only modifying top bits, lower bits, sequences, only changing single bits, but did not succeed. – Hannes Landeholm May 11 '20 at 15:28
  • @HannesLandeholm on a 128x128 grid there are 8177 collisions where there should be none for an evenly distributed hash function. And I don't even need to exploit the weak high bits. These are the collisions for a 32x32 grid, {hash,{x₀,y₀},{x₁,y₁}}: {3449074105,{6,30},{8,15}},{3449074104,{6,31},{8,16}},{3449074107,{6,28},{8,17}},{3449074106,{6,29},{8,18}},{3449074109,{6,26},{8,19}},{3449074108,{6,27},{8,20}},{3449074111,{6,24},{8,21}},{3449074110,{6,25},{8,22}} – Wolfgang Brehm May 11 '20 at 20:04
  • I cannot reproduce your claim. For example hash_combine_boost(6, 30) != hash_combine_boost(8, 15). (2654436190 vs 2654436290) I looked at all 32x32 integer inputs, and none of them collide. – Hannes Landeholm May 12 '20 at 20:26
  • @HannesLandeholm that's not how you are supposed to use hash_combine_boost. You are supposed to do: `int32_t seed = hash_combine_boost(0,6); seed = hash_combine_boost(seed,30)` But if you do use it like that you can find even more collisions, just a sec... – Wolfgang Brehm May 12 '20 at 20:30
  • Never mind, I realized you are doing hash_combine_boost(hash_combine_boost(0, x), y), yes I see now. – Hannes Landeholm May 12 '20 at 20:35
  • @HannesLandeholm if you do it the other way try these: {2654439093,{49,127},{50,66},{51,1}} {2654441880,{94,118},{95,55},{96,39}} {2654442072,{95,119},{96,103},{97,40}} {2654443237,{113,127},{114,66},{115,1}} {2654441878,{94,120},{95,57},{96,37}} {2654442070,{95,121},{96,101},{97,38}} {2654439821,{62,107},{63,42},{64,4}} {2654439219,{51,123},{52,65},{53,0}} {2654442553,{104,126},{105,61},{106,0}} {2654439835,{62,93},{63,28},{64,18}} {2654439474,{55,127},{56,67},{57,4}} – Wolfgang Brehm May 12 '20 at 20:39
  • Yes, so it's pretty clear that boost hash_combine is a terrible hash function since you can find these trivial collisions, and it clearly doesn't satisfy the std::hash operator() requirement 5 "For two different parameters k1 and k2 that are not equal, the probability that std::hash<Key>()(k1) == std::hash<Key>()(k2) should be very small, approaching 1.0/std::numeric_limits<std::size_t>::max()." This is what I suspected all along when I saw the implementation. My jaw dropped when I saw how widespread the use of this ad-hoc amateur hash function is. – Hannes Landeholm May 12 '20 at 20:44
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/213729/discussion-between-wolfgang-brehm-and-hannes-landeholm). – Wolfgang Brehm May 12 '20 at 20:45
  • Got here doing a first stint of research into hash combining, as it is very useful. My knowledge is not fantastic, so if I'm wrong, please say so. The test above has a few odd things imo. 1) Wmath namespace is not actually used 2) in a template, using unary minus on an unsigned type is considered an error in VS. 3) std::numeric_limits<T>::digits-1 evaluates to 30 for uint32, is that what you want? 4) const T m = (std::numeric_limits<T>::digits-1) can be a constexpr What about specializing rol() with intrinsics _rotl (8, 16, 32 or 64) in VS? these exist for all common sizes. – Jan May 14 '20 at 20:57
  • @myself (couldn't edit comment any longer) 3) is wrong, I had read up about signed type. Learning all the time. – Jan May 14 '20 at 21:17
  • @Jan I fixed 1) , I don't see the problem with 2) , which line does the compiler complain about? 4) with c++ 20 rotl is now in the standard, my post is not meant was not meant as a copy and paste solution... I'm planning to update it for a long time, the best solution to the problem is to use a [good integer hash function](https://stackoverflow.com/questions/664014/what-integer-hash-function-are-good-that-accepts-an-integer-hash-key/57556517#57556517) and then you can just xor the two hashes to combine them. – Wolfgang Brehm May 15 '20 at 00:25
  • @WolfgangBrehm I'll add a (partial) solution containing the specialization as I cannot put that into a comment. The line complained about is return (n<<c)|(n>>((-c)&m)); it gives error C4146: unary minus operator applied to unsigned type, result still unsigned – Jan May 16 '20 at 09:43
  • @Jan Ah yes, I see, my compiler never complained and it really worked, I promise ;) But there is another way, I'll fix it. – Wolfgang Brehm May 16 '20 at 12:10
  • "then after that a simple xor is sufficient" – this seems to contradict later in the answer where you note that you ought to do something to ensure that your hash_combine does not do the same things to (x,y) and (y,x). Might be worth editing that bit out or changing it, lest readers miss a very important nuance. – Milo Brandt Jan 05 '21 at 00:57
  • @MiloBrandt Right, that is imprecise language, if you look at the implementation it is not just a xor, but a rotation as well. If you want to, you could take it to mean that a good hash function would have to contain information about the position in the sequence as well as the value. I'll try to make that clearer. – Wolfgang Brehm Jan 05 '21 at 09:42
  • I don't understand the figure nor the table. What do the image axes stand for? What do the colors stand for? Are these two images or one rectangular image? About the table: What do the x's and the y's stand for, exactly? Please make the answer more idiot-proof. – einpoklum Jan 15 '22 at 09:22
  • @einpoklum the colorful figures have on the x axis the input to hash_combine, first the new hash to combine, then the old seed that is carried over. On the y axis is the output, that is the new seed. The color shows the dependence of the output to the input. – Wolfgang Brehm Jan 15 '22 at 09:36
  • @einpoklum the table is the hash collisions for points on a 32x32 grid, that is two points (x₀,y₀) and (x₁,y₁) both having the same hash when using std::hash and hash_combine. – Wolfgang Brehm Jan 15 '22 at 09:40
  • @WolfgangBrehm: 1. Please add this to the answer. 2. I still don't get it. What do "First" and "then" mean in your comment? How do different colors correspond to different extents of dependence? Where does randomness come in? What is meant by "dependence of the output to the input"? I'm sorry to nitpick, but I believe you're assuming people share some assumptions/notions with you which they may not. – einpoklum Jan 15 '22 at 09:42
  • @einpoklum On the x axis I layed out the input bits. boost::hash_combine takes two numbers as input, the old seed and a hash value. To flatten them onto a single axis I first took the 32 bits of the new hash, occupying the space from x=0 to x=32 and then the 32 bits of the old seed, occupying the space from x=32 to x=64. – Wolfgang Brehm Jan 15 '22 at 09:49
  • The `hash` function is defined, but not used in the presented source code. It uses `std::hash` instead. Why? – plasmacel Jan 15 '22 at 09:55
  • @einpoklum I didn't think the colorscale was important, but I took lots of random numbers for the seed and computed the output, then looked at what the correlation of the output bit n (from 0 to 32) is when changing each input bit m (from 0 to 64) individually. Yellow means the output changes every time, black is the output bit changes never. – Wolfgang Brehm Jan 15 '22 at 09:56
  • @plasmacel a good question, I can see why this is confusing. `std::hash` is defined for many types, `hash` only as an example for integers. But because `std::hash` is not guaranteed to lead to an evenly distributed hash value, the first thing we need to do is to call `distribute` on the output. The source code is not meant as a copy-to-your-project-and-fix-everything, but as an example. A serious project would re-implement the hash function itself to guarantee a well distributed output. – Wolfgang Brehm Jan 15 '22 at 10:01
  • @WolfgangBrehm `distribute` takes an `uint32_t`, while the output of `std::hash` is a `std::size_t`, which is on a 64-bits system will be most likely equivalent to `uint64_t`. Could you edit the answer to provide a 64-bits version of it, just like in https://stackoverflow.com/a/57556517/2430597? – plasmacel Jan 15 '22 at 10:10