9

Currently Boost has hash_combine function that outputs 32 bit unsigned integer (to be precise, size_t). Some references:

http://www.boost.org/doc/libs/1_43_0/doc/html/hash/reference.html#boost.hash_combine

http://www.boost.org/doc/libs/1_43_0/doc/html/hash/combine.html

Magic number in boost::hash_combine

I'd like to explore on how to create 64 bit version of hash_combine.

The first thing is to get golden ratio or any other irrational number in 64 bit.

The second part is to use shifts. This part rather tricky and I'd like to ask if there are best practices or guide on using shifts to get hash values? Or choosing shifts like the original code:

seed ^= hash_value(v) + 0x9e3779b9 + (seed << 6) + (seed >> 2); 

is totally random?

Also how to evaluate the output of hash_combine to make sure that it doesn't create more collisions than the original hash function hash_value?

Community
  • 1
  • 1
Viet
  • 17,944
  • 33
  • 103
  • 135
  • 4
    2^64/φ is `0x9E3779B97F4A7C15`. – Kerrek SB Dec 15 '11 at 01:17
  • Thanks Kerrrek. Finding the value is not a problem. What I am interested in is are there any rules or best practices to use shifts and additions as you see in the boost::hash_combine. Or choosing shifts and additions are totally random. – Viet Dec 15 '11 at 01:20
  • I think you should [file a bug report](http://svn.boost.org/trac/boost/newticket). – kennytm Dec 15 '11 at 01:36
  • Hi Kenny, I'm asking about concepts that will enable me to write after understanding them. – Viet Dec 15 '11 at 01:39

3 Answers3

4

If you only want a hash_combine that hashes 2 64 bit values into one, and you don't need a new hash function for strings, you could just lift a tiny bit of code from CityHash, something like this (assuming size_t is a 64 bit unsigned integer, add your favorite bit of preprocessor or template trickery to validate that):

template <class T> inline void hash_combine(std::size_t& seed, const T& v)
{
    std::hash<T> hasher;
    const std::size_t kMul = 0x9ddfea08eb382d69ULL;
    std::size_t a = (hasher(v) ^ seed) * kMul;
    a ^= (a >> 47);
    std::size_t b = (seed ^ a) * kMul;
    b ^= (b >> 47);
    seed = b * kMul;
}

(I think reproducing this snippet here and elsewhere is OK because it doesn't constitute a 'substantial portion' of the CityHash code, but please check the CityHash sources & license agreement to decide for yourself)

  • 3
    your magic constant is not the one that Kerred mentions `0x9E3779B97F4A7C15` so where does it come from ? – v.oddou Apr 22 '15 at 03:39
2

Read http://burtleburtle.net/bob/hash/doobs.html for some basic information on hash function design, and the rest of the articles in http://burtleburtle.net/bob/hash/ for more detailed information. CityHash was tested using http://code.google.com/p/smhasher/, and you can probably test your hash_combine using the same testsuite.

Although I'm not an expert in hashing, the designs of recent hash functions lead me to believe that the 2-shift technique boost's hash_combine() uses is no longer state-of-the-art and can be improved on.

Jeffrey Yasskin
  • 5,171
  • 2
  • 27
  • 39
0

boost::hash_combine is not totally random, it's not even well distributed or particularly good.

A good way to combine two hashes is to first make sure both hashes are well distributed, then you can combine the two with xor. To ensure they are well distributed use a good integer hash function.

Putting it all together you might have:

uint64_t xorshift(const uint64_t& n,int i){
  return n^(n>>i);
}
uint64_t hash(const uint64_t& n){
  uint64_t p = 0x5555555555555555; // pattern of alternating 0 and 1
  uint64_t c = 17316035218449499591ull;// random uneven integer constant; 
  return c*xorshift(p*xorshift(n,32),32);
}
uint64_t hash_combine(const uint64_t& seed, const uint64_t& v) {
  uint64_t c = 17316035218449499591ull;// random integer constant;
  return hash(v)^(seed+c);
}

If the distribution of hash is not good enough for your purposes just doubly hash the values, maybe like so: hash(hash(v))^seed

Wolfgang Brehm
  • 1,491
  • 18
  • 21
  • 1
    This code fails for two important use cases common in programming: 1) Permutations. The hash_combine you propose cannot distinguish different orders of hashed values: hash_combine(hash_combine(0,1),2)==hash_combine(hash_combine(0,2),1). For example, I needed hash function in raytracing task to hash orders of objects a ray hit. 2) Zero preservation. hash_combine(0,0)==0. This is VERY dangerous since 0 is the most common value. Proposed hash function cannot distinguish sequence (0) from (0,0) and (0,0,0). These defects combined give hash(1,0,2)==hash(2,1)==hash(0,0,0,1,2) and so on. – Anton Sukhinov Sep 08 '20 at 11:45
  • 1
    You say that Boost hash_combine is bad, but it actually does not have these defects. – Anton Sukhinov Sep 08 '20 at 11:47
  • @AntonSukhinov That's true. There are several solutions, add in the position `i' to the hashing. `return hash(v*(2*i+1))^seed;` , apply the hashing function to an integer of twice the width in this case a `unsigned __int128` composed of value and seed, multiply by 3 each time, apply another weak hash function before returning, I think adding a random constant to the hash is probably the most economical. – Wolfgang Brehm Sep 08 '20 at 12:04