47

I have some vector of integer that I would like to store efficiently in a unordered_map in c++11 my question is this:

How do I best store these and optimize for .find queries?

I came up with the following hasher:

class uint32_vector_hasher {
public:
  std::size_t operator()(std::vector<uint32_t> const& vec) const {
    std::size_t ret = 0;
    for(auto& i : vec) {
      ret ^= std::hash<uint32_t>()(i);
    }
    return ret;
  }
};

and then store the objects in an unordered_map I do however have a couple of questions

  1. how often does the hash get calculated, only one, some random number or times?
  2. Would it make sense to create a wrapper object with == and hash functions to make memorize the hash and avoid it being calculated more than once?

When profiling I've noticed that a rather large amount of my cpu time is spend doing lookups on the unordered maps, this is not exactly optimal :(

Martin Kristiansen
  • 9,875
  • 10
  • 51
  • 83
  • 1
    Hash gets done once for each insert and once each look-up, and potentially again for each object every time the underlying table is resized. – RichardPlunkett Dec 11 '13 at 05:42
  • 1
    to clarify, the system does not hash things that are already in the table, just because you are looking something up, rather just the key you are using for lookup. – RichardPlunkett Dec 11 '13 at 05:44
  • 2
    btw, xor is an appalling hash combiner – RichardPlunkett Dec 11 '13 at 05:44
  • The best way to hash your vector really depends on the nature of the data. Also, if you're doing a lot of lookups but your data changes infrequently, you might cache the hash value in the structure itself, so the hasher just returns the precomputed value. – Joe Z Dec 11 '13 at 05:46
  • 8
    Some examples to expand on @RichardPlunkett's comment about xor being a bad combiner: if two vectors have the same data but in a different order, they'll have the same hash. If a vector has more than one of the same value, most of those values won't factor into the combined hash (they cancel each other out). If values are typically small (or rather don't use the full range of bits in the `uint32_t`) then the most significant bits won't be used in the combined hash. – Michael Burr Dec 11 '13 at 06:35
  • @MichaelBurr I sort my vectors before inserting. – Martin Kristiansen Dec 11 '13 at 07:54
  • 1
    @Martin, sort is unlikely to help, except that it means one particular terrible example wont come up, many others exist. – RichardPlunkett Dec 11 '13 at 08:06
  • @RichardPlunkett, What i ment to say was that I consider vectors more like sets in the case - I don't care about sequence. – Martin Kristiansen Dec 11 '13 at 20:13
  • Could anyone tell me: what is `std::hash` used for in the above code piece? – Alfred Nov 04 '21 at 02:57

4 Answers4

47

So, when not wanting to use boost, Michael Blurr's comment led to the following hash function implementation:

std::size_t operator()(std::vector<uint32_t> const& vec) const {
  std::size_t seed = vec.size();
  for(auto& i : vec) {
    seed ^= i + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }
  return seed;
}

Seems to work.

Edit: see's answer is a little bit slower, but indeed yields a better hash distribution. I'd go with that one.

HolKann
  • 678
  • 9
  • 15
  • 3
    It's a starting seed. It will quickly get mutated by the xor operations (`^=`). `vec.size()` (line 2) would probably be better, as it takes more vector information into account. I've adjusted the reply. – HolKann Mar 31 '16 at 13:03
  • 3
    std::accumulate can also be applied https://gcc.godbolt.org/z/vaK4h4dce – Aykhan Hagverdili Mar 23 '22 at 02:17
  • 1
    Depending on the properties of the distribution of the vector components, it might be overkill to combine all of them. –  Jul 18 '22 at 14:27
9

The hash function in the currently highest voted answer by HolKann results in a high collision rate for numerous vectors that all contain elements from a small continuous distribution.

To combat this, bits of each element are distributed evenly (algorithm taken from Thomas Mueller's answer).

std::size_t operator()(std::vector<uint32_t> const& vec) const {
  std::size_t seed = vec.size();
  for(auto x : vec) {
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = ((x >> 16) ^ x) * 0x45d9f3b;
    x = (x >> 16) ^ x;
    seed ^= x + 0x9e3779b9 + (seed << 6) + (seed >> 2);
  }
  return seed;
}
see
  • 388
  • 4
  • 13
  • 3
    Nice. I've used Thomas Mueller's answer recently as well. Maybe expand your answer with a 64 bit version? – HolKann Jul 18 '22 at 14:22
2

boost::hash_combine is good enough but not particularly good

HolKann's answer is good enough, but I'd recommend using a good hash for each entry and then combining them. The problem is std::hash is not a good hash and boost::hash_combine is not strong enough to make up for that.

template<typename T>
T xorshift(const T& n,int i){
  return n^(n>>i);
}

uint32_t hash(const uint32_t& v) {
  uint32_t p = 0x55555555ul; // pattern of alternating 0 and 1
  uint32_t c = 3423571495ul; // random uneven integer constant; 
  return c*xorshift(p*xorshift(n,16),16);
}

// if c++20 rotl is not available:
template <typename T,typename S>
typename std::enable_if<std::is_unsigned<T>::value,T>::type
constexpr rotl(const T n, const S i){
  const T m = (std::numeric_limits<T>::digits-1);
  const T c = i&m;
  return (n<<c)|(n>>((T(0)-c)&m)); // this is usually recognized by the compiler to mean rotation, also c++20 now gives us rotl directly
}

class uint32_vector_hasher {
public:
  std::size_t operator()(std::vector<uint32_t> const& vec) const {
    std::size_t ret = 0;
    for(auto& i : vec) {
      ret = rotl(ret,11)^hash(i);
    }
    return ret;
  }
};
Wolfgang Brehm
  • 1,491
  • 18
  • 21
1

I tried see's answer to solve a leet code problem. But for some inputs, the function would overflow ints. So, I reverted to your approach. But, your function causes lots of collisions if you have elements like: {0}, {0, 0}, {0, 0, 0}, etc. because hash of int is the number itself and all these hash to 0.

I tweaked it slightly to include the index to reduce the collision rate:

struct hash {
    std::size_t operator()(std::vector<int> const& vec) const {
        std::hash<uint32_t> h;
        std::size_t ret = vec.size();
        for(auto& i : vec) {
            ret ^= h(i) | i;
        }
        return ret;
    }
};

I am just Oring the hash with the index so {0}, {0, 0}, {0, 0, 0} produce different hashes. Its a very bad hash function but it works for my purposes :P

Hemil
  • 916
  • 9
  • 27
  • What happens if you just cast all `int`s to `uint32_t`? Then see's answer should still work, and I guess that would be faster and with less collisions than the std hash (based on Thomas Mueller's arguments). – HolKann Jan 12 '23 at 11:14
  • @HolKann I dont know. But I guess the answer should still be the same because hash of ints is the value of the int. – Hemil Jan 12 '23 at 11:35