Best way to calculate a running hash for an unordered_map?

Question

I've got a simple wrapper-class for std::unordered_map that updates a running hash-code for the unordered_map's contents, as key-value pairs are added or removed; that way I never have to iterate over the entire contents to get the current hash code for the set. It does this by adding to the _hash member-variable whenever a new key-value pair is added, and subtracting from the _hash member-variable whenever an existing key-value pair is removed. This all works fine (but see the toy implementation below if you want a code-example of what I mean).

My only concern is that I suspect that simply adding and subtracting values from _hash might not be the optimal thing to do from the perspective of minimizing the likelihood of hash-value collisions. Is there a mathematically better way to compute the running-hash-code for the table, that would still preserve my ability to efficiently add/remove items from the table (i.e. without forcing me to iterate over the table to rebuild a hash code from scratch every time?)

#include <functional>
#include <unordered_map>
#include <string>
#include <iostream>

template<typename KeyType, typename ValueType> class UnorderedMapWithHashCode
{
public:
   UnorderedMapWithHashCode() : _hash(0) {/* empty */}

   void Clear() {_map.clear(); _hash = 0;}

   void Put(const KeyType & k, const ValueType & v)
   {
      Remove(k);  // to deduct any existing value from _hash
      _hash += GetHashValueForPair(k, v);
      _map[k] = v;
   }

   void Remove(const KeyType & k)
   {
      if (_map.count(k) > 0)
      {
         _hash -= GetHashValueForPair(k, _map[k]);
         _map.erase(k);
      }
   }

   const std::unordered_map<KeyType, ValueType> & GetContents() const {return _map;}

   std::size_t GetHashCode() const {return _hash;}

private:
   std::size_t GetHashValueForPair(const KeyType & k, const ValueType & v) const
   {
      return std::hash<KeyType>()(k) + std::hash<ValueType>()(v);
   }

   std::unordered_map<KeyType, ValueType> _map;
   std::size_t _hash;
};

int main(int, char **)
{
   UnorderedMapWithHashCode<std::string, int> map;
   std::cout << "A:  Hash is " << map.GetHashCode() << std::endl;

   map.Put("peanut butter", 5);
   std::cout << "B:  Hash is " << map.GetHashCode() << std::endl;

   map.Put("jelly", 25);
   std::cout << "C:  Hash is " << map.GetHashCode() << std::endl;

   map.Remove("peanut butter");
   std::cout << "D:  Hash is " << map.GetHashCode() << std::endl;

   map.Remove("jelly");
   std::cout << "E:  Hash is " << map.GetHashCode() << std::endl;

   return 0;
}

I'm vaguely curious why you need `GetHashCode()` at all, but that's off topic — Mooing Duck, Jun 17 '21 at 22:19
I see no reason + or - should be a problem, but `^` may also give interesting results. — Mooing Duck, Jun 17 '21 at 22:21
Various reasons; one would be that it helps detect synchronization errors when I've got multiple computers that should be holding the same data structures after an update. I can send out the hash code value along with the update, and any computer whose local table's `GetHashCode()` return-value doesn't match the one included in the network-message knows that something has gone wrong and can log an error to that effect. — Jeremy Friesner, Jun 17 '21 at 22:21
I recommend splitting the hash function into 3 pieces: 1) Initialize; 2) Append; 3) Finish. The append would perform the calculations in the middle of the loop in Hash function (one one value). The Finish step would perform closing stuff, the publicize the result. — Thomas Matthews, Jun 17 '21 at 22:33
In most Hashing Code functions, there is a loop that takes a unit and performs some calculation on it, then repeats. It's this calculation that I am referring to. For example, in CRC, it could be "previous-value XOR present-value". — Thomas Matthews, Jun 17 '21 at 22:40
@ThomasMatthews I think that loop is inside the `std::hash<>()` function in this case, so outside the scope of this question? — Jeremy Friesner, Jun 17 '21 at 22:41
I'd like to extend @Thomas Matthews's proposal: We choose a hash function with state, like `xxHash`, for add operation: we update the hash value with `+{serializedkey}{serializedvalue}`, for remove operation: we update the hash value with `-{serializedkey}{serializedvalue}`; for `GetHashCode`, we return the current hash digest value. — prehistoricpenguin, Jun 18 '21 at 02:23
@prehistoricpenguin how does one negate the input to a function like `XXH64_update()` when doing a key/value removal? — Jeremy Friesner, Jun 18 '21 at 02:37
So you don't care about the history operation but only the current values in hashmap? Then my method seems not practical. — prehistoricpenguin, Jun 18 '21 at 02:42
The main thing is that a given set of key/value pairs in the map returns a consistent (and, as often as possible, unique) hash code, regardless of how the map got to the state it is currently in. — Jeremy Friesner, Jun 18 '21 at 02:48

score 1 · Accepted Answer · answered Jun 19 '21 at 15:32

Your concept's perfectly fine, just the implementation could be improved:

you could take the hash functions to use as template arguments that default to the relevant std::hash instantiations; note that for numbers it's common (GCC, Clang, Visual C++) for std::hash<> to be an identity hash, which is moderately collision prone; GCC and Clang mitigate that somewhat by having prime number of buckets (vs Visual C++'s power-of-2 choice), but you need to avoid having distinct key,value entries collide in the size_t hash-value space, rather than post-mod-bucket-count, so would be better off using a meaningful hash function. Similarly Visual C++'s std::string hash only incorporates 10 characters spaced along the string (so it's constant time), but if your key and value were both similar same-length long strings only differing in a few characters that would be horrible collision prone too. GCC uses a proper hash function for strings - MURMUR32.
return std::hash<KeyType>()(k) + std::hash<ValueType>()(v); is mediocre idea in general and an awful idea when using an identity hash function (e.g. h({k,v}) == k + v, so h({4,2}) == h({2,4}) == h({1,5}) etc.)
consider using something based on boost::hash_combine instead (assuming you do adopt the above advice to have template parameters provide the hash functions:
```
 auto key_hash = KeyHashPolicy(key);
 return (key_hash ^ ValueHashPolicy(value)) +
        0x9e3779b9 + (key_hash << 6) + (key_hash >> 2);
```

you could dramatically improve the efficiency of your operations by avoiding unnecessarily hash table lookups (your Put does 2-4 table lookups, and Remove does 1-3):

void Put(const KeyType& k, const ValueType& v)
{
    auto it = _map.find(k);
    if (it == _map.end()) {
        _map[k] = v;
    } else {
        if (it->second == v) return;
        _hash -= GetHashValueForPair(k, it->second);
        it->second = v;
    }
    _hash += GetHashValueForPair(k, v);
}

void Remove(const KeyType& k)
{
    auto it = _map.find(k);
    if (it == _map.end()) return;
    _hash -= GetHashValueForPair(k, it->second);
    _map.erase(it);
}

if you want to optimise further, you can create a version of GetHashValueForPair that returned the HashKeyPolicy(key) value and let you pass it in to avoid hashing the key twice in Put.

Your second and third points are particularly interesting to me (the other points are largely already addressed in my real/non-toy implementation) -- I'd like to replace the addition and subtraction with something like `hash_combine(`) and `hash_uncombine()`, respectively, but does such a thing as `hash_uncombine()` even exist? — Jeremy Friesner, Jun 19 '21 at 23:28
@JeremyFriesner: The `hash_combine()` approach should be used to combine the hashes of key and value because 1) they might individually be fairly weak hashes, and 2) they might use the same hash function, and you don't want `GetHashForPair(a,b)` to equal `GetHashForPair(b,a)`. Having done that you then have a reasonably strong hash that can be entirely reasonably added to / subtracted from other such hashes. (`hash_uncombine()` doesn't exist and wouldn't work for you - `hash_combine()` output depends on the order of inputs.) — Tony Delroy, Jun 20 '21 at 14:40

Best way to calculate a running hash for an unordered_map?

1 Answers1