3

I'm trying to debug a caching issue in our codebase, where we use std::unordered_map to hash a struct custom_key which is defined below, but I believe it should only be POD datatypes so I think whatever the C++ library implemented for the hash function will be used. I am wondering if this function is deterministic?

struct custom_key {
  // bunch of uint8_t variables
  // some int8_t variables
  // some boolean variables
  another_struct params;
}

struct another_struct {
  // an enum variable
  // some int variables
  int dimensions[5];
}

Our caching algorithm is

// some_input_key is coming in from a stream
std::unordered_map<custom_key, some_output_type, custom_hasher<custom_key>, custom_equal<custom_key>> ht;
if (ht.find(some_input_key) != ht.end()) {
  // do something
} else {
  ht[some_input_key] = ....
}

In our input stream, we have a set of custom_keys where there may be identical values, but we're unable to find some subsequent identical keys in the hash table.

e.g., our stream would be key1, key2, key3, key4, .... and suppose key1 and key4 are identical, sometimes it will not be able to find key4 in the hash table. I'm pretty sure the hashing function is deterministic, so I'm perplexed what could be causing the issue. I've printed the contents of both keys and they look identical.

Edit: it appears that we do have a custom hash function

template <typename T>
// based on Fowler–Noll–Vo hash function
struct custom_hasher {
  static_assert(std::is_pod<T>::value, "T is not POD");

  size_t operator()(const T& input) const {
    auto ptr = reinterpret_cast<const uint8_t*>(&input);
    uint32_t value = 0x811C9DC5;
    // we have an implementation for irange that does something similar to the one in python
    for (const auto i : irange((int)sizeof(T))) {
      value ^= ptr[i];
      value *= 0x01000193;
    }
    return (size_t)value;
  }
};

template <typename T>
struct custom_equal {
  static_assert(std::is_pod<T>::value, "T is not POD");

  bool operator()(const T& lhs, const T& rhs) const {
    auto ptr1 = reinterpret_cast<const uint8_t*>(&lhs);
    auto ptr2 = reinterpret_cast<const uint8_t*>(&rhs);
    return memcmp(ptr1, ptr2, sizeof(T)) == 0;
  }
};
24n8
  • 1,898
  • 1
  • 12
  • 25
  • 1
    Have you tried printing the hashes of both keys and seeing if they actually _are_ identical? By default it's `std::hash`, although if you've used a custom hash function that isn't shown we can but speculate about its effects. It's also possible that the `//do something` you elide over isn't having the effects you think it is. – Nathan Pierson Apr 26 '22 at 21:06
  • @NathanPierson I will check. I'm pretty sure we use `std::hash` but I will see if there's a custom defined hash function – 24n8 Apr 26 '22 at 21:07
  • 1
    Is there any implementation of `==` for the key? – fabian Apr 26 '22 at 21:12
  • Also, I tried to do `std::cout << std::hash{}(input_key) << std::endl; ` and I got the compilation `error: no match for call to ‘(std::hash<{anonymous}::custom_key>) ({anonymous}::custom_key&)’ std::cout << std::hash{}(input_key) << std::endl;` Is this saying there's no `std::hash` support for `input_key`? – 24n8 Apr 26 '22 at 21:17
  • @fabian Ah I see, it seems we do have a custom `==` and hash function for the key. Sorry about that. I will edit into the OP – 24n8 Apr 26 '22 at 21:21
  • My guess would be that `custom_key` has some padding in it, which is read by the custom hash function. I'm not sure if that's UB, but two keys having the same values in them may have different padding byte values, which would explain the weird behaviour. – IlCapitano Apr 26 '22 at 21:41
  • 1
    Your hasher looks like it'll be affected by padding between data fields of `custom_key`. Likewise your equality operator. – G.M. Apr 26 '22 at 21:41
  • Sorry, I'm not familiar with "padding". Is this related to data type alignment? (I didn't implement the hashing part, so I would need to consult with a colleague on this) – 24n8 Apr 26 '22 at 21:42
  • Does the order in which the elements in `custom_key` is defined/declared affect anything in this case? – 24n8 Apr 26 '22 at 21:44
  • There's a comment saying the hash function is based on Fowler–Noll–Vo hash function – 24n8 Apr 26 '22 at 21:47
  • 2
    [Why isn't sizeof for a struct equal to the sum of sizeof of each member?](https://stackoverflow.com/questions/119123/why-isnt-sizeof-for-a-struct-equal-to-the-sum-of-sizeof-of-each-member) talks about padding and alignment with some examples. – Retired Ninja Apr 26 '22 at 21:48
  • @RetiredNinja I'll look at this in more detail in a couple hours, but wouldn't `sizeof(Params)` in the hasher ignore padding? – 24n8 Apr 26 '22 at 21:54
  • 1
    No, `sizeof` includes the trailing padding. Even if it didn't, there might be padding between members, which obviously can't be ignored by playing with the size. – HolyBlackCat Apr 26 '22 at 21:58
  • 1
    No, that is the entire point of the question I linked. `struct foo { char a; int b; };`. `sizeof(foo) == 8` and there's 3 bytes of potentially uninitialized memory between `a` and `b`. If you have two instances the values of the members could be the same, but if you `memcmp` and that padding is different then the structs are different. – Retired Ninja Apr 26 '22 at 22:01
  • @RetiredNinja Ah I see. So in our example, the padding could be different between the two instances of `foo`? There is no set way that the instances will be padded? – 24n8 Apr 27 '22 at 00:05
  • Also note that we `memset` `params` to zero. I think if we memset the `custom_key` to zero, then this can resolve the problem? – 24n8 Apr 27 '22 at 00:15
  • Yes, one way to solve this is to *always* memset the struct to 0 before assigning values to the members. Hopefully it isn't binary serialized anywhere because that'll be fun to untangle... – Retired Ninja Apr 27 '22 at 00:25
  • @RetiredNinja Makes sense. Hoping that's not the case too. I was thinking about doing `memset` in a constructor for `custom_key` but the realized that would make `custom_key` no longer a POD. So I think I need to have a regular function that does the `memset`'ing after a `custom_key` object is created? Doesn't seem very elegant to me so not sure if there's another approach – 24n8 Apr 27 '22 at 00:29
  • `memset` is an ugly option... it's better to just work through the fields and make sure they're all initialised... the easiest way is to default-initialise them using trailing `{}` in the member definitions: `struct X { sometype v1{}; anothertype v2{}; };`, but you could add a constructor. Separately, you should also think about `int dimensions[5];` - are all elements always in use? If there's another member that tracks how many dimensions there are, then *don't* include the unused or previously-used values in the hash unless they're set back to a known value (e.g. 0). – Tony Delroy Apr 27 '22 at 16:52
  • @TonyDelroy I thought the purpose of `memset` was to initialize the padded values. I may not totally understand how the padded values work, but my understanding was that they have no variable name? e.g., in the `struct foo {char a; int b;}` case above, there's a 3 byte padding. How do we initialize those 3 bytes using your strategy? I think we want to avoid adding a custom constructor because that will make the a struct a non-POD. ditto on the `dimensions` thing – 24n8 Apr 27 '22 at 17:08
  • 1
    @24n8: you're right - `memset` could be used there, or you can change the `struct` to e.g. `{char a{}; char padding[3]{}; int32_t b{}; }`, or you can `#pragma pack(1)` to avoid padding, or you can force zero initialisation of everything including the padding when you create an instance (e.g. `foo x{};` or `foo* p = new foo{};` instead of `foo x;` or `foo* p = new foo;` (though these days people tend to use `make_unique` - I don't believe there's control over zero initialisation then. Yeah - not sure why POD is important, but could definitely limit your options. – Tony Delroy Apr 27 '22 at 18:40
  • @TonyDelroy Wait, `foo x{}` does zero initialization by default? I never knew that. What about `foo x()`? https://stackoverflow.com/questions/1069621/are-members-of-a-c-struct-initialized-to-0-by-default seems it does for structs. not sure if it applies to classes too. interesting, never knew that – 24n8 Apr 27 '22 at 18:57
  • @TonyDelroy you mean `foo* p = new foo{}` right? – 24n8 Apr 27 '22 at 19:02
  • 1
    @24n8: `class`es and `struct`s only vary in their defaults of private vs public inheritance and members, and there are few cases where you have to use one or other keyword (e.g. `template ` is ok but `template ` is not) or be consistent with your earlier declarations. I've listed all the differences in an old SO answer somewhere. Anyway, `foo* p = new foo{}` is both what I meant and what I wrote (later in the sentence it's contrasted with `foo* p = new foo;` which is the non-zero-initialisation version). Cheers – Tony Delroy Apr 27 '22 at 20:05

1 Answers1

1

cppreference.com:

For two parameters k1 and k2 that are equal, std::hash()(k1) == std::hash()(k2). For two different parameters k1 and k2 that are not equal, the probability that std::hash()(k1) == std::hash()(k2) should be very small, approaching 1.0/std::numeric_limitsstd::size_t::max().

std::unordered_map will use hash on its keys. Objects with the same HASH go into the same buckets. However, if two things have the same VALUE, then the second pair is ignored. If key1 and key4 are identical, then key4 and its value will be discarded.

What you want to use is std::unordered_multimap (see here): cppreference.com:

Unordered multimap is an unordered associative container that supports equivalent keys (an unordered_multimap may contain multiple copies of each key value) (emphasis added).

EDIT: As the others are saying, padding might be messing with your hash result. An alternative is to individually hash each member of the struct and then combine the hashes.

Captain Hatteras
  • 490
  • 3
  • 11