
I have a struct which consists of five std::uint8_t members. My software doesn't support 32-bit builds, only 64-bit builds. I want to use my struct as a key in an unordered map. Can I just add three extra bytes to the struct to fill a full 64 bits and cast the struct to size_t in order to get a hash safely? Like this:

struct MyStruct
{
    std::uint8_t v1 = 0;
    std::uint8_t v2 = 0;
    std::uint8_t v3 = 0;
    std::uint8_t v4 = 0;
    std::uint8_t v5 = 0;
    std::uint8_t pack1 = 255;
    std::uint8_t pack2 = 255;
    std::uint8_t pack3 = 255;
};

namespace std
{
template <>
struct hash<MyStruct>
{
    size_t operator()(const MyStruct &s) const
    {
        static_assert(sizeof(size_t) == 8);
        static_assert(sizeof(MyStruct) == 8);
        const size_t *memptr = reinterpret_cast<const size_t*>(&s);
        return *memptr;
    }
};
}
fschmitt
  • You have five bytes – how did you handle them on 32 bit? You actually had the same problem there, too, didn't you? – Aconcagua May 24 '23 at 08:06
  • I'm pretty sure this `reinterpret_cast` is not legal, by the way, since it violates the strict aliasing rules... – Aconcagua May 24 '23 at 08:08
  • Unrelated: Consider using [`std::bit_cast`](https://en.cppreference.com/w/cpp/numeric/bit_cast), if available, since this also ensures the sizes of the source and target types are the same. If it isn't available, you should definitely add a `static_assert` to check for equal sizes; otherwise you may at some point use a compiler with a non-64-bit `size_t` without thinking about this specific part of the code, possibly making the hash algorithm much worse depending on the value distribution without you realizing it... – fabian May 24 '23 at 08:10
  • This is new code and never ran under 32 bit – fschmitt May 24 '23 at 08:11
  • There might also be problems with alignment in your code. As an alternative to C++20 `std::bit_cast`, you can use `memcpy`. – Daniel Langr May 24 '23 at 08:13
  • Just do some relatively-prime arithmetic on the fields most likely to change, e.g. `s.v1*17 + s.v2*13 ...` (see the sketch after these comments). Just find a combination that is likely to result in a unique hash. If you think that is not efficient enough, do some profiling and MEASURE. In the end, the "uniqueness" of your hash is more important than it being the fastest to calculate (a hash collision is way more costly). – Pepijn Kramer May 24 '23 at 09:02
  • @PepijnKramer I would be much more hesitant to tell people to cook up their own hashes. In fact, your proposed hash is quite likely to fail catastrophically on the collision front. – Passer By May 24 '23 at 10:05
  • @PasserBy Yes, that's why I mention the fact that you need to find a function that is likely to generate a unique hash, and that requires knowledge of the input. With perfect knowledge it should be possible to have a go at finding a perfect hash function. – Pepijn Kramer May 24 '23 at 10:24
  • @PepijnKramer As long as the structure has less bits than the hash value, direct collisions are not an issue. – nielsen May 24 '23 at 10:29
  • @nielsen yes you are completely right. Assuming you have a hashtable of big enough size (e.g. with enough buckets to account for all the bits) – Pepijn Kramer May 24 '23 at 10:38
  • @PepijnKramer That's a big problem. Not everyone is a hashing expert, but doing it wrong is catastrophic. – Passer By May 24 '23 at 10:43
  • @PepijnKramer Yes, hence the wording "direct collisions". There is definitely a potential risk of buckets being unbalanced. Whether or not it may be an issue in the case of the OP we cannot say. Deciding on a good hash function requires (or at least has a much better chance of being successful with) knowledge of the specific data to be hashed. – nielsen May 24 '23 at 11:27
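
For illustration, here is a minimal sketch of the field-combining idea from Pepijn Kramer's comment (my sketch, not his code; the odd multipliers are arbitrary and, as the later comments warn, a hand-rolled hash like this must be validated against real data):

    // Hypothetical hand-rolled combiner; distribution quality depends
    // entirely on the actual key values, so measure before trusting it.
    size_t operator()(const MyStruct &s) const
    {
        return std::size_t{s.v1} * 17 + std::size_t{s.v2} * 13
             + std::size_t{s.v3} * 11 + std::size_t{s.v4} * 7
             + std::size_t{s.v5} * 3;
    }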

3 Answers

The proposed solution is prone to undefined behavior, at least because memptr may not fulfil the alignment requirement of a size_t.

A better alternative is to use memcpy:

    #include <algorithm>  // std::min
    #include <cstring>    // std::memcpy

    size_t operator()(const MyStruct &s) const
    {
        size_t result = 0;
        // Copy at most sizeof(size_t) bytes of the struct's representation.
        std::memcpy(&result, &s, std::min(sizeof(s), sizeof(size_t)));
        return result;
    }

This should work regardless of size differences between the struct and size_t, but of course it will only distinguish two structs by their first sizeof(size_t) bytes.

It should not be a problem with your struct, but generally, you have to be aware that a struct may contain padding bytes with uncontrolled values that may mess up the hash result.
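
One way to catch such padding at compile time (my addition, not part of the answer) is the C++17 trait std::has_unique_object_representations_v, which is false for trivially copyable types that contain padding bits:

    #include <type_traits>

    // Fails to compile if MyStruct contains padding bytes whose
    // uncontrolled values would leak into the byte-wise hash above.
    static_assert(std::has_unique_object_representations_v<MyStruct>);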

nielsen
  • But isn't padding included in sizeof? So if sizeof(MyStruct) == sizeof(size_t) and I check this with the static_assert above, I don't see how it can fail (at least with the compilers my continuous integration platform checks). – fschmitt May 24 '23 at 09:36
  • @fschmitt Yes, with the check of the struct size you make sure that there is no padding and as said it is not likely to be an issue with your struct. The remark was meant for the more general case of interpreting the byte representation of an arbitrary struct. – nielsen May 24 '23 at 10:26
  • Is there a guarantee that the padding bits are always set consistently? Otherwise, if your structure has padding, this relies on the padding bits to be consistent across the entire program. – Dave S May 24 '23 at 10:27
  • @DaveS No there isn't. – Passer By May 24 '23 at 11:15
  • @DaveS That is the potential issue I wanted to flag with this remark. To my understanding, you can only rely on the [padding bytes to be zero](https://stackoverflow.com/questions/37642026/does-c-initialize-structure-padding-to-zero) if the struct was originally initialized. – nielsen May 24 '23 at 11:17

You need the type as a key for standard unordered containers, and it is smaller than std::size_t. Therefore the bit pattern itself can be used as a perfect hash function. You do not need strange techniques for it:

std::size_t hash_res = 0;
for (auto byte : std::to_array<std::uint8_t>({s.v1, s.v2, s.v3, s.v4, s.v5}))
    hash_res = (hash_res << 8) | byte;  // std::to_array needs C++20 <array>

If the compiler can't optimize that loop well, the more direct (though more verbose) version uses std::bit_cast:

auto const hash_res =
    std::bit_cast<std::size_t>(
        std::array<std::uint8_t, sizeof(std::size_t)>{s.v1, s.v2, s.v3, s.v4, s.v5});

To wrap things up:

return std::hash<std::size_t>{}(hash_res);

This last step only changes the result if std::hash on built-in integrals is anything other than the identity. If you prefer a perfect hash, you can skip the rehash and return hash_res directly, since rehashing through a non-identity std::hash could truncate or fold the value and so introduce a collision probability.
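
For context, here is a sketch of the intended use. Note that std::unordered_map additionally requires operator== for the key type; the question didn't show one, so the operator below is assumed for illustration:

    #include <cstdint>
    #include <unordered_map>

    bool operator==(const MyStruct &a, const MyStruct &b)
    {
        // Compare the value members; the pack members always hold 255
        // per the defaults, so they cannot differ between keys.
        return a.v1 == b.v1 && a.v2 == b.v2 && a.v3 == b.v3
            && a.v4 == b.v4 && a.v5 == b.v5;
    }

    std::unordered_map<MyStruct, int> counts;  // picks up std::hash<MyStruct>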

Red.Wave
  • Exactly. If your potential key size is 64 bits and your actual key is only 40 bits, then there's no need for hashing at all. Just use the actual key. – Jim Mischel May 25 '23 at 13:52
  • @JimMischel That's what the OP's snippet seems to be trying to do. I am putting in less effort to achieve the same with more readability. – Red.Wave May 25 '23 at 15:55

Can I just add three extra bytes to the struct to fill a full 64 bits and cast the struct to size_t in order to get a hash safely?

No - as others have mentioned, you'll have undefined behaviour, both because MyStruct may be less strictly aligned than size_t and because of the strict aliasing rules (you may only inspect an object's bytes through char*, unsigned char* or std::byte*). As of C++20, std::bit_cast is the recommended way to do this: std::bit_cast<size_t>(some_MyStruct_object).

While the above's already been said by Red.Wave and nielsen, Red.Wave mentioned:

This will modify the result in case std::hash on built-in integrals is anything other than identity.

In practice, std::hash<size_t> - to the best of my knowledge - is an identity hash function in clang, GCC, and MSVC++. Certainly in current and all vaguely recent versions of clang and GCC (I've just rechecked on godbolt). Thankfully they use prime numbers for bucket counts, so it doesn't matter there. But MSVC++ has historically (and I imagine still does, though godbolt won't execute code under MSVC++) used powers of two for bucket counts, so there it does matter.
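
A quick probe (mine, purely illustrative) to check what your own standard library does:

    #include <cstddef>
    #include <functional>
    #include <iostream>

    int main()
    {
        // Prints the same value twice if std::hash<std::size_t> is the
        // identity, as it currently is in libstdc++ and libc++.
        std::size_t x = 0x0123456789abcdefULL;
        std::cout << x << '\n' << std::hash<std::size_t>{}(x) << '\n';
    }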

On MSVC++ and any other implementation with power-of-two bucket counts, the simple bit_cast approach will create terrible hash table collisions. When the hash function returns a number, it is folded into the bucket count by masking with the number of buckets minus 1, which effectively uses only however many least-significant bits are necessary to identify a bucket (for 64 buckets -> 6 bits, for 128 buckets -> 7 bits, etc.).

To try to make this clearer, say your MyStruct object has values {ab, cd, ef, gh, ij, pad1, pad2, pad3} - where the two-letter combinations represent 2-digit hex values of your uint8_ts - and your hash table bucket_count is currently 256. You hash your object and end up with - if your system is little endian - FFFF'FFij'ghef'cdab. Then you mask away all but the low-order 8 bits to get a 0..255 bucket index. Only that byte - ab - from your MyStruct object will affect the bucket you hash/mask to. If your data was {1, 2, 3, 4, 5}, {1, 202, 18, 48, 2}, {1, 7, 27, 87, 85}, {1, 48, 26, 58, 16} -> all those entries would collide at bucket 1, and your hash table then performs like a linked list. And if - with your endianness - the padding bytes had instead landed in the least significant bit positions of the size_t, the bucket index would be computed entirely from constant padding, contributing nothing to randomising/dispersing your bucket usage.

While it's reasonable to first generate a size_t value from MyStruct with a bit_cast, you may want to then perform some actual, meaningful hashing on it. As mentioned, you typically can't simply invoke std::hash<size_t>() on it, as that's often an identity hash. So, find an SO question or reference with a decent hash for size_t, or use something like the Intel CRC instruction _mm_crc32_u64.
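
As one portable option (a sketch of the widely used splitmix64 finalizer, not something this answer specifically prescribes), you could mix the bit_cast result before returning it:

    #include <cstdint>

    // splitmix64 finalizer (Sebastiano Vigna): every input bit influences
    // every output bit, so even the low bits kept by a power-of-two
    // bucket mask carry entropy from all five value bytes.
    std::uint64_t mix64(std::uint64_t x)
    {
        x ^= x >> 30;
        x *= 0xbf58476d1ce4e5b9ULL;
        x ^= x >> 27;
        x *= 0x94d049bb133111ebULL;
        x ^= x >> 31;
        return x;
    }

The hash operator would then end with something like return mix64(std::bit_cast<std::uint64_t>(s));.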

(Because these things are tricky and implementation choices sometimes surprising, when you have reason to care about performance, it's generally a good idea to measure collision chain lengths with your data and hash function, to ensure you don't have unexpected collision rates.)
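
For example, a minimal sketch of such a measurement using the container's own bucket interface (longest_chain is a hypothetical helper name):

    #include <algorithm>  // std::max
    #include <cstddef>

    // Reports the longest collision chain in any unordered container.
    template <typename Map>
    std::size_t longest_chain(const Map &m)
    {
        std::size_t longest = 0;
        for (std::size_t b = 0; b < m.bucket_count(); ++b)
            longest = std::max<std::size_t>(longest, m.bucket_size(b));
        return longest;
    }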

Tony Delroy