Can I just add three extra bytes to the struct to fill full 64 bit and cast the struct to size_t in order to get a hash safely?
No - as others have mentioned, you'll have undefined behaviour, both because MyStruct may have a different alignment requirement than size_t and because of aliasing (the only pointer types through which you can safely access another object's bytes are char*, unsigned char* and std::byte*). As of C++20, std::bit_cast is the recommended way to do this: std::bit_cast<size_t>(some_MyStruct_object). It requires both types to be trivially copyable and the same size, so the padded struct must be exactly sizeof(size_t) bytes.
While the above's already been said by Red.Wave and nielsen, Red.Wave mentioned:
This will modify the result in case std::hash on built-in integrals is anything other than identity.
In practice, std::hash<size_t> - to the best of my knowledge - is an identity hash function in clang, GCC, and MSVC++. It certainly is in current and all vaguely recent versions of clang and GCC (I've just rechecked on godbolt). Thankfully, their standard libraries use prime numbers for the bucket count, so it doesn't matter. But MSVC++ has historically (and I imagine still, though godbolt won't execute code under MSVC++) used powers of two for the bucket count, so there it does matter.
On MSVC++ and any other implementation with a power-of-two bucket count, the simple bit_cast approach will create terrible hash table collisions. When the hash function returns a number, it is reduced to a bucket index by masking with bucket_count - 1, which effectively uses only as many of the least-significant bits as are necessary to identify a bucket (for 64 buckets -> 6 bits, for 128 buckets -> 7 bits etc.).
To try to make this clearer, say your MyStruct object has values {ab, cd, ef, gh, ij, pad1, pad2, pad3} - where the two-letter combinations stand for the 2-digit hex values of your uint8_ts - and your hash table bucket_count is currently 256. You hash your object and end up with - if your system is little endian and the padding bytes happen to be FF - FFFF'FFij'ghef'cdab. Then you keep only the low-order 8 bits to get a 0..255 bucket index. Only that byte - ab - from your MyStruct object will affect the bucket you hash/mask to. If your data was {1, 2, 3, 4, 5}, {1, 202, 18, 48, 2}, {1, 7, 27, 87, 85}, {1, 48, 26, 58, 16}, all those entries would collide at bucket 1. Your hash table then performs like a linked list. If - with your endianness - the padding bytes end up in the less significant bit positions of the size_t, they won't contribute in the slightest to randomising/dispersing your bucket usage.
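The collision above can be checked with a few lines of code. This is only an illustration: pack_le and bucket_index are hypothetical helper names, pack_le builds the value a little-endian bit_cast would produce (with zero padding), and bucket_index mimics the masking a power-of-two table is assumed to do.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Pack five bytes the way a little-endian bit_cast would see them:
// the first byte lands in the least significant position.
constexpr std::uint64_t pack_le(std::array<std::uint8_t, 5> bytes) {
    std::uint64_t value = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i)
        value |= std::uint64_t{bytes[i]} << (8 * i);
    return value;
}

// Reduce a hash to a bucket index as a power-of-two table would:
// keep only the low bits. bucket_count must be a power of two.
constexpr std::uint64_t bucket_index(std::uint64_t hash,
                                     std::uint64_t bucket_count) {
    return hash & (bucket_count - 1);
}

// All four example keys share the first byte, so with 256 buckets and an
// identity hash they all land in bucket 1.
static_assert(bucket_index(pack_le({1, 2, 3, 4, 5}), 256) == 1);
static_assert(bucket_index(pack_le({1, 202, 18, 48, 2}), 256) == 1);
static_assert(bucket_index(pack_le({1, 7, 27, 87, 85}), 256) == 1);
static_assert(bucket_index(pack_le({1, 48, 26, 58, 16}), 256) == 1);
```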
While it's reasonable to first generate a size_t value from MyStruct with a bit_cast, you may want to then perform some actual, meaningful hashing on it. As mentioned, you typically can't simply invoke std::hash<size_t>() on it, as that's often an identity hash. So, find an SO question or reference with a decent hash for size_t, or use something like the Intel SSE4.2 CRC32 intrinsic _mm_crc32_u64.
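As one example of a "decent hash" for a 64-bit value, here is a sketch of the finalizer used by the splitmix64 generator. This is my choice for illustration, not something the question or the other answers prescribe, and mix64 is a hypothetical name; any well-tested integer mixer would serve the same purpose.

```cpp
#include <cstdint>

// splitmix64-style finalizer: alternating xor-shifts and multiplications by
// odd constants, so every input bit can influence the low output bits that
// a power-of-two table actually uses. Each step is invertible, so distinct
// inputs always give distinct outputs.
constexpr std::uint64_t mix64(std::uint64_t x) {
    x ^= x >> 30;
    x *= 0xbf58476d1ce4e5b9ULL;
    x ^= x >> 27;
    x *= 0x94d049bb133111ebULL;
    x ^= x >> 31;
    return x;
}
```

In a custom hasher you would apply this to the bit_cast result and return it (converted to size_t) instead of returning the raw bits.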
(Because these things are tricky and implementation choices sometimes surprising, when you have reason to care about performance, it's generally a good idea to measure collision chain lengths with your data and hash function, to ensure you don't have unexpected collision rates.)
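A rough way to take that measurement, using only the standard bucket interface of the unordered containers (bucket_count and bucket_size); longest_chain is a hypothetical helper name:

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>

// Return the number of elements in the fullest bucket. Works with any
// container exposing bucket_count() and bucket_size(), such as
// std::unordered_set or std::unordered_map.
template <class Table>
std::size_t longest_chain(const Table& table) {
    std::size_t longest = 0;
    for (std::size_t i = 0; i < table.bucket_count(); ++i)
        longest = std::max(longest,
                           static_cast<std::size_t>(table.bucket_size(i)));
    return longest;
}
```

Fill the table with representative data using your hasher, then compare longest_chain(table) with table.size() / table.bucket_count(); a longest chain far above the average suggests the hash is clustering.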