First of all, sizeof(uint64_t)
is 8
on machines with 8-bit char
. You're only going to get the first 8 bytes of strings, not 16. IDK if you're mixing this up with an 8-byte integer printing as 16 hex digits, where each 4 bits of a number map to an 8-bit ASCII character, or some other mistake, but a 16-character string is 128 bits on a normal system, so you'd need x86 __m128i
(SSE2 integer vector of 16 bytes) or GCC unsigned __int128
.
If your string values are sparse in their first 8 bytes, a compiler will probably just make a chain of conditional branches, not a hash table of jumps.
(Or a hash table of data. Transforming control-flow to data-lookup is something compilers can sometimes do for switch
, but AFAIK current compilers like GCC and clang only use plain arrays when the switch cases are mostly contiguous and in a small range. So you'd still have a chain of branches because of your sparse 64-bit integers.)
Anyway, the optimal implementation in asm is probably a hash table, so you should just write your source code to do that instead of a switch
, as Mark Ransom commented. Use std::unordered_map
.
Another comment also suggested gperf
, the GNU Perfect Hash function generator. Given a set of strings (like the keywords of a language), it generates code that detects them and rejects other tokens, with no false positives or negatives. That might be worth considering.
std::unordered_map
can use string keys. The constants to match against would have to get hashed at compile time, or once at run-time during construction, and the incoming string would have to get hashed.
If the strings are unique in their first 8 bytes (not 16), you might use std::unordered_map<uint64_t, int>
if that makes key faster to hash than arbitrary-length strings. And it means the full strings wouldn't have to get stored anywhere, just the prefixes. Run-time init of a std::unordered_map
doesn't require compile-time constants, so you don't need a constexpr
-compatible way to take the first 8 bytes of strings.
But *reinterpret_cast<uint64_t*>("literal")
is strict-aliasing undefined behaviour unless you're compiling with clang/gcc -fno-strict-aliasing
, or with MSVC. Even if the aliasing behaviour is defined, it's still not usable in a constexpr
.
C++20 std::bit_cast
is the go-to for type-punning data (and is constexpr
), but you might need your data in a struct of 8 bytes, or an 8-byte array, because it checks that the two types have the same size. So you might need to do struct eightbyte {char str[8];}
and manually truncate your strings for initializing it if you wanted to do something that was fully constexpr.
memcpy
works well and gets fully optimized; not in a constexpr
compatible way, but with optimization enabled does in fact become a constant. Not one you could use in a switch
or as the size of an array (because it's not constexpr
/ consteval
), but fine for optimization purposes.
#include <string.h>
#include <stdint.h>
// not usable in a switch / case because this isn't constexpr compatible,
// but usable for run-time init of std::unordered_map<uint64_t, int>
uint64_t str_as_u64(const char *p)
{
uint64_t tmp;
memcpy(&tmp, p, sizeof(tmp));
return tmp;
}
uint64_t test(){
return str_as_u64("hello world");
}
GCC (Godbolt) optimized the test
function to returning an immediate constant, no actual loading / storing and no string data hanging around in memory anywhere:
# GCC12.2 -O3 for x86-64
test():
movabsq $8031924123371070824, %rax # 0x6F77206F6C6C6568
ret
Further constant-propagation through hash functions might also happen; I didn't look at constructing a std::unordered_map
.