C++ switch statement with case values of reinterpret_cast(string)

Question

I receive a string at run-time and I want to convert it to an integer. Rather than just reinterpret-casting the string and using this integer (the integers will be too sparse) I am trying to use a switch statement (see below). However, unfortunately it doesn't seem possible as I'm getting compiler errors about unexpected syntax.

Is it possible to achieve this? Isn't the value of a reinterpret_cast on a string literal known at compile time?

// I receive a string at run-time and convert to integer
const uint64_t val = *reinterpret_cast<uint64_t*>(str.c_str());

switch(val)
{
    case (*reinterpret_cast<uint64_t*>("16 byte string1")):   // Not the actual string, just an example
        return 1;
    case (*reinterpret_cast<uint64_t*>("16 byte string2")):
        return 2;
    // Another 100+ similar cases
}

The syntax problem is fixable, but you know that you can only fit 8 characters in a `uint64_t`, right? For the strings you showed, that's already a problem: the first 8 characters are identical. – harold Sep 17 '22 at 17:16
Are you sure that `reinterpret_cast` can convert a 16-byte string into an 8-byte integer? – Iłya Bursov Sep 17 '22 at 17:17
Maybe you need https://stackoverflow.com/questions/42356939/c-convert-string-to-uint64-t ? – Iłya Bursov Sep 17 '22 at 17:18
@harold Good point. Looks like I need to do a direct string comparison, maybe using SIMD though. – intrigued_66 Sep 17 '22 at 17:21
@IłyaBursov My string is not a number. It's a non-integer ID that I want to convert to an integer. – intrigued_66 Sep 17 '22 at 17:23
SIMD comparison would do it, but then you'd have about a hundred of them, in the worst case executing all of them. Perhaps you can design a good hash function, if you know the IDs? – harold Sep 17 '22 at 17:26
@harold Originally I had 3 IDs and did it manually. Now I've got about 200 and didn't even attempt it, but perhaps I will have to. – intrigued_66 Sep 17 '22 at 17:28
You could also use some nested switches that each take only a couple of characters (like a trie, but in code, without a data structure). – harold Sep 17 '22 at 17:31
Have you heard of [`gperf`](https://www.gnu.org/software/gperf/)? – Alex Reinking Sep 17 '22 at 17:32
I think `std::map` or `std::unordered_map` would be much more suited to this task than a `switch` statement. – Mark Ransom Sep 17 '22 at 17:40
Are you compiling with `gcc -fno-strict-aliasing`, or with MSVC or something? Otherwise that deref of a `uint64_t*` is strict-aliasing UB. (Use `memcpy` or `std::bit_cast(struct_of_8_bytes)` or something.) – Peter Cordes Sep 17 '22 at 20:56
The problem is that a `reinterpret_cast` expression is not a core constant expression, and therefore cannot be used where a constant expression is required. – mada Sep 18 '22 at 06:36

user17732522 · Answer 1 · 2022-09-17T22:00:04.603

3

The case operand must be a constant expression. reinterpret_cast can never be evaluated in a constant expression. Therefore this is not going to work and you will have to fall back to an if/else if/else chain or implement something more clever (e.g. a jump table) yourself. switch is not going to be usable for this.

That is, of course, only as long as you intend to use reinterpret_cast. If you instead write, for example, a constexpr function that converts the first few characters to a uint64_t via arithmetic and bitwise operations, then you can use that function in a case operand.
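A minimal sketch of such a function, assuming C++17 (for `std::string_view`). The names `prefix_u64` and `id_to_int` and the two IDs are hypothetical placeholders, not part of the question:

```cpp
#include <cstddef>
#include <cstdint>
#include <string_view>

// Hypothetical helper: packs up to the first 8 bytes of s into a uint64_t,
// first character in the least significant byte. It uses only shifts and ORs,
// so it can be evaluated in a constant expression.
constexpr std::uint64_t prefix_u64(std::string_view s)
{
    std::uint64_t v = 0;
    for (std::size_t i = 0; i < s.size() && i < 8; ++i)
        v |= static_cast<std::uint64_t>(static_cast<unsigned char>(s[i])) << (8 * i);
    return v;
}

// Placeholder IDs. The real ones must differ within their first 8 bytes,
// otherwise the compiler rejects the duplicate case labels.
constexpr int id_to_int(std::string_view id)
{
    switch (prefix_u64(id)) {
        case prefix_u64("alpha-01"): return 1;
        case prefix_u64("bravo-02"): return 2;
        default:                     return -1;   // arbitrary "unknown" value
    }
}
```

A matching 8-byte prefix does not prove that the whole string matches, so if unknown IDs can arrive, each case would still need a full comparison before returning.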

Using this instead of reinterpret_cast would also get rid of *reinterpret_cast<uint64_t*>(str.c_str()), which by itself is an aliasing violation and therefore has undefined behavior per the standard. Some compilers, either by default or when given specific compilation flags, will however define the behavior of such aliasing violations to be what you intend. (However, core language undefined behavior per the standard, even if defined by the specific compiler, is not allowed in a constant expression. Also, even if aliasing is not an issue, alignment may still be.)

edited Sep 17 '22 at 22:00

answered Sep 17 '22 at 17:23

Thank you for your answer. Could you elaborate on the aliasing violation part? – intrigued_66 Sep 17 '22 at 17:24
@mezamorphic You are trying to access a `char` object (which is part of the string) through a pointer to `uint64_t`. Apart from a few specific exceptions, it is never allowed to access an object through a pointer to a different type, which is why `reinterpret_cast` to any pointer type other than `char*`/`unsigned char*`/`std::byte*` is more likely than not going to be UB. – user17732522 Sep 17 '22 at 17:25
If an implementation defines the behaviour, it's no longer undefined, and thus *would* be allowed in a constant expression. The ISO standard doesn't forbid implementations from accepting `INT_MAX + 1` as a constant, e.g. when compiled with `gcc -fwrapv`. The ISO standard does forbid `reinterpret_cast` in constant expressions and `gcc -fno-strict-aliasing` doesn't override that, but that's separate from the dereference being well-defined (assuming that the string literal happens to have `alignof(uint64_t)`!) – Peter Cordes Sep 17 '22 at 21:01
@PeterCordes I do not think that this is correct. The standard specifies that an expression which on evaluation would have undefined behavior as specified in the core language part of the standard is not a (core) constant expression (https://eel.is/c++draft/expr.const#5.8). And in contexts in which constant expressions are required an expression which is not a constant expression makes the program ill-formed with a diagnostic required. Of course (as always) that doesn't mean that the program must be rejected though. – user17732522 Sep 17 '22 at 21:47
@user17732522: Interesting, seems you're right for GCC and clang `-fwrapv`. https://godbolt.org/z/9v8s7ah8K - Signed integer overflow prevents `(INT_MAX+INT_MAX) & 0xff` from being a constant expression (0xfe), even though `-fwrapv` defines the behaviour as wrapping. (So `-fsanitize=undefined` doesn't flag it, but it does without -fwrapv). I'd have expected that defining the behaviour would make it no longer count as "UB as specified in [intro] through [cpp]", as in `-fwrapv` would compile a dialect defined by a modification of the standard. (Or something equivalent to that.) – Peter Cordes Sep 17 '22 at 22:19
@PeterCordes https://eel.is/c++draft/expr.const#5.31 makes it unspecified whether an expression that would have undefined behavior according to the library clauses is a constant expression or not. If the intent had been to allow compilers to extend the set of constant expressions like that, then this distinction between core language and library wouldn't have been made. Unfortunately I don't think there is any rationale from the committee available for this. [DR 1313](https://wg21.link/cwg1313) and [DR 695](https://wg21.link/cwg695) which introduced this just state that it "should" be so. – user17732522 Sep 17 '22 at 22:56
However whether an expression is a constant expression or not affects overload resolution, so I suppose part of the intention is to have some consistency on that between compilers. – user17732522 Sep 17 '22 at 23:00
Does ISO C++ ever say anything about the possibility of implementations defining the behaviour of things it explicitly says are undefined? In terms of real-world usability, I don't think it would violate the spirit of the standard if GCC and clang `-fwrapv` had decided that defining the behaviour of signed-integer wrapping was a change to the core language rules. It's clearly not a library feature. Oh, but good point about overload resolution; that's a good reason for compilers to choose not to do that, especially for an *optional* feature that's not on by default. – Peter Cordes Sep 17 '22 at 23:02
@PeterCordes I guess https://eel.is/c++draft/intro#compliance.general-8? But I don't think that actually says anything of relevance. It still requires the same diagnostics of ill-formed programs and same behavior of well-formed programs in the presence of language extensions. – user17732522 Sep 17 '22 at 23:22
Right, yeah, UB ruling out something being a constant expression is a weird requirement. If there was UB during compile-time eval, absolutely anything is allowed to happen, including picking the "wrong" overload. Or in GCC's case, complaining that the array size "exceeds maximum object size '9223372036854775807'" (with or without `-fwrapv`; in the with case, that might be a GCC bug at least in terms of a wrong error message since the same expression is a well-defined 254 when a constexpr isn't required.) Well/ill-formed is different from UB-free or not. – Peter Cordes Sep 17 '22 at 23:30
@PeterCordes _Ill-formed_ requires a diagnostic to be issued, UB does not. Then there is also _ill-formed, no diagnostic required_ which basically has the same effect as UB. – user17732522 Sep 17 '22 at 23:41
I think the idea here is more to have the compile-time language be as consistent as possible between compilers and have the language be more portable. Under this viewpoint there is no negative to diagnosing UB in constant expressions. Because the set of constant expressions has been increased so far, this can now actually be used in interesting ways, e.g. to have unit tests run as part of the compilation process with built-in UB and leak detection (and a wider coverage than UBsan would have). – user17732522 Sep 17 '22 at 23:41
@PeterCordes Also, I am not sure whether that came across correctly, but what I meant is that the compile-time constness of an expression can simply be directly used to execute different logic in multiple ways. For example `if consteval` can be used to detect and branch on whether the current function is executed as part of a constant expression. `requires { typename dummy<((void)(E),0)>; }` with `template<int> struct dummy {};` is an expression that evaluates to true or false depending on whether arbitrary expressions `E` are constant expressions and can be used to disable overloads, etc. – user17732522 Sep 17 '22 at 23:49

score 1 · Answer 2 · answered Sep 17 '22 at 21:36

First of all, sizeof(uint64_t) is 8 on machines with 8-bit char. You're only going to get the first 8 bytes of strings, not 16. IDK if you're mixing this up with an 8-byte integer printing as 16 hex digits, where each 4 bits of a number map to an 8-bit ASCII character, or some other mistake, but a 16-character string is 128 bits on a normal system, so you'd need x86 __m128i (SSE2 integer vector of 16 bytes) or GCC unsigned __int128.

If your string values are sparse in their first 8 bytes, a compiler will probably just make a chain of conditional branches, not a hash table of jumps. (Or a hash table of data. Transforming control-flow to data-lookup is something compilers can sometimes do for switch, but AFAIK current compilers like GCC and clang only use plain arrays when the switch cases are mostly contiguous and in a small range. So you'd still have a chain of branches because of your sparse 64-bit integers.)

Anyway, the optimal implementation in asm is probably a hash table, so you should just write your source code to do that instead of a switch, as Mark Ransom commented. Use std::unordered_map.

Another comment also suggested gperf, the GNU Perfect Hash function generator. Given a set of strings (like the keywords of a language), it generates code that detects them and rejects other tokens, with no false positives or negatives. That might be worth considering.

std::unordered_map can use string keys. The constants to match against would have to get hashed at compile time, or once at run-time during construction, and the incoming string would have to get hashed.

If the strings are unique in their first 8 bytes (not 16), you might use std::unordered_map<uint64_t, int> if that makes keys faster to hash than arbitrary-length strings. It also means the full strings wouldn't have to get stored anywhere, just the prefixes. Run-time init of a std::unordered_map doesn't require compile-time constants, so you don't need a constexpr-compatible way to take the first 8 bytes of strings.

But *reinterpret_cast<uint64_t*>("literal") is strict-aliasing undefined behaviour unless you're compiling with clang/gcc -fno-strict-aliasing, or with MSVC. Even if the aliasing behaviour is defined, it's still not usable in a constexpr.

C++20 std::bit_cast is the go-to for type-punning data (and is constexpr), but you might need your data in a struct of 8 bytes, or an 8-byte array, because it checks that the two types have the same size. So you might need to do struct eightbyte {char str[8];} and manually truncate your strings for initializing it if you wanted to do something that was fully constexpr.

memcpy works well and gets fully optimized. It isn't constexpr-compatible, but with optimization enabled the result does in fact become a constant: not one you could use in a switch or as the size of an array (because it's not constexpr / consteval), but fine for optimization purposes.

#include <string.h>
#include <stdint.h>

// not usable in a switch / case because this isn't constexpr compatible,
// but usable for run-time init of std::unordered_map<uint64_t, int>
uint64_t str_as_u64(const char *p)
{
   uint64_t tmp;
   memcpy(&tmp, p, sizeof(tmp));
   return tmp;
}

uint64_t test(){
    return str_as_u64("hello world");
}

GCC (Godbolt) optimized the test function to return an immediate constant, with no actual loading / storing and no string data hanging around in memory anywhere:

# GCC12.2 -O3 for x86-64
test():
        movabsq $8031924123371070824, %rax   # 0x6F77206F6C6C6568
        ret

Further constant-propagation through hash functions might also happen; I didn't look at constructing a std::unordered_map.
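For completeness, a minimal sketch of run-time initialisation of a std::unordered_map<uint64_t, int> keyed on 8-byte prefixes. The names `prefix_key`, `make_table` and `lookup`, and the two IDs, are hypothetical. It assumes the real IDs are unique in their first 8 bytes.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>
#include <unordered_map>

// memcpy-based prefix extraction. Inputs shorter than 8 bytes are zero-padded
// rather than read past their end.
std::uint64_t prefix_key(const std::string& s)
{
    std::uint64_t key = 0;
    std::memcpy(&key, s.data(), s.size() < sizeof key ? s.size() : sizeof key);
    return key;
}

// Builds the prefix -> integer table once at run time; no constexpr needed.
std::unordered_map<std::uint64_t, int> make_table()
{
    static const char* const ids[] = {"alpha-01 payload", "bravo-02 payload"};  // placeholders
    std::unordered_map<std::uint64_t, int> table;
    int next = 1;
    for (const char* id : ids) {
        const bool inserted = table.emplace(prefix_key(id), next++).second;
        assert(inserted && "IDs must be unique in their first 8 bytes");
        (void)inserted;   // silences the unused-variable warning under NDEBUG
    }
    return table;
}

int lookup(const std::string& id)
{
    static const auto table = make_table();
    const auto it = table.find(prefix_key(id));
    return it != table.end() ? it->second : -1;   // -1 is an arbitrary "unknown" value
}
```

Only the prefixes are stored, so a lookup reports a hit for any input that shares its first 8 bytes with a known ID.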
