4

I have this code and I'm wondering why the compiler doesn't join the two character compares to a 16-bit compare.

inline bool beginswith (char c0, char c1, const char* s) {
    return (s[0] == c0) & (s[1] == c1);
}

bool test(const char* s) {
    return beginswith('0', 'x', s);
}

I would expect the two versions in my Godbolt link (test and test2) to compile to equivalent instructions. Is there something I'm missing because this seems like a pretty trivial optimization.

Edit: The reason I would like the compiler to do this optimization is so I can write portable code with no undefined behavior and still get full performance.

Lundin
  • 195,001
  • 40
  • 254
  • 396
christianO
  • 167
  • 11
  • I can't think of any technical reason. Probably this is just such a rare case that no compiler vendor has bothered to create optimizations for it. – Thomas Feb 25 '20 at 10:04
  • 3
    I've managed to make GCC output identical assembly with `char const b[]{c0, c1}; return !std::memcmp(b, s, sizeof b);`. MSVC is unphased. – Quentin Feb 25 '20 at 10:04
  • 7
    My guess is, that the compiler does not do this optimization because here `*(const short*)s == (c0 | (c1 << 8));` an unaligned access might happen, which can have a huge performance penalty compared to the byte access version – Ctx Feb 25 '20 at 10:05
  • 1
    @Ctx isn't `*(const short*)s == ...` UB? You're accessing the object as if it were some other type. – Aykhan Hagverdili Feb 25 '20 at 10:07
  • 6
    @Ayxan: Doesn't matter. Compilers are allowed to define and therefore exploit UB; it's just the programmers who are not allowed to. – Bathsheba Feb 25 '20 at 10:08
  • 2
    `beginswith2` is undefined behavior. Depending on the endianess of your machine, you either compare `c0`/`c1` with the msb/lsb of s... – user1810087 Feb 25 '20 at 10:09
  • 2
    `const char* s = "_x"; beginswith('x', 'X', s+1);` <- `char` access to `s+1` si faster (at the hardware level, on some archs) than `short` access. The difference in output assembly might reflect that difference. – YSC Feb 25 '20 at 10:09
  • 1
    @YSC Unlikely, given that GCC and clang transform a fixed-sized `memcmp` to the same output as the `short` aliasing compare, rather than performing indexed access. – Konrad Rudolph Feb 25 '20 at 10:10
  • Dereferencing a pointer with a type other than what it actually points to is undefined behaviour. You can't make any assumptions as to why is producing code in one way or another. The resulting code can perfectly crash your program or _make demons fly out of your nose_. – Jorge Bellon Feb 25 '20 at 10:16
  • 2
    @SanderDeDycker Using `&&` introduces a branch so this might be an optimization. See the OP's godbolt, swapping to `&&` gives far worse code, simply because of the required "short circuit evaluation". – Lundin Feb 25 '20 at 10:16
  • @Lundin : you are of course absolutely right. My gut reaction that the compiler would find it harder to optimize with the `&` was clearly wrong - time to re-calibrate my gut :) – Sander De Dycker Feb 25 '20 at 10:19
  • @Bathsheba - programmers are allowed to write code with undefined behaviour, or that exploits undefined behaviour. The catches include the standard not providing them any guarantees concerning what happens (by definition) and there is the possibility of their code working differently (or not at all) with different compilers/libraries or versions of compilers/libraries or the phase of the moon. But, if they accept those consequences, there is nothing preventing them doing it. – Peter Feb 25 '20 at 10:21
  • 2
    @KonradRudolph, compilers accessing `short *` use word access because usually almost all pointers to `short` aligned at boundary of 2. Even if it is unaligned it usually programmers problem, not compilers. But when you access `char *` you do not have such guarantee. And in unaligned case it will be indeed slower on some (most modern?) architectures than byte access. – sklott Feb 25 '20 at 10:26
  • 1
    @sklott You misunderstood my comment. The compiler transforms `char c[] = {c0, c1}; return memcpy(c, s, 2) == 0;` into the same code as the `short` aliasing compare. Meaning, the compiler determined that the fastest way of implementing the memcmp is *not* via pointer increment, as hypothesised by YSC, but by performing a word-wise compare. This *might* mean that the compiler can prove that the memory is properly aligned for such access, but whether that’s the case is irrelevant for my comment; all that matters is that the compiler does this. – Konrad Rudolph Feb 25 '20 at 10:29
  • 2
    @Ctx I tried __builtin_assume_aligned in gcc and clang and it doesn't change the generated instructions so it doesn't seem to be related to alignment. – christianO Feb 25 '20 at 10:53
  • 1
    An optimization heuristic to replace multiple comparisons with one `memcmp` doesn't seem to exist. – Maxim Egorushkin Feb 25 '20 at 11:30
  • @christianO Kindly stop adding the C tag. According to our cross-tagging rules for C and C++, which you can find under the respective [tag wiki](https://stackoverflow.com/tags/c%2b%2b/info), a program that is compiled as C++ should be tagged C++ only. In this specific case the `bool` type, the character constant type and the type punning rules are _different_ between C and C++. As are `aligas`, `alignof` and misc other stuff that may or may not be relevant. – Lundin Feb 25 '20 at 15:16
  • 1
    Quick question: are you sure the `short` version actually _is_ faster? The `MOV->XOR` dependency chains should be very parallelizable, so I think it ends up only one instruction worse, and it's anyone's guess how long that really takes without benchmarking. – Useless Feb 25 '20 at 15:49

1 Answers1

1

It appears that the compilers simply aren't smart enough to treat inlining as a special case here. They generate the very same code as they do when the function has external linkage, in which case there is no connection assumed between the parameters c0 and c1.

Inside the function, c0 and c1 reside in local copies according to ABI calling convention, which isn't necessarily placing them next to each other for aligned access - even though that does happen to be the case in your specific x86 example.

Changing the caller code to alignas(int) char str[2] = {'0', 'x'}; return beginswith(str[0],str[1],s); doesn't make a difference, so alignment doesn't seem to be the (sole) reason either.

However, in case you change the function to this:

inline bool beginswith (char c0, char c1, const char* s)
{
    char tmp[2] = {c0, c1};
    return memcmp(tmp, s, 2);
}

Then you get identical machine code in both cases (godbolt].

test(char const*):
  cmp WORD PTR [rdi], 30768
  setne al
  ret

It should be noted that *(const short*)s in test2 is undefined behavior, a strict aliasing violation. The machine code for that line may not stay deterministic when the program starts to scale up or when optimizer settings are changed. This is mostly a quality of implementation thing, where the standard makes no guarantees and test2 might break at any point.

Lundin
  • 195,001
  • 40
  • 254
  • 396
  • Note: in C you might have been able to dodge that undefined behavior with union type punning, but that isn't possible to do in C++. – Lundin Feb 25 '20 at 11:55
  • 2
    The parameter passing is a red herring. You can change the parameters to a `char const (&)[2]` in C++, yet the compiler still doesn’t merge the comparisons. there simply seems to be no rule in the optimiser that would merge such comparisons. – Konrad Rudolph Feb 25 '20 at 12:34
  • In your alignment example the data pointed to by `s` is still unaligned. – Maxim Egorushkin Feb 25 '20 at 13:07
  • 2
    @MaximEgorushkin Not this again... No they _cannot_. You can go from pointer-to-type to pointer-to-char and then de-reference as character type **but not the other way around**. [What is the strict aliasing rule?](https://stackoverflow.com/questions/98650/what-is-the-strict-aliasing-rule). – Lundin Feb 25 '20 at 15:12
  • 2
    Yes, but that's only valid if the original buffer really is an array of short. If you have an array of double, and cast that to `char*`, then accessing it as a `short` elsewhere as an optimization could be problematic. Admittedly, unless it gets inlined in a context where it's also being mutated via `double*`, it might not break anything. – Useless Feb 25 '20 at 15:46
  • @MaximEgorushkin From the language standard, namely C11 6.5/7. "An object shall have its stored value accessed only by an lvalue expression that has one of the following types: ... a character type." This special case exception does _not_ somehow mean that the other way around is suddenly ok too and an object of type char can have its value accessed by any type. – Lundin Feb 26 '20 at 15:51
  • @MaximEgorushkin Ok so care to tell me how C++11 3.10/10 is any different? "If a program attempts to access the stored value of an object through a glvalue of other than one of the following types the behavior is undefined: ... a char or unsigned char type." Because that looks pretty much identical to me. – Lundin Feb 26 '20 at 16:12
  • https://gist.github.com/shafik/848ae25ee209f698763cffee272a58f8 – Maxim Egorushkin Feb 26 '20 at 16:16
  • @MaximEgorushkin All that's saying is that C++ doesn't have the concept of effective type. Which doesn't matter. How exactly does that make accessing a char array through `*(banana_t*)array` well-defined? – Lundin Feb 26 '20 at 16:22
  • @Lundin Okay, I withdraw my argument. – Maxim Egorushkin Feb 26 '20 at 16:31