5

For code such as this:

#include <stdint.h>

char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + (uint32_t)__builtin_ctz(x);
}

GCC generates a sign-extension: (godbolt link)

xor eax, eax
rep bsf eax, esi
cdqe ; sign-extend eax into rax
add rax, rdi
ret

This is, of course, completely redundant - this is blatantly sign-extending an unsigned integer. Can I convince GCC not to do this?

The problem exists since GCC 4.9.0, but before that it used to be an explicit zero-extension which is also redundant.

Michael Petch
  • 46,082
  • 8
  • 107
  • 198
harold
  • 61,398
  • 6
  • 86
  • 164
  • You could *unsafely* use `asm("" :"=r"(ctz64) :"0"(ctz));` to tell the compiler that a 64-bit output is in the same register as the 32-bit input, with no instructions (https://godbolt.org/g/Gafud2). But that's pretty horrible, and doesn't protect you from high garbage in the input 32-bit value if some optimization resulted in not just using the tzcnt result register. Also, I'm not sure what `bsf` does with `x=0`; it may not even zero-extend EAX into RAX. (I'm seeing sign extension even with `-march=haswell` where tzcnt is used explicitly, though, which always writes EAX). – Peter Cordes Feb 06 '18 at 03:45
  • gcc has a few of these missed-optimization bugs, where it doesn't know as much as it should about the output value. This one is super weird; I think internally `bsf` / `tzcnt` must be defined as returning an `int` or `long`, but IDK why casting to `uint32_t` first isn't fixing it. – Peter Cordes Feb 06 '18 at 03:49
  • I wonder if `__builtin_assume()` helps. – fuz Feb 06 '18 at 17:06
  • 1
    @PeterCordes - this looks like an actual bad code bug in the optimizer. As I mention in my answer, on underlying issue seems to be that undefined behavior of `bsf/bsr` makes what you'd normally assume about the builting return value (i.e., that it is in the range 0-31) not necessarily be true, so the compiler has to be conservative in terms of extension, but the generated code here seems _wrong_ - it could result in subtracting a value from `ptr`, ignoring the cast to unsigned (unless I'm mixing up my C conversion rules here). Probably because these intrinsics are special cased? – BeeOnRope Feb 06 '18 at 18:15
  • @BeeOnRope: Well it's not technically a bug, because [`__builtin_ctz`](https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html): *If x is 0, the result is undefined*. This isn't the `ffs` library function or anything; the builtin appears to be defined to compile efficiently with only bsf, not tzcnt. (But unfortunately these days doesn't.) – Peter Cordes Feb 06 '18 at 18:19
  • @PeterCordes - it's a bug in the sense that per the linked bugzilla the intent of the gcc devs is that "undefined result" still means "a result that acts as an `int`" not "undefined behavior in the C standard sense which means that anything can happen". They give a specific example where, for example, masking the returned value with `& 0xFF` should result in the expected behavior. That doesn't actually work though (masks are ignored, so maybe they changed the strategy at some point). That masking example doesn't actually work though (the mask is ignored)! – BeeOnRope Feb 06 '18 at 18:22
  • @BeeOnRope: Oh you're right, it says *result*, not *behaviour*, so it's not supposed to be the old here-be-dragons C UB. – Peter Cordes Feb 06 '18 at 18:40
  • FWIW the behaviour of `bsf` with a zero-input is documented by AMD, and it works the same on every Intel processor so far, they just don't admit it in their manual. – harold Feb 06 '18 at 20:51
  • @harold - yup, I don't think it will ever change. Note that in the bug I linked the gcc devs were originally reluctant to treat the output of `bsf` as anything other than totally unknown, but recent versions definitely don't take that conservative approach. In particular, gcc always seems to assume the value is in the range 0-31 (or 0-63) and so removes the the `&` from `__builtin_ctzl(x) & 0xFF` which was the litmus test described in that bug. Even though gcc knows that, it still does `cdqe`! – BeeOnRope Feb 06 '18 at 21:17
  • My conclusion is that `cdqe` (it's a `movsx` at -O0) seems to be tightly bound to the way the intrinsic is generated during some particular compile phase and casts and other tricks don't seem to work since they are either optimized away earlier, or don't interact with the intrinsic expansion which may happen "late". In some cases the `cdqe` can be made to disapear, e.g., see the `ffs` example [here](https://godbolt.org/g/TgwCr7) where adding it first to a 32-bit value results in the "expected" code without the `cdqe` but this is perhaps a peephole or other late optimization. – BeeOnRope Feb 06 '18 at 21:22

1 Answers1

4

A partial solution is to use the 64-bit version of ctz, along with a -march argument so that tzcnt is used instead of bsf, like so:

char* ptrAdd(char* ptr, uint32_t x)
{
    return ptr + __builtin_ctzl(x);
}

This results in no sign extension:

ptrAdd(char*, unsigned int):
  mov eax, esi
  tzcnt rax, rax
  add rax, rdi
  ret

It has a mov (to do the 32 to 64-bit zero extension) which replaced a zeroing xor in the 32-bit version (which was there to work around the tzcnt false-dependency-on-destination issue). Those are about the same cost, but the mov is more likely to disappear after inlining. The result of a 64-bit tzcnt is the same as a 32-bit one, except for the case of zero input which is undefined (as far as the gcc intrinsic goes, not tzcnt).

Unfortunately, without a -march argument that lets the compiler use tzcnt it will use bsf and in that case still does the sign extension.

It seems that the origin of the differing behavior between bsf and tzcnt is that in the case that bsf version is used, the instruction behavior is undefined at zero. So in principle, the instruction could return anything, even values outside the range 0 to 63 that we would normally expect. Combined with the fact that the return value is declared as int, simply omitting the sign extension could lead to "impossible" situations like (__builtin_clzl (x) & 0xff) == 0xdeadbeef.

Now per the gcc docs, zero input to __builtin_ctzl has an "undefined result" - but it isn't clear if this is the same as C/C++ "undefined behavior" where anything can happen (which would allow impossible things), or just means "some unspecified value".

You can read about this on the gcc bugzilla, where an issue has been open for about 7 years.

BeeOnRope
  • 60,350
  • 16
  • 207
  • 386
  • 1
    It seems `gcc7.3 -march=skylake` doesn't know that Skylake fixed the `tzcnt` / `lzcnt` (but not `popcnt`) false dependency on the destination. – Peter Cordes Feb 06 '18 at 17:57
  • 2
    See also this [related question](https://stackoverflow.com/q/40456481/149138) where I was asking something slightly different, but the answer is actually the answer for this question (links to the same bug entry). – BeeOnRope Feb 06 '18 at 18:28