
I'm writing a chess engine in C, and speed is essential. The engine is based on unsigned long long (which I will denote u64) and relies heavily on a least-significant-bit scan. Up until now I have been using the GCC built-in __builtin_ctzll, which has done the job just fine. However, I generated assembly for this function in isolation with gcc -S -O2, which gave me the following:

xorl     %eax, %eax
rep bsfq %rdi, %rax
cltq
ret

After some investigation, however, it seems that the code

rep bsfq %rdi, %rax
ret

does exactly the same thing in my chess program. It is, however, now ~20% slower, even though it should be faster because it is fewer instructions. The original __builtin_ctzll, however, is inlined in my C code. Is this the reason my custom assembly code is running slower than the original? When I declare the function ctzll in C, I of course can't declare it inline, because its definition is in assembly, not C.

Is there another way to optimize these assembly instructions, or should I try inlining my new assembly directly in C with asm?
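
For reference, what I am comparing is roughly the following (the file layout and declarations are simplified, and the exact return types in my real code may differ):

/* ctzll.s, assembled separately:
 *     ctzll:
 *         rep bsfq %rdi, %rax
 *         ret
 */

typedef unsigned long long u64;

extern u64 ctzll(u64 x);          /* out-of-line: every use pays a call */

static inline u64 ctz_builtin(u64 x) {
    return __builtin_ctzll(x);    /* inlined, and visible to the optimizer */
}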

  • Just because something is fewer instructions doesn't mean it's going to be faster. Also the likely reason you have the `cltq` is because you have a return type wrong. The `xor` is practically free. – Jester Apr 03 '22 at 13:30
  • The reason for the `xor` is that `bsf` produces an undefined result if the input is zero. – Jester Apr 03 '22 at 13:35
  • Note that when specifying a CPU architecture that supports `tzcnt`, the `xor` goes away. – Jester Apr 03 '22 at 13:40
  • For experimentation, you can define an inline function like this: `static inline u64 ctzll(u64 a) { u64 q; asm ("rep bsf %1, %0" : "=r"(q) : "r"(a)); return (q); }` – fuz Apr 03 '22 at 14:26
  • @Jester: `gcc -O3 -march=skylake` doesn't drop the `xor`: https://godbolt.org/z/fPnz1Gox5. On the other hand, `__builtin_ctzll` is documented as returning an undefined result when the input is 0, so the xor in principle wasn't needed in any case, but I guess gcc chooses to handle it by returning 0. clang on the other hand omits the xor on all architectures, so on an older CPU, `__builtin_ctzll(0)` will return garbage. – Nate Eldredge Apr 03 '22 at 17:10
  • @NateEldredge GCC is kinda braindead sometimes: they carefully avoid the output dependency for `tzcnt` on generic or Intel, but code-gen seems oblivious of the true dependency of `bsf` if you compile with `-march=sandybridge` or something that doesn't support TZCNT. (In practice, Intel CPUs are compatible with what AMD documents: input = 0 leaves the output unmodified for bsf/bsr, so like cmov they need the dst reg as an input.) https://godbolt.org/z/WcY3sManq shows that, and that GCC also doesn't know that `-march=skylake` fixed the false dep for tz/lzcnt. Also a mysterious `cltq` – Peter Cordes Apr 04 '22 at 02:33
  • Oh, the `cltq` is necessary because `__builtin_ctzll` is specified as returning an `int`, not `long long`. And it's just an undefined *result*, not *behaviour*, on input=0. – Peter Cordes Apr 04 '22 at 02:47

4 Answers

5

The two are not actually equivalent. Specifically they differ in the case of %rdi being 0.

It is semi-defined behaviour that bsf keeps the previous value of the destination register if the input is 0: see Why does breaking the "output dependency" of LZCNT matter?

This means that the bsf instruction has an input dependency on its output register. Zeroing the output register explicitly breaks that dependency and ensures the result is well defined (0) in this case. On most modern x86 implementations, zeroing with an xor is handled during register renaming and breaks the dependency chain, which means the xor itself is effectively free. It also means the bsf can be dispatched to the execution units without waiting on previous uses of the eax register, which may be the source of the performance difference you are seeing.

However, it is more likely that the way you have put in the short assembly has hidden it from the optimizer, forcing it to make suboptimal choices at the call sites where the function would previously have been inlined.

user1937198
2

Conclusion first: use __builtin_ctzll without converting the result to 64-bit. The following can be useful if you want to force the compiler to use tzcnt, or if you want to make your own built-in.


Since @user1937198 explained all the basics, here's some code that is reasonably fast and portable.

static inline int u64_bsf(u64 x) {
    u64 r = 0;
    __asm__ ("rep bsfq\t%1, %0" : "+r"(r) : "r"(x));
    return r;
}

/*
u64_bsf:
        xorl    %eax, %eax
        rep bsfq        %rdi, %rax
        ret
*/

You can change the return type to unsigned if you want. On most platforms including x86, using int or unsigned produces the fastest code (except sometimes for array indices). On x86 specifically, using 64-bit integers causes software emulation in 32-bit mode, and larger code size (plus slower division) in 64-bit mode. Your use of a 64-bit return type also confused the compiler into emitting a redundant cltq (which is a bad made-up name for cdqe in AT&T syntax).

rep bsf is decoded as tzcnt on machines that support it, and the rep prefix is ignored on machines that don't. tzcnt is better in that it (usually) doesn't take the output register as an input (see @user1937198's answer), and it runs a lot faster on AMD CPUs.

If you are targeting machines with tzcnt, write:

static inline int u64_tzcnt(u64 x) {
    u64 r;
    __asm__ ("tzcntq\t%1, %0" : "=r"(r) : "r"(x));
    return r;
}

/*
u64_tzcnt:
        tzcntq  %rdi, %rax
        ret
*/

Read GCC's online documentation for its inline assembly syntax (https://gcc.gnu.org/onlinedocs/gcc/Extended-Asm.html).


@zwol's answer mentioned constant folding. Here's the final code that can handle constant propagation and is inlined for a non-constant input.

static inline int u64_bsf(u64 x) {
    if (__builtin_constant_p(x)) {
        return __builtin_ctzll(x);
    }
    u64 r = 0;
    __asm__ ("rep bsfq\t%1, %0" : "+r"(r) : "r"(x));
    return r;
}

At this point, I've basically reimplemented __builtin_ctzll, but you can always make your own intrinsic function this way.
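
For example, a typical bitboard loop in a chess engine could use the wrapper above like this (the loop body is just a placeholder for real per-square work):

static int sum_of_set_squares(u64 bb) {
    int sum = 0;
    while (bb) {
        int sq = u64_bsf(bb);   /* index of the least significant set bit */
        sum += sq;              /* placeholder: do something with the square */
        bb &= bb - 1;           /* clear that bit and continue */
    }
    return sum;
}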

xiver77
  • I don't really see the point here for `__builtin_ctz` but this is a handy trick for cases where the compiler doesn't have a builtin already. – zwol Apr 03 '22 at 15:17
  • @zwol Yeah, I noticed that I reimplemented `__builtin_ctz` after finishing this post. I edited the post with your point made clear. – xiver77 Apr 03 '22 at 16:30
  • *If you are targeting machines with tzcnt write* - no, if you're *building* on a machine with an assembler that knows about the `tzcnt` mnemonic, then you can just write it as `tzcntq` instead of `rep bsfq`. The resulting machine code will be identical either way, that's the whole point. – Peter Cordes Apr 14 '22 at 01:05
1

Partial answer covering things the other existing answers didn't, like why GCC apparently wastes a cltq, why xor-zeroing helps, and why GCC's code-gen with other options like -march=skylake or -march=sandybridge is not good.


The cltq (aka cdqe) is an unfortunate consequence of __builtin_ctzll() being defined as returning an int, not long long.
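
In other words, it's the widening of that int result to 64 bits that costs the extra instruction. A rough illustration (hypothetical wrapper names; exact code-gen depends on GCC version and options):

#include <stdint.h>

uint64_t ctz_widened(uint64_t x) {
    return __builtin_ctzll(x);   /* int widened to uint64_t: the compiler sign-extends (cltq) */
}

int ctz_int(uint64_t x) {
    return __builtin_ctzll(x);   /* returned as int: no widening, so no cltq */
}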

bsf %rdi, %rax either writes RAX with a number from 0..63, or leaves it unmodified (or on paper, holding an undefined value, but Intel CPUs in reality are compatible with the behaviour AMD documents: leave output register unmodified if input was 0 for bsf or bsr, unlike with tzcnt / lzcnt).

__builtin_ctzll() is only allowed to return valid int values, even for input = 0. (GNU C specifies the builtin as "If x is 0, the result is undefined." Not "the behaviour", it's not UB, you're still guaranteed to get some 32-bit int value.)

When GCC doesn't know for sure that rep bsf will run as tzcnt not bsf, it has to cover the possibility that the destination held old garbage with high bits that aren't copies of bit 31, if you return uint64_t instead of unsigned or int. (Returning a narrower type leaves it to the caller to ignore high garbage.)

In this case, where it's also xor-zeroing the destination, that guarantees a 0 output for a 0 input, so no need to sign-extend. Unless you want to play it safe in case Intel (or some software x86 emulator) stops doing what AMD documents and actually produces something other than the old value on input=0.


IIRC, rep bsf %edi, %eax on Intel or AMD (I forget which) does always truncate RAX to 32-bit even if it leaves EAX unmodified. But on the other vendor it doesn't. So fun fact, if targeting only CPUs that do that bare minimum zero-extension, (uint64_t)(uint32_t)__builtin_ctz(x) wouldn't need any extra work.


Output dependency for bsf always, tzcnt sometimes

GCC is kinda braindead sometimes: they carefully avoid the output dependency for tzcnt on generic or Intel, but code-gen seems oblivious of the true dependency of bsf if you compile with -march=sandybridge or something that doesn't support TZCNT. (In which case it just uses bsf, without xor-zeroing the destination.)

So apparently the xor-zeroing dep-breaking was only done in case it ran as TZCNT on Intel. Which is ironic because bsf always has an output dependency, on all CPUs: in practice, Intel CPUs are compatible with what AMD documents, i.e. input = 0 leaves the output unmodified for bsf/bsr. So, like cmov, they need the dst reg as an input.

https://godbolt.org/z/WcY3sManq shows that, and that GCC also doesn't know that -march=skylake fixed the false dep for tz/lzcnt. (Why does breaking the "output dependency" of LZCNT matter?)
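
To make that concrete, this is roughly what the link shows for a 64-bit-returning wrapper like the asker's (paraphrasing the text above, not re-verified here; exact output depends on the GCC version):

unsigned long long ctz64(unsigned long long x) {
    return __builtin_ctzll(x);
}

/* gcc -O2                    : xor-zeroes the destination, then rep bsfq (false dep broken)
 * gcc -O2 -march=sandybridge : plain bsfq with no xor-zeroing (output dependency left intact)
 */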

Peter Cordes
0

You are likely to get better overall code generation with the builtin, not because of any details of the assembly generated by the builtin itself, but because the compiler knows what the builtin does. It knows the builtin has no side effects and doesn't access memory, so it can factor out repeated calls with the same argument. It knows how to "constant fold" through the builtin, so for instance it can replace __builtin_ctzll(0x123400) with 10. And so on.
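
For example (a hypothetical snippet; whether each optimization fires depends on the optimization level):

int folded(void) {
    return __builtin_ctzll(0x123400ULL);    /* can be folded at compile time to the constant 10 */
}

int reused(unsigned long long x) {
    /* no side effects and no memory access, so the two calls can be merged into one */
    return __builtin_ctzll(x) + __builtin_ctzll(x);
}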

With inline assembly, on the other hand, you tell the compiler which registers are read and written, but it has to make conservative assumptions about what the assembly does. It can't constant fold through inline assembly. It can't assume the inline assembly always produces the same result for the same input. Etc.

zwol
  • You can constant fold (with some limitations) using inline assembly. See the last part of my answer. – xiver77 Apr 03 '22 at 15:05