For this fragment of code (https://godbolt.org/z/s4PY44dha):
#include <immintrin.h>

int foo(unsigned long long x)
{
    return _lzcnt_u64(x);
}
GCC generates 3 asm instructions:
xorl %eax, %eax
lzcntq %rdi, %rax
ret
while clang generates only 2:
lzcntq %rdi, %rax
retq
Is it possible to change the implementation/signature of foo
to help GCC understand that this xor
instruction is useless? Why can't GCC perform such a simple optimization itself?
The answer to the question Why does breaking the "output dependency" of LZCNT matter? explains that this xor
may be useful on some older architectures to break the so-called "false dependency" on the destination register. It even mentions that the issue it is supposed to fix is absent from modern Intel architectures starting with "Skylake-S (client)". I tried
passing newer architectures to GCC (for example -march=rocketlake
or -march=icelake-client
), but it still inserts the "useless" xor
.
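One convenient way to compare several targets side by side in a single translation unit is a sketch like the following (it assumes a GCC recent enough to honor per-function arch=... target attributes and to allow the intrinsic inside such functions; the function names are mine, just for illustration). The generated code for both variants can then be inspected in one Compiler Explorer pane.
#include <immintrin.h>

/* Same body compiled for two different targets in one file, so the
   generated code can be compared without changing compiler flags. */
__attribute__((target("arch=haswell")))
int foo_haswell(unsigned long long x)
{
    return _lzcnt_u64(x);
}

__attribute__((target("arch=icelake-client")))
int foo_icelake(unsigned long long x)
{
    return _lzcnt_u64(x);
}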
In contrast, clang doesn't insert the xor
even for old architectures like haswell
. This means that if one wants to squeeze every bit of performance out of a particular architecture, the insertion of the xor
has to be controlled manually.
For example, with this inline assembly I managed to get code without the xor
:
int xorless_lzcntq(unsigned long long x) {
    unsigned long long res;
    /* "=r"(res): result in any general-purpose register; "r"(x): input likewise. */
    asm ("lzcntq %1, %0" : "=r"(res) : "r"(x));
    return res;
}
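For completeness, here is a minimal sanity check (my own test harness, not part of the snippet above) that the inline-asm wrapper agrees with the intrinsic; it assumes compilation with something like gcc -O2 -mlzcnt.
#include <immintrin.h>
#include <stdio.h>

/* Same wrapper as above, repeated so the test compiles standalone. */
int xorless_lzcntq(unsigned long long x) {
    unsigned long long res;
    asm ("lzcntq %1, %0" : "=r"(res) : "r"(x));
    return res;
}

int main(void) {
    /* A few edge cases: all-zero input (lzcnt is defined to return 64),
       a single low bit, a single top bit, and an arbitrary value. */
    unsigned long long tests[] = {0ULL, 1ULL, 0x8000000000000000ULL, 0x00F0000000000000ULL};
    for (int i = 0; i < 4; ++i) {
        printf("%016llx: intrinsic=%d asm=%d\n",
               tests[i], (int)_lzcnt_u64(tests[i]), xorless_lzcntq(tests[i]));
    }
    return 0;
}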