For this fragment of code (https://godbolt.org/z/s4PY44dha)

#include <immintrin.h>

int foo(unsigned long long x)
{
    return _lzcnt_u64(x);
}

GCC generates 3 asm instructions

xorl    %eax, %eax
lzcntq  %rdi, %rax
ret

while clang generates only 2

lzcntq  %rdi, %rax
retq

Is it possible to change the implementation or signature of foo to help GCC understand that this xor instruction is useless? Why can't GCC perform such a simple optimization itself?


The answer to the question Why does breaking the "output dependency" of LZCNT matter? (https://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter) explains that this xor can be useful on some older microarchitectures to break a so-called "false dependency" on the destination register. It even mentions that the issue the xor works around is no longer present in Intel microarchitectures starting from "Skylake-S (client)". I tried passing newer architectures to GCC (for example -march=rocketlake and -march=icelake-client), but it still inserts the "useless" xor.

In contrast, clang doesn't insert the xor even for old architectures like haswell. This means that if one wants to squeeze out every bit of performance on a particular architecture, the insertion of the xor has to be controlled manually.

For example, with this inline assembly I managed to get the code without the xor:

int xorless_lzcntq(unsigned long long x) {
    unsigned long long res;
    asm ("lzcntq %1, %0" : "=r"(res) : "r"(x));
    return res;
}
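
A possible refinement of the inline-assembly version is sketched below. It tries to address two drawbacks of plain inline asm: the compiler cannot fold the call when the argument is a compile-time constant, and it does not know that the result lies in the range 0..64. The name lzcnt64_noxor is hypothetical, and the sketch assumes GCC or clang on x86-64 and a CPU that supports LZCNT; on other targets it falls back to __builtin_clzll.

```c
/* Sketch only: lzcnt64_noxor is a hypothetical name, not from the question. */
static inline int lzcnt64_noxor(unsigned long long x)
{
#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    /* Compile-time constant argument: let the compiler fold the result
       instead of hiding the computation inside an asm statement. */
    if (__builtin_constant_p(x))
        return x ? __builtin_clzll(x) : 64;

    unsigned long long res;
    /* No xor is emitted before the lzcnt. Assumption: the CPU supports
       LZCNT and the false dependency on the destination register does
       not matter for the code in question. */
    __asm__("lzcntq %1, %0" : "=r"(res) : "r"(x) : "cc");

    /* Give the optimizer back the value range 0..64. */
    if (res > 64)
        __builtin_unreachable();
    return (int)res;
#else
    /* Portable fallback: __builtin_clzll is undefined for 0. */
    return x ? __builtin_clzll(x) : 64;
#endif
}
```

Whether this is actually faster than the xor version depends on the surrounding code and the microarchitecture, so it is worth benchmarking both.
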
Curious
  • 6
    The xor is not useless. It breaks a false dependency on the destination register on some micro architectures. – fuz Jun 21 '21 at 11:37
  • 2
    so the question should be how do I get clang to add the xor? – old_timer Jun 21 '21 at 11:52
  • 2
    Does this answer your question? [Why does breaking the "output dependency" of LZCNT matter?](https://stackoverflow.com/questions/21390165/why-does-breaking-the-output-dependency-of-lzcnt-matter) – Daniel Langr Jun 21 '21 at 11:52
  • 1
    Note that for different cpu e.g. `-march=znver2` the `xor` is not emitted. – Jester Jun 21 '21 at 11:55
  • @DanielLangr It answers it partially; at least now I understand why this xor is used. Apparently, gcc and clang are not perfect at detecting when it is required. It would be nice to control the insertion or omission of the xor manually. Probably my best option is to use inline assembly. – Curious Jun 21 '21 at 12:00
  • 1
    @Curious Quite honestly, you shouldn't worry about this. The `xor` is a zeroing idiom and handled entirely in the front end. So unless your code is bottlenecked on frontend throughput (and it likely isn't), this is not the place you need to optimise. – fuz Jun 21 '21 at 12:25
  • 1
    Or just use `-march=znver2` for this file, if `-march=skylake` doesn't know that SKL fixed the lzcnt/tzcnt false deps (but not popcnt). Inline asm defeats constant propagation, and the compiler won't know that the result is a small non-negative integer in the range 0..64. That might or might not matter in any given use-case. Also note that in a larger function, GCC can `lzcnt same,same` to sidestep the false dep without xor. – Peter Cordes Jun 21 '21 at 12:26
  • @old_timer: clang does use xor when it can see a loop within one function where this would create a loop-carried dependency. But it's reckless about it otherwise, and assumes that the incoming RAX wasn't just used as a temporary result of a cache-miss load in the caller for example. – Peter Cordes Jun 21 '21 at 12:29
  • @PeterCordes I'm afraid that if I use `-march=znver2` then I will potentially lose other architecture-specific optimizations. – Curious Jun 21 '21 at 12:36
  • I can't think of a feature SKL has that Zen2 doesn't, which GCC would use directly. More importantly, though, it would make other tuning choices. But I was suggesting using it for just one file. You could use `-march=native -mtune=znver2` - Zen2 has some similarity to Skylake, having full-width 256-bit vector hardware, although other tuning choices may differ. Since getting perfect behaviour is apparently impossible with current GCC `-mtune=skylake`, it's worth benchmarking both ways to see which is faster for any specific code you might care about. – Peter Cordes Jun 21 '21 at 12:41
  • Report it here. https://gcc.gnu.org/bugzilla/show_bug.cgi?id=62011 – Валерий Заподовников Aug 13 '22 at 06:30
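
The per-file `-mtune=znver2` suggestion from the comments might also be narrowed to a single function with GCC's `target` function attribute. This is an untested sketch: foo_tuned is a hypothetical name, and it assumes that the per-function `tune=` setting affects the xor insertion the same way the command-line option does, which should be checked in the generated assembly for the GCC version in use. The attribute is only applied for GCC on x86-64; elsewhere the sketch falls back to `__builtin_clzll`.

```c
#if defined(__x86_64__) && defined(__GNUC__) && !defined(__clang__)
#include <immintrin.h>

/* Assumption: tune=znver2 changes only the tuning of this one function,
   while lzcnt enables the instruction for it. foo_tuned is a
   hypothetical name, not from the question. */
__attribute__((target("lzcnt,tune=znver2")))
int foo_tuned(unsigned long long x)
{
    return (int)_lzcnt_u64(x);
}
#else
/* Fallback for other compilers and targets:
   __builtin_clzll is undefined for 0, so handle it explicitly. */
int foo_tuned(unsigned long long x)
{
    return x ? __builtin_clzll(x) : 64;
}
#endif
```

The rest of the translation unit keeps whatever `-march` and `-mtune` were given on the command line, so other architecture-specific tuning is left alone.
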

0 Answers