
I have been implementing various node-based binary search trees using C-ish C++ code. When benchmarking them I have noticed surprisingly large performance variations, both across compilers and in response to small code changes.

When I focused on insertion and removal in a tree that allows duplicates (as a C++ std::multiset<int> would), I found that almost all of the time is spent zig-zagging down the tree's left and right pointers in operations like "find" and "lower_bound", rather than in the conceptually "expensive" rebalancing steps that follow inserts and deletes.

So I began to focus on one case in particular: lower bound.

// Node is a binary tree node.  It has the
// usual left and right links and an
// integral key.
struct Node {
    int key;
    Node* links[2];
};

// LowerBound returns the first node in
// the tree rooted at "x" whose key is
// not less than "key", or null if there
// is no such node.
Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
  }
  return lower;
}

A few points and observations:

  1. I am on an AMD Ryzen 9 5900X 12-Core. My understanding was that the conditional move (cmov) instructions are faster on AMD than on Intel (that understanding was wrong, see Peter Cordes' comment on this post), but when I spot-check results on my 8-year-old Intel laptop, the code that is faster on AMD is faster on Intel too.
  2. I am running Linux. I've turned off hyperthreading and boost mode, and set the CPU scaling governor to "performance" using this script I wrote. The performance numbers are stable with little variation.
  3. The code above is the result of several optimization iterations. I have a benchmark (code here) that exercises various tree sizes, allocates nodes in an array in either random or ascending-by-key order, writes a key access pattern to another array, and then runs through them repeatedly. The key access patterns are either ascending or random. In larger trees, code that uses branches, rather than cmov or similar, is often much slower.
  4. One key optimization seems to be using an array of links (Node* links[2]) in the node instead of explicit left and right pointers (see the sketch after this list). With explicit fields gcc is very quick to switch to branchy code, which is slower. With the links array gcc indexes it as I have written.
  5. In fact, when I use gcc's profile-guided optimization it still switches to branch-based code, for a 1.5x to 2x performance loss.
  6. In all cases, except for very tiny trees where branchy code can win, clang generates faster code for this function.
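
For reference on point 4, here is the explicit-field layout I am comparing against (a sketch; the NodeLR and LowerBoundLR names are just for illustration). With this layout gcc turns the child selection into branches:

struct NodeLR {
    int key;
    NodeLR* left;
    NodeLR* right;
};

// Same algorithm as LowerBound below, but with named
// left/right fields instead of a links[2] array.
NodeLR* LowerBoundLR(NodeLR* x, int key) {
  NodeLR* lower = nullptr;
  while (x != nullptr) {
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x_gte ? x->left : x->right;
  }
  return lower;
}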

With the code above on godbolt we can see clang generating the following:

LowerBound(Node*, int):
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .LBB0_3
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
        xorl    %ecx, %ecx
        cmpl    %esi, (%rdi)
        setl    %cl
        cmovgeq %rdi, %rax
        movq    8(%rdi,%rcx,8), %rdi
        testq   %rdi, %rdi
        jne     .LBB0_1
.LBB0_3:
        retq

while gcc is doing worse:

LowerBound(Node*, int):
        xorl    %eax, %eax
        testq   %rdi, %rdi
        je      .L5
.L4:
        cmpl    %esi, (%rdi)
        setl    %dl
        cmovge  %rdi, %rax
        movzbl  %dl, %edx
        movq    8(%rdi,%rdx,8), %rdi
        testq   %rdi, %rdi
        jne     .L4
        ret
.L5:
        ret

The gcc variant is roughly 2x slower on my machine (geomean of the timings across tree heights 1 to 18). Can this be explained in a simple way? I notice that clang clears %ecx first, then sets %cl, then uses %rcx, whereas gcc sets %dl and then zero-extends it into %edx before using %rdx.

gcc's approach is logically equivalent but much slower in practice. Can it be improved?
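
For completeness, here is a variant I also tried (a sketch; the LowerBoundBothLoads name is just for illustration): load both links before the key comparison and pick one with a ternary, hoping for two independent loads feeding a conditional move. As noted in the comments, clang sinks the loads back below the comparison and gcc switches to branches, so neither compiler keeps this shape:

// Semantically identical to LowerBound above; only the
// position of the loads relative to the comparison
// differs in the source.
Node* LowerBoundBothLoads(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    Node* left = x->links[0];   // hoped-for early load
    Node* right = x->links[1];  // hoped-for early load
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x_gte ? left : right;
  }
  return lower;
}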

MattArmstrong
  • Would you be able to share the entire testbench? Otherwise I would just be speculating. – Something Something Oct 06 '22 at 23:00
  • *(cmov) instructions are faster on AMD than on Intel* - That stopped being true with Broadwell and Skylake, so nearly a decade ago. They're single uop on Intel. (Except for `cmovbe` / `cmova` which need CF *and* ZF from the SPAZO group, so they have 4 inputs and need 2 uops.) GCC's problem here is ironically [partial-register false dependencies](https://stackoverflow.com/questions/41573502/why-doesnt-gcc-use-partial-registers) from writing `DL` and *then* doing a `movzx`; normally GCC is more careful and clang is cavalier. (@HenriqueBucher's answer shows the consequences.) – Peter Cordes Oct 07 '22 at 13:59
  • There's a section in my answer on [What is the best way to set a register to zero in x86 assembly: xor, mov or and?](https://stackoverflow.com/q/33666617) about `xor`-zero / set FLAGS / `setcc cl` like clang is doing, vs. `setcc dl` / `movzx edx,dl` like GCC is doing. Especially silly that GCC defeats mov-elimination by extending within the same register, making the critical path latency longer. – Peter Cordes Oct 07 '22 at 14:03
  • (actually, AMD doesn't do mov-elimination of `movzx`, only Intel does that. And even with LLVM's way, there's still a loop carried dependency; as you say you avoided branching. GCC's way makes it 1 cycle longer than necessary, which is bad when it's only load-use latency + cmp + setcc (+movzx). Hmm, and maybe a 2nd load-use latency as part of the cmp? An extra 1 in 6 or 1 in 10 cycles doesn't explain a 2x difference, so perhaps there's some other less obvious effect as well.) – Peter Cordes Oct 07 '22 at 15:13
  • There's a canonical about the two strategies for materializing a FLAG condition as a 32-bit integer: [Why XOR before SETcc?](https://stackoverflow.com/q/68034464) – Peter Cordes Oct 17 '22 at 11:23

1 Answer


Using llvm-mca, a tool from the LLVM suite that statically analyzes machine code for a given target architecture, we can see that there is indeed a difference.
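
A typical invocation looks something like this (a sketch; the source file name and iteration count are my assumptions here, -mcpu selects the simulated CPU and -timeline enables the timeline view):

# use -mcpu=skylake for the Skylake numbers
clang++ -O3 -S lower_bound.cc -o - | llvm-mca -mcpu=znver3 -iterations=100 -timeline
g++     -O3 -S lower_bound.cc -o - | llvm-mca -mcpu=znver3 -iterations=100 -timeline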

For the Intel Skylake architecture the code generated by GCC versus LLVM:

Instructions:      1200 vs 1200 
Total Cycles:      1305 vs 1205
Total uOps:        1700 vs 1400

For the AMD Zen3 architecture the code generated by GCC versus LLVM:

Instructions:      1200 vs 1100 
Total Cycles:      1205 vs 1105
Total uOps:        1200 vs 1100

The average wait times for GCC were about 20% higher:

Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     3     0.0    0.0    12.0      xorl   %eax, %eax
1.     3     11.0   0.3    0.7       testq  %rdi, %rdi
2.     3     12.0   0.0    0.0       je .L5
3.     3     11.0   0.3    0.0       cmpl   %esi, (%rdi)
4.     3     16.0   0.0    0.0       setl   %dl
5.     3     17.0   0.0    0.0       movzbl %dl, %edx
6.     3     15.0   0.0    1.0       cmovgeq    %rdi, %rax
7.     3     17.0   0.0    0.0       movq   8(%rdi,%rdx,8), %rdi
8.     3     22.0   0.0    0.0       testq  %rdi, %rdi
9.     3     23.0   0.0    0.0       jne    .L4
10.    3     1.0    1.0    18.0      retq
11.    3     1.7    1.7    17.3      retq
       3     12.2   0.3    4.1       <total>

Against the code generated by LLVM

Average Wait times (based on the timeline view):
[0]: Executions
[1]: Average time spent waiting in a scheduler's queue
[2]: Average time spent waiting in a scheduler's queue while ready
[3]: Average time elapsed from WB until retire stage

      [0]    [1]    [2]    [3]
0.     3     0.0    0.0    11.7      xorl   %eax, %eax
1.     3     10.3   0.3    0.7       testq  %rdi, %rdi
2.     3     11.0   0.0    0.0       je .LBB0_3
3.     3     0.0    0.0    12.0      xorl   %ecx, %ecx
4.     3     10.0   0.3    0.0       cmpl   %esi, (%rdi)
5.     3     15.0   0.0    0.0       setl   %cl
6.     3     14.7   0.0    0.0       cmovgeq    %rdi, %rax
7.     3     15.3   0.0    0.0       movq   8(%rdi,%rcx,8), %rdi
8.     3     20.0   0.0    0.0       testq  %rdi, %rdi
9.     3     21.0   0.0    0.0       jne    .LBB0_1
10.    3     1.0    1.0    16.0      retq
       3     10.8   0.2    3.7       <total>

We can also see that the resource pressure per iteration for GCC is much higher:


Resources:
[0]   - Zn3AGU0
[1]   - Zn3AGU1
[2]   - Zn3AGU2
[3]   - Zn3ALU0
[4]   - Zn3ALU1
[5]   - Zn3ALU2
[6]   - Zn3ALU3
[7]   - Zn3BRU1
[14.0] - Zn3LSU
[14.1] - Zn3LSU
[14.2] - Zn3LSU
[15.0] - Zn3Load
[15.1] - Zn3Load
[15.2] - Zn3Load

Resource pressure per iteration:
[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]    
1.33   1.33   1.34   3.33   1.35   1.65   2.65   2.02   

[14.0] [14.1] [14.2] [15.0] [15.1] [15.2] 
 1.33   1.33   1.34   1.33   1.33   1.34 

Against LLVM

[0]    [1]    [2]    [3]    [4]    [5]    [6]    [7]  
1.00   1.00   1.00   2.55   0.99   1.01   2.50   1.95

[14.0] [14.1] [14.2] [15.0] [15.1] [15.2] 
 1.00   1.00   1.00   1.00   1.00   1.00  

It looks like LLVM does a much better job of keeping pipeline pressure down.

If you are interested in only certain portions of the execution, such as the inner loop, you can mark the regions to be analyzed, like this:

Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  while (x != nullptr) {
    __asm volatile("# LLVM-MCA-BEGIN foo":::"memory");
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
    __asm volatile("# LLVM-MCA-END foo":::"memory");
  }
  return lower;
}

This brings total cycles to 1303 for GCC and 1203 for LLVM.

Compiler Explorer: https://godbolt.org/z/8KoKfab34
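
As discussed in the comments below, the fences can also go just outside the loop (a sketch): the markers then no longer sit inside the loop body where they could interfere with loop optimizations, and the loop condition (which runs every iteration) stays inside the analyzed region, at the cost of also including the one-time setup before the first iteration:

Node* LowerBound(Node* x, int key) {
  Node* lower = nullptr;
  // Region covers the loop setup and the full loop,
  // but not the final ret.
  __asm volatile("# LLVM-MCA-BEGIN loop":::"memory");
  while (x != nullptr) {
    bool x_gte = !(x->key < key);
    lower = x_gte ? x : lower;
    x = x->links[!x_gte];
  }
  __asm volatile("# LLVM-MCA-END loop":::"memory");
  return lower;
}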

Something Something
  • It looks like you told LLVM-MCA to analyze the *whole function* as a loop body, not the actual loop body. Note the two `ret` instructions you're counting for GCC's version, and one in LLVM's. The actual difference is that GCC's critical-path latency is higher by 1 cycle, because it picked an inefficient way to make a 0 / 1 integer from a FLAGS condition, including a `movzx` where mov-elimination won't work because it's extending within the same register. – Peter Cordes Oct 07 '22 at 15:07
  • But this is AMD, so mov-elim wasn't an option, only xor/cmp/stcc – Peter Cordes Oct 07 '22 at 16:20
  • @Matt - good edit. So yeah, we're still getting LLVM's 1105c vs. 1205c for 100 iters. That looks right, one extra cycle per iteration, since there already is a loop-carried dependency that includes two load-use latencies so bumping up from 11 to 12 looks about right. (LLVM-MCA assumes L1d hits, so about 4 cycles I think). It would be faster to unconditionally load both pointers and `cmov` to select one, since that could happen in parallel with the load that feeds the `cmp`, instead of not even being able to start until after cmp/setcc. (Everything else would stay the same, but 4c faster) – Peter Cordes Oct 07 '22 at 17:37
  • @Matt: The fact we still got sensible loop performance calculations with the code before/after the loop as part of it is because that code happened not to mess up the critical path, and the bottleneck was latency. So the extra throughput work of doing `ret`s wasn't contributing to the bottleneck. (LLVM-MCA is somewhat simplistic, but the more sophisticated uiCA agrees for Skylake. It only models Intel CPUs, but models the front-end in detail. https://uica.uops.info/) – Peter Cordes Oct 07 '22 at 17:41
  • @Matt: yeah, parallel loads and cmov to select between them brings the c / iter down to 7 on Skylake. https://uica.uops.info/?code=.LBB0_1%3A%0D%0Acmp%20%20%25esi%2C%20(%25rdi)%0D%0Amov%20%2016(%25rdi)%2C%20%25r8%0D%0Amov%20%208(%25rdi)%2C%20%25rdi%0D%0Acmovge%20%20%25r8%2C%20%25rdi%0D%0Acmovge%20%25rdi%2C%20%25rax%0D%0Atest%20%20%20%25rdi%2C%20%25rdi%0D%0Ajne%20%20%20%20%20.LBB0_1&syntax=asATT&uArchs=SKL&tools=uiCA&alignment=0&uiCAHtmlOptions=traceTable. Should be the same on AMD, since it seems SKL has the same L1d load-use latency and other latencies as Znver3. – Peter Cordes Oct 07 '22 at 17:51
  • @PeterCordes, thanks for distracting yourself with this. With https://godbolt.org/z/Y1W9fYdxT I've confirmed that if I force clang's loop body assembler into an otherwise gcc compiled binary the run times are nearly identical in my benchmark harness. I assume you did these tests by editing the assembler directly? If I try to do the loads of left and right early and use a ternary at the C level, clang puts the loads after the key comparison and gcc switches to branches. That is the second half of my question: how to get gcc to "behave" here, though I suspect the answer is "you can't". – MattArmstrong Oct 07 '22 at 18:21
  • @MattArmstrong: Yup, copy/pasted the loop body into https://uica.uops.info/ and hand-edited. (Without worrying too much about whether `cmovge` or `cmovl` was correct; same performance). I haven't yet tried to hand-hold GCC or clang into making asm like that, perhaps by using two `T *p1 = p->left, *p2 = p->right` temporaries. And if that doesn't work, you can make those accesses `volatile` to force them both to happen; That could be optimization-defeating but unlikely that any real code will have compile-time-constant known left or right members. Except maybe NULL, hmm. – Peter Cordes Oct 07 '22 at 18:23
  • You could also mark the sections within the code that you want to analyze. I edited the answer to show how it is done. – Something Something Oct 07 '22 at 21:36
  • It's normally better to put the `asm volatile` statements *outside* the loop like Matt already did in his edit to your answer. That avoids interfering with loop unrolling or other optimizations, plus it makes sure the loop condition is part of what's being analyzed, because it does run every iteration. Hrm, but on 2nd look, @Matt's version with `asm()` comment fences outside the loop included the before-first-iteration stuff, like `xorl %eax, %eax` and test/je to skip the loop if the pointer is initially NULL. It did omit the `ret` at the bottom of the function, though. – Peter Cordes Oct 07 '22 at 22:00
  • I did not see those. – Something Something Oct 07 '22 at 23:51
  • I'll go ahead and accept this answer because it is full of great stuff. The root question is still open, however. Can gcc generate code that runs as fast as the code clang emits? The answer seems to be that there is no obvious way -- you have to resort to inline assembler in this case. – MattArmstrong Oct 13 '22 at 00:53
  • @MattArmstrong tbh I feel the gcc team at this point is very reactive. GCC is slowly being deprecated as LLVM is thriving. They don't even have a Discord server, for example. – Something Something Oct 13 '22 at 01:27