
Consider the following x86-64 assembly:

inner:
   ...
   ret

outer:
.top:
   call inner
   dec  rdi
   jnz  .top
   ret

The function outer simply calls the function inner repeatedly (inner's body isn't shown - it may be empty).

Does the series of call instructions in outer, and the corresponding ret instructions inside inner, form a dependent chain in practice (for the purposes of estimating performance)?

There is more than one way this chain could be formed. For example, does the ret depend on the latency of the preceding call instruction, and does the subsequent call instruction then depend on the ret, forming a call -> ret -> call chain? Or perhaps the ret is independent but the call is not, forming a call -> call chain? If there is a chain, is it through memory, a register, the stack engine, the return address predictor¹, or something else?
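For illustration, here is a rough sketch of the work a call/ret pair does if we write out the store and reload by hand and ignore any prediction (these are not real uops; return_address and tmp are just placeholder names). If the reload really had to wait for the preceding store to execute, this would be the hypothetical call -> ret -> call chain:

call inner      ; roughly:  rsp -= 8
                ;           store qword [rsp], return_address   ; address of the next instruction
                ;           jmp   inner
...
ret             ; roughly:  tmp = load qword [rsp]              ; reloads what the call stored
                ;           rsp += 8
                ;           jmp   tmp                           ; indirect jump through the loaded value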

Motivation: This question originated from a series of comments on another question, mostly this comment and earlier ones.


¹ The terminology might be somewhat unclear here: the stack engine is normally understood to handle transforming rsp-modifying instructions into a single access with an appropriate offset, so that push rax; push rbx might be transformed into something like mov [t0], rax; mov [t0 - 8], rbx, where t0 is some temporary register that captured the value of rsp at some point. It is also understood to handle a similar transformation for call and ret instructions, which both modify the stack in a way similar to push and pop, while also including a direct and an indirect jump, respectively. The CPU also includes a mechanism to predict that indirect return jump, which some lump under "stack engine" - but here I'm separating that out as the "return address predictor".

BeeOnRope
  • You don't indent instructions relative to labels? Heathen. – Peter Cordes Mar 09 '18 at 23:47
  • 5
    I usually indent instructions by one stop (4 or 8 spaces), regardless of context. Labels and directives have no leading space. Sometimes I indent inner-loop labels by a space or two, if there are other labels nearby. I don't try to make it look like structured language indenting, but it's nice to be able to spot branch targets vs. instructions. I often also leave blank lines around groups of logically separate instructions within the same basic block (if I haven't optimized by interleaving everything). – Peter Cordes Mar 09 '18 at 23:57
  • 1
    All assembly code I've ever read (or written) is indented as @Peter describes (except that pseudo-ops are often indented the same as opcodes). I've never seen any like your example before. – prl Mar 10 '18 at 03:06
  • This is hard to test because of front-end bottlenecks on taken branches. I'm playing around with `call` => `pop rax` / `jmp after_call`. In an unrolled loop with 4 `call` instructions and 4 call-targets with hard-coded `jmp` returns, I found that putting `align 16` between calls and between functions was needed to avoid `dsb2mite_switches.penalty_cycles`. But I have a hard time getting it to go faster than with obviously data-dependent `push rbx` / `pop rbx` pairs and direct `jmp`, which would prove that it bypasses a store/reload bottleneck. – Peter Cordes Mar 10 '18 at 04:17

1 Answer


No, branch-prediction + speculative execution break the store/reload dependency.

RIP is (speculatively) known by the front-end, from the return-address predictor. The next call instruction can thus push a return address without waiting for the ret to execute (which actually loads the return address from the stack and verifies the prediction against it).

Speculative stores can enter the store buffer and be store-forwarded.

There is of course a dependency chain, but it's not loop-carried. Out-of-order execution hides it by keeping many iterations in flight.

Proof: call's store breaks what would otherwise be a loop-carried memory dependency chain.

align 64
global _start
_start:
    mov     ebp, 250000000   ; I had been unrolling by 4, should have changed this to 5000... before measuring, but forgot.

align 32
.mainloop:
    call  delay_retaddr
    call  delay_retaddr

    dec ebp
    jg .mainloop

    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)

    ;; Placing this function *before* _start, or increasing the alignment,
    ;; makes it somewhat slower!
align 32
delay_retaddr:
    add qword [rsp], 0
    add qword [rsp], 0    ; create latency for the ret addr
    ret

Assemble and link with `yasm -felf64 -Worphan-labels -gdwarf2 foo.asm && ld -o foo foo.o`, producing a static ELF binary.

Profiled (on an i7-6700k) with ocperf.py, I get 0.99 instructions per core clock cycle:

$ taskset -c 3 ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults,cycles,branches,instructions,uops_issued.any,uops_executed.thread,dsb2mite_switches.penalty_cycles -r2 ./foo

 Performance counter stats for './foo' (2 runs):

        645.770390      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.05% )
                 1      context-switches          #    0.002 K/sec                    ( +-100.00% )
                 0      cpu-migrations            #    0.000 K/sec
                 2      page-faults               #    0.004 K/sec                    ( +- 20.00% )
     2,517,412,984      cycles                    #    3.898 GHz                      ( +-  0.09% )
     1,250,159,413      branches                  # 1935.919 M/sec                    ( +-  0.00% )
     2,500,838,090      instructions              #    0.99  insn per cycle           ( +-  0.00% )
     4,010,093,750      uops_issued_any           # 6209.783 M/sec                    ( +-  0.03% )
     7,010,150,784      uops_executed_thread      # 10855.485 M/sec                   ( +-  0.02% )
            62,838      dsb2mite_switches_penalty_cycles #    0.097 M/sec                    ( +- 30.92% )

       0.645899414 seconds time elapsed                                          ( +-  0.05% )

With the called function before _start, and alignment values of 128, IPC can go down from 0.99 to 0.84, which is super-weird. Counts for dsb2mite switches are still low-ish, so it's mostly still running from the uop cache, not the legacy decoders. (This Skylake CPU has the microcode update that disables the loop buffer, in case that would be relevant with all this jumping.)

To sustain good throughput, the CPU has to keep many iterations of the inner loop in flight because we've significantly lengthened the independent dep chains that need to overlap.


Changing the add [rsp], 0 instructions to [rsp+16] creates a loop-carried dependency chain on a different location, which isn't being stored to by call. So the loop bottlenecks on that store-forwarding latency and runs at ~half speed.
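For reference, the modified function for this test looks something like this (a sketch; only the addressing mode changes, and [rsp+16] is a slot that call never stores to):

align 32
delay_retaddr:
    add qword [rsp+16], 0   ; read-modify-write a slot the call's store never touches
    add qword [rsp+16], 0   ; each load now waits for the previous add's store to this slot
    ret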

# With  add qword [rsp+16], 0

 Performance counter stats for './foo' (2 runs):

   1212.339007      task-clock (msec)         #    1.000 CPUs utilized            ( +-  0.04% )
             2      context-switches          #    0.002 K/sec                    ( +- 60.00% )
             0      cpu-migrations            #    0.000 K/sec                  
             2      page-faults               #    0.002 K/sec                  
 4,727,361,809      cycles                    #    3.899 GHz                      ( +-  0.02% )
 1,250,292,058      branches                  # 1031.306 M/sec                    ( +-  0.00% )
 2,501,537,152      instructions              #    0.53  insn per cycle           ( +-  0.00% )
 4,026,138,227      uops_issued_any           # 3320.967 M/sec                    ( +-  0.02% )
 7,026,457,222      uops_executed_thread      # 5795.786 M/sec                    ( +-  0.01% )
       230,287      dsb2mite_switches_penalty_cycles #    0.190 M/sec                    ( +- 68.23% )

   1.212612110 seconds time elapsed                                          ( +-  0.04% )

Note that I'm still using an RSP-relative address so there's still a stack-sync uop. I could have kept both cases the same and avoided it in both by using an address relative to a different register (e.g. rbp) to address the location where call/ret store/reload the return address.
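A sketch of one way to do that (not what was measured above; rbx is used here because ebp is already the loop counter): capture the address of the return-address slot once, outside the loop, since rsp has the same value at the top of every iteration:

    ; in _start, before .mainloop:
    lea     rbx, [rsp-8]      ; address where each call will store its return address

    ; in delay_retaddr (use [rbx+16] for the second experiment):
    add     qword [rbx], 0    ; same location as [rsp] inside the callee,
    add     qword [rbx], 0    ; but not rsp-relative, so no stack-sync uop here
    ret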

I don't think the variable latency of store-forwarding (which is worse when a reload tries to execute right away, back-to-back) is sufficient to explain the difference; compare how adding a redundant assignment speeds up code when compiled without optimization. This is a factor-of-2 speedup from breaking the dependency (0.99 IPC vs. 0.53 IPC, with the same instructions, just a different addressing mode).

The instructions are 1 byte longer with the disp8 in the addressing mode, and there was front-end weirdness with alignment in the faster version, but moving things around doesn't seem to change anything with the [rsp+16] version.


Using a version that creates a store-forwarding stall (with add dword [rsp], 0) makes the dep chain too long for OoO exec to hide easily. I didn't play around with this a huge amount.
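For reference, that variant is just a narrower read-modify-write of the return-address slot (a sketch; adding 0 to the low dword leaves the return address unchanged, but the ret's qword reload only partially overlaps the preceding dword store, so it can't be forwarded and stalls):

align 32
delay_retaddr:
    add dword [rsp], 0    ; 4-byte store into the 8-byte return-address slot
    ret                   ; qword reload is wider than the dword store -> store-forwarding stall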

Peter Cordes
  • This is pretty conclusive evidence that the stack slot `[rsp]` doesn't form part of a circular chain between the `call` and `ret`. Why then do back-to-back `call/ret` always seem to take > 4 cycles, though? If everything is executing in parallel you have only 5 fused uops and probably should bottleneck on `p6` at 1 `call/ret` pair per 2 cycles? – BeeOnRope Mar 11 '18 at 19:36
  • 1
    Perhaps there is a _hidden_ dependency through something like the RAS predictor, and/or some interaction with the front-end. For example, maybe the return-address prediction takes a few cycles to be generated, so the front-end is full of bubbles in such a dense call scenario which results in the observed throughput. As the surrounding code actually starts _doing_ something this becomes less important. – BeeOnRope May 27 '18 at 22:21
  • @BeeOnRope: Other than loops, it seems SKL can only predict one taken branch per 2 cycles. This came up recently in comments. Also found [Why jnz requires 2 cycles to complete in an inner loop](https://stackoverflow.com/q/54156499) where Hadi says the DSB "seems to be only able to deliver one jump uop of the inner loop every other cycle" at least for that test case. – Peter Cordes Jan 18 '22 at 17:31
  • 1
    Skylake-ish can do 1 taken jump per cycle, including forward jumps, but certain conditions have to be met. I wrote a bit about it [here](https://github.com/travisdowns/uarch-bench/blob/3eb58bba47d1cb647dd8bc317c3044161ba74a0a/x86-branch.asm#L6) though my investigation was incomplete, but IIRC it is enough that the forward jumps are spaced 32 bytes apart. That is, you can only have one "fast" jump in a 32-byte chunk, probably due to a limitation in the BTB. – BeeOnRope Jan 18 '22 at 21:33
  • For small loops, those conditions are usually met because there is only one jump/target, but forward-jump tests by construction need many, and people usually write them such that several jumps fall into the same 32B chunk. In uarch-bench you can run `./uarch-bench.sh --test-name=branch/*`, and the test `20 uncond jumps by 32 then 32 bytes`, which is only forward jumps spaced 32 bytes apart, takes 20.95 cycles (the extra cycle is the loop back to the start, I guess). A sketch of that spacing is below. – BeeOnRope Jan 18 '22 at 21:37
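A minimal sketch of the layout described in that last comment (hypothetical labels, NASM/YASM syntax): each unconditional forward jump sits in its own 32-byte chunk, which is the spacing that reportedly lets the front-end sustain roughly one taken jump per cycle:

align 32
j0: jmp j1          ; one taken branch per 32-byte chunk
align 32
j1: jmp j2
align 32
j2: jmp j3
align 32
j3:                 ; ...and so on; packing several jumps into one 32B chunk is slower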