intel.com lists the Skylake throughput (CPI) of adcx/adox as 0.5.
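
The kkk input file passed to llvm-mca below is assumed to contain just the six instructions visible in the timeline, alternating adcx (which chains only through CF) and adox (which chains only through OF):

adcxq   %rax, %rbx
adoxq   %rcx, %rdx
adcxq   %rsp, %rbp
adoxq   %rsi, %rdi
adcxq   %r8, %r9
adoxq   %r10, %r11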

But llvm-mca reports:

$ llvm-mca kkk -mcpu=skylake -timeline --timeline-max-iterations=10 --timeline-max-cycles=999

...
Timeline view:
                    0123456789          0123456789          0123456789   
Index     0123456789          0123456789          0123456789          012

[0,0]     DeER .    .    .    .    .    .    .    .    .    .    .    . .   adcxq   %rax, %rbx
[0,1]     D=eER.    .    .    .    .    .    .    .    .    .    .    . .   adoxq   %rcx, %rdx
[0,2]     D==eER    .    .    .    .    .    .    .    .    .    .    . .   adcxq   %rsp, %rbp
[0,3]     D===eER   .    .    .    .    .    .    .    .    .    .    . .   adoxq   %rsi, %rdi
[0,4]     D====eER  .    .    .    .    .    .    .    .    .    .    . .   adcxq   %r8, %r9
[0,5]     D=====eER .    .    .    .    .    .    .    .    .    .    . .   adoxq   %r10, %r11
[1,0]     .D=====eER.    .    .    .    .    .    .    .    .    .    . .   adcxq   %rax, %rbx
[1,1]     .D======eER    .    .    .    .    .    .    .    .    .    . .   adoxq   %rcx, %rdx
[1,2]     .D=======eER   .    .    .    .    .    .    .    .    .    . .   adcxq   %rsp, %rbp
[1,3]     .D========eER  .    .    .    .    .    .    .    .    .    . .   adoxq   %rsi, %rdi
[1,4]     .D=========eER .    .    .    .    .    .    .    .    .    . .   adcxq   %r8, %r9
[1,5]     .D==========eER.    .    .    .    .    .    .    .    .    . .   adoxq   %r10, %r11

which shows only one instruction executing per cycle. Why?


By now it seems clear that there's some bug in llvm-mca:

Index     0123456789 

[0,0]     DeER .    .   adcxq   %rax, %rbx
[0,1]     D=eER.    .   adoxq   %rcx, %rdx
[0,2]     D==eER    .   adcxq   %rsp, %rbp
[0,3]     D===eER   .   adoxq   %rsi, %rdi
[0,4]     .D===eER  .   adcxq   %r8, %r9
[0,5]     .D====eER .   adoxq   %r10, %r11
[0,6]     .DeE----R .   decq    %r15
[0,7]     .D=eE---R .   jne z
[1,0]     . DeE---R .   adcxq   %rax, %rbx
[1,1]     . D=eE--R .   adoxq   %rcx, %rdx
[1,2]     . D==eE-R .   adcxq   %rsp, %rbp
[1,3]     . D===eER .   adoxq   %rsi, %rdi
[1,4]     .  D===eER.   adcxq   %r8, %r9
[1,5]     .  D====eER   adoxq   %r10, %r11
[1,6]     .  DeE----R   decq    %r15
[1,7]     .  D=eE---R   jne z

After the decq, [1,0] adcxq is claimed to execute on cycle 3, even though it depends on the CF result that [0,4] adcxq only produces on cycle 5. (adoxq, on the other hand, legitimately can execute early, because decq really does write OF.) This looks like a separate issue, since the same thing also happens with an inc inside an adcq chain. The LLVM community confirmed that "We only have an EFLAGS register modeled." and that fixing that should fix both.
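
A minimal reproducer sketch of that second symptom (my own reconstruction, not taken from the LLVM thread): on real hardware inc and dec leave CF untouched, so the second adc below still depends on the CF produced by the first one, but a model with only a single monolithic EFLAGS register treats the incq as breaking that chain:

adcq    %rbx, %rax      # reads and writes CF
incq    %rcx            # writes SF/ZF/AF/PF/OF but preserves CF
adcq    %rdx, %rax      # still depends on CF from the first adcq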

l4m2
  • Probably a bug in LLVM-MCA, maybe it doesn't realize that CF is renamed separately from the SPAZO group of flags and can stay separate, and that ADCX and ADOX dependency chains truly can be independent. – Peter Cordes May 02 '23 at 02:27
  • @PeterCordes https://www.agner.org/optimize/instruction_tables.pdf claims 1 for SkylakeX and doesn't list it for Skylake. But I don't know if that's because they tested all adcx rather than alternating adcx/adox – l4m2 May 02 '23 at 06:52
  • https://uops.info/ shows experimental proof of their 0.5c throughput, for example https://uops.info/html-tp/SKL/ADOX_R64_R64-Measurements.html has the instruction sequences they used for microbenchmarking on SKL to observe 0.53 `adox` instructions execute per cycle. They used `test reg,reg` as a dependency breaker between `adox` instructions. (Over-zealously, out-of-order exec means one every few instructions is fine.) Agner Fog typically doesn't consider dep-breaking instructions when reporting throughput, only the throughput of that instruction on its own. – Peter Cordes May 02 '23 at 12:29
  • That test alone doesn't *prove* that ADCX and ADOX can interleave with each other when you don't have dep-breaking instructions, since `adc` can also run in 0.5c that way as well. But given the known facts of how partial-flags works on Broadwell and later (never needing a merging uop, uops that need both parts of FLAGS just having 2 separate input operands for them, hence `cmovbe` being slow), there's no reason to expect a problem. – Peter Cordes May 02 '23 at 12:37
  • Just to be sure, I tested your block of asm on my own SKL (with a `dec %r15d` / `jnz loop` at the bottom), and found it ran at 2.1 IPC, or 1.8 ADCX/ADOX per clock. (The loop branch competes for port 6 with ADCX/ADOX, so that's 13 uops for p06 per 6.684 cycles, averaging 1.94 uops per cycle for those ports, close to the theoretical max of 2. And definitively proving that it's not limited to 1, not false dependency between dep chains.) – Peter Cordes May 02 '23 at 12:39
  • @PeterCordes I ran `L0:adcxq %rax, %rbx;adoxq %rcx, %rdx;adcxq %rsp, %rbp;adoxq %rsi, %rdi;adcxq %r8, %r9;adoxq %r10, %r11;adcxq %r12, %r13;adoxq %r14, %r15;jmp L0;` and killed it manually. IPC peaks at 2.04, but what's the extra limitation? Otherwise it would be 2.25 – l4m2 May 02 '23 at 15:58
  • Imperfect scheduling of those dependency chains (through CF and SPAZO) will occasionally lose a cycle. So that accounts for less than 2 uops per clock cycle. (`jmp` competes for port 6, same as a macro-fused `dec/jnz` uop, so loop overhead takes some throughput away from ADCX / ADOX. IDK how you got more than 2 IPC for your loop when they all compete for ports 0 and 6. Were you maybe timing with RDTSC and looking at reference cycles, not core clock cycles? I used `perf stat` to count the `cycles` even for a static executable) – Peter Cordes May 02 '23 at 17:11
  • @PeterCordes I use `perf stat` but I'm on Ryzen – l4m2 May 02 '23 at 18:21
  • Ok, you should have said you were testing on a completely different microarchitecture than the one you were asking about for the accuracy of LLVM-MCA. https://uops.info/ shows ADCX/ADOX having 2/clock throughput on Zen as well, but doesn't show port breakdowns. Assuming your perf results are accurate, that would indicate that things are different from Intel, that taken branches don't compete for one of the two ports that can run ADOX / ADCX, otherwise you couldn't get above 2 IPC (with a jmp, rather than dec/jnz). – Peter Cordes May 02 '23 at 19:40
  • Just for the record, I only now realized that `dec` breaks the dependency chain through OF. Oops. :P `lea` / `jrcxz` and the `loop` instruction are both slow on Intel. `loop` is fast on AMD. So your `jmp` was a smart way to do it. On my Skylake, I get 1.95 IPC with loop using `jmp`. Out of a theoretical max 2.0 IPC because JMP competes with ADO/CX for back-end ports. – Peter Cordes May 03 '23 at 03:16
  • @PeterCordes Can `adx` fuse with `jcc` so every uop is doing `adx`? – l4m2 May 03 '23 at 03:16
  • No, `uops_issued.any` matches `instructions` with `adcx` / `jnz` at the bottom of a loop. I didn't expect it would; only instructions that are a lot like `test` and `cmp` can macro-fuse. (AND, and ADD/SUB, not ones like ADC that need to work differently, e.g. having 3 inputs.) [x86\_64 - Assembly - loop conditions and out of order](https://stackoverflow.com/q/31771526) has a table. – Peter Cordes May 03 '23 at 03:22
  • @PeterCordes The time-travelling carry flag seems to show that llvm-mca treats RFLAGS as a whole, which explains both results in the question – l4m2 May 03 '23 at 03:40
  • And it thinks `dec` writes the whole RFLAGS, breaking the dependency on CF? If so, it probably thinks `dec` is dep-breaking between `add %al, %al` and `setc %al` or something. Most code doesn't use CF across inc/dec, so it's certainly possible that this bug / oversimplification in LLVM-MCA exists, because it wouldn't stop it from making mostly-correct analysis for most cases. – Peter Cordes May 03 '23 at 03:44
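
For reference, a standalone version of the loop l4m2 describes in the comments above, as a GNU as source file (the _start scaffolding, file name, and build/run commands are my assumptions, not from the post). It spins forever on a CPU with the ADX extension and is interrupted manually while perf stat counts cycles and instructions:

# adx_loop.s
# build: as adx_loop.s -o adx_loop.o && ld adx_loop.o -o adx_loop
# run:   perf stat ./adx_loop      (stop with Ctrl-C after a few seconds)
.globl _start
_start:
L0:
    adcxq   %rax, %rbx      # CF chain
    adoxq   %rcx, %rdx      # OF chain
    adcxq   %rsp, %rbp
    adoxq   %rsi, %rdi
    adcxq   %r8,  %r9
    adoxq   %r10, %r11
    adcxq   %r12, %r13
    adoxq   %r14, %r15
    jmp     L0              # no dec/jnz, so no flag-writing loop overhead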
