Changing `add` to `adc` in the highlighted line below significantly improves performance. I find this very counter-intuitive, since `add` can execute on more ports and does not depend on flags.
CPU: Intel i7-9750H (Coffee Lake).

- `UOPS_ISSUED.ANY` for `add` = ~2.87 uops/cycle.
- `UOPS_ISSUED.ANY` for `adc` = ~3.47 uops/cycle.

Retire slots are 98.5% of uops issued in both cases. And it's reflected in benchmark times: the `add` version is a lot slower.
I would appreciate it if someone could help me understand why `add` is slower. I can provide more metrics; I just don't know what to look for.
```asm
# Code to multiply a large integer by a qword.
# RSI = input large integer (qword array).
# RDI = output large integer (qword array).
# RDX = qword to multiply the large integer by.
# ECX = number of 32-byte blocks to process (i.e. qword count / 4).
# RAX = carry
    xor     eax, eax
.p2align 4
L_mul_word_loop:
    mulx    r9, r8, [rsi]
    add     r8, rax            ### <-- "adc r8, rax" (+20-30% performance)
    mov     [rdi], r8
    mulx    rax, r8, [rsi+8]   # rax:r8 = RDX * [rsi+8]
    adc     r9, r8             # add high half of last mulx to low half of this one
    mov     [rdi+8], r9        # and store.
    mulx    r9, r8, [rsi+16]
    adc     r8, rax
    mov     [rdi+16], r8
    mulx    rax, r8, [rsi+24]
    adc     r9, r8
    mov     [rdi+24], r9
    adc     rax, 0             # fold CF into high half of last mulx: can't wrap
    add     rdi, 32
    add     rsi, 32
    dec     ecx
    jnz     L_mul_word_loop
```
The loop-carried dependency chain is through CF between the `add` -> `adc` instructions, and through RAX from the bottom of the loop back to the top. (The largest possible product has a high half less than 0xFF..., so the final `adc rax, 0` can't wrap and produce its own carry-out; e.g. 0xFF * 0xFF = 0xFE01.) This dependency chain is 5 cycles long.
(Experimentally, see comments: using `lea` instead of `add` for the pointer increments and removing the `adc rax, 0` to shorten the chain to 4 cycles isn't faster than this version using `adc` at the top of the loop.)
Minimal reproducible code: add-adc.asm

Full code on GitHub (only tested on Mac; an older version worked on Ubuntu, although I haven't checked in a while). I'm observing this effect with this test:

```
build/benchmark_asm_x86_64 --benchmark_filter=mul_word/100 --benchmark_repetitions=3 --benchmark_report_aggregates_only=true
```
**Update**

I added a minimal reproducible example link (a single ASM file) and tried to gather per-iteration metrics. Note that the outer-loop iteration count is much higher than the inner-loop count (I want to keep the arrays small enough for L1 cache); hopefully that does not skew the numbers too much. Here's what I got:
`add` per inner-loop iteration:

- ~6.88 cycles
- ~20.00 `uops_issued.any`
- ~4.83 `uops_dispatched_port.port1`

`adc` per inner-loop iteration:

- ~6.12 cycles
- ~20.11 `uops_issued.any`
- ~4.34 `uops_dispatched_port.port1`