
Changing add to adc on the highlighted line below significantly improves performance. I find this very counter-intuitive, since add can execute on more ports and does not depend on flags.

CPU: Intel i7-9750H (Coffee Lake).
UOPS_ISSUED.ANY for add = ~2.87 uops/cycle.
UOPS_ISSUED.ANY for adc = ~3.47 uops/cycle.
Retire slots is 98.5% of uops issued in both cases.

And it's reflected in the benchmark times: the add version is a lot slower.

I would appreciate it if someone could help me understand why add is slower. I can provide more metrics; I just don't know what to look for.

# Code to multiply large integer by a qword.
# RSI = input large integer (qword array).
# RDI = output large integer (qword array).
# RDX = qword to multiply the large integer by.
# ECX = number of 32-byte blocks to process (i.e. qword count / 4).
# RAX = carry

xor eax, eax

.p2align 4
L_mul_word_loop:
  mulx r9, r8, [rsi]
  add r8, rax         ###  <-- "adc r8, rax" (+20-30% performance)
  mov [rdi], r8

  mulx rax, r8, [rsi+8]      # rax:r8 = RDX * [rsi+8]
  adc r9, r8                 # add high-half of last mulx to low half of this
  mov [rdi+8], r9            # and store.

  mulx r9, r8, [rsi+16]
  adc r8, rax
  mov [rdi+16], r8

  mulx rax, r8, [rsi+24]
  adc r9, r8
  mov [rdi+24], r9

  adc rax, 0                # include CF into high half of last mulx: can't wrap
  add rdi, 32               # the pointer increments clobber CF (hence the adc rax, 0 above)
  add rsi, 32

  dec ecx
  jnz L_mul_word_loop

The loop-carried dependency chain is through CF between the add -> adc instructions, and in RAX from the bottom of the loop back to the top. (The largest possible product has a high half less than 0xFF..., so the final adc rax, 0 can't wrap and produce its own carry-out. e.g. 0xFF * 0xFF = 0xFE01). This dependency chain is 5 cycles long.
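
To make the 5-cycle figure concrete, this is the carried chain as I read it from the loop above (each link is a single-uop, 1-cycle ALU instruction; the mulx results feed in from the side and are not part of the carried chain):

  # rax (from the previous iteration's adc rax, 0)
  #   -> add r8, rax     # writes CF
  #   -> adc r9, r8      # reads CF, writes CF
  #   -> adc r8, rax     # reads CF, writes CF
  #   -> adc r9, r8      # reads CF, writes CF
  #   -> adc rax, 0      # reads CF, writes RAX for the next iteration
  # = 5 cycles of latency per iteration through flags/RAX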

(Experimentally (see comments), using lea instead of add for the pointer increments and removing the adc rax, 0 to shorten the chain to 4 cycles isn't faster than this version using adc at the top of the loop.)
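
For reference, here is a sketch of that lea variant, reconstructed from the description above and the comments (not the exact code that was benchmarked): the flag-preserving lea/dec keep CF live across the loop boundary, so the adc at the top consumes the previous iteration's carry directly and the trailing adc rax, 0 moves out of the loop.

xor eax, eax                 # clears RAX and CF before the first iteration

.p2align 4
L_mul_word_loop_lea:
  mulx r9, r8, [rsi]
  adc r8, rax                # consumes CF carried over from the previous iteration
  mov [rdi], r8

  mulx rax, r8, [rsi+8]
  adc r9, r8
  mov [rdi+8], r9

  mulx r9, r8, [rsi+16]
  adc r8, rax
  mov [rdi+16], r8

  mulx rax, r8, [rsi+24]
  adc r9, r8
  mov [rdi+24], r9

  lea rdi, [rdi+32]          # lea doesn't write flags, so CF survives
  lea rsi, [rsi+32]
  dec ecx                    # dec leaves CF untouched
  jnz L_mul_word_loop_lea

  adc rax, 0                 # fold the final carry into RAX once, after the loop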


Minimal reproducible code: add-adc.asm

Full code on GitHub (only tested on Mac; older version was working on Ubuntu, although I haven't checked in a while). I'm observing this effect on this test:

build/benchmark_asm_x86_64 --benchmark_filter=mul_word/100 --benchmark_repetitions=3 --benchmark_report_aggregates_only=true

Update

I added a minimal reproducible example link (a single ASM file) and tried to gather per-iteration metrics. Note that the outer-loop iteration count is much higher than the inner-loop count (I want to keep the arrays small enough for L1 cache); hopefully that does not skew the numbers too much. Here's what I got:

add per inner loop iteration:
~ 6.88 cycles
~ 20.00 uops_issued.any
~ 4.83 uops_dispatched_port.port1

adc per inner loop iteration:
~ 6.12 cycles
~ 20.11 uops_issued.any
~ 4.34 uops_dispatched_port.port1

  • Perhaps `add` is sometimes getting scheduled to port 1, delaying the throughput critical path? That would be odd, usually uop scheduling is pretty good, but maybe with all those other `adc` instructions the queues for p0 and p6 get long enough? This [mcve] isn't really complete enough to copy/paste and try locally; if you have a more complete version I might try it locally on my Skylake. – Peter Cordes Jan 24 '21 at 20:50
  • Separately from explaining this perf effect, have you tried using `lea rdi, [rdi+32]` / `lea rsi, [rsi+32]` so you don't have to save/restore the carry around the loop for this 1 x n limb multiply? [That's good on SnB-family, bad on P6-family (partial flag stall)](https://stackoverflow.com/questions/32084204/problems-with-adc-sbb-and-inc-dec-in-tight-loops-on-some-cpus). Then you could "justify" using `adc` there. – Peter Cordes Jan 24 '21 at 20:56
  • @PeterCordes yep, warm-up ruled out. Separate binaries. Double-checked code alignment (it's the same), checked L1 cache misses (zero misses for both; RSI and RDI are only 3 kilobytes long for this test). Added a GitHub link for the code (see bottom of the post). Yeah, I actually had it preserve CF with lea initially (which also allows removing "adc rax, 0" from the loop body), but it was a bit slower that way. Thank you for suggestions and looking into it! Very curious what your results would be. – stepan Jan 24 '21 at 21:22
  • LEA runs on p15, so possibly it was stealing port 1 cycles from MULX, like I'm guessing `add` might be. Might be interesting to try an instruction that can *only* run on port 1 so will always steal cycles from mulx, like `imul r8, rax` (although that has 3 cycle latency, like every other integer instruction that can only run on p1 on SnB-family). Perhaps interesting to drop the dependency on R8 to see what happens, like `lea r8, [rax+rax*2+1]` or `imul r8, rax, 7`. Or maybe try using more registers and schedule a couple mulx earlier, before a couple adc/store. – Peter Cordes Jan 24 '21 at 22:54
  • @PeterCordes I added more details at the bottom of the post. `lea r8, [rax+rax*2+1]` makes it faster, and even if I keep the r8 dependency (`lea r8, [r8+rax*2+1]`) it's still just as fast (faster than the adc/add versions). Tried rearranging `mulx` in front (2 mulx, then adds/stores): add is still slower than adc, and the original order with `adc` is a little faster than both rearranged versions. – stepan Jan 25 '21 at 08:25
  • Huh, ok, opposite of what I guessed might happen. Too bad there isn't a 1-cycle-latency uop that can only run on port 1; would be a useful experiment. Perhaps there is something to do with add being able to compete for port 1, but maybe it's not stealing throughput from mulx, maybe it's mulx delaying critical-path latency? Or even a write-back conflict where 2 uops would try to produce a result at the same time on the same port, because of different latencies? (Although I thought the scheduler in the RS prevented that, and that's why SnB standardized uop latencies, why there are no 2c uops) – Peter Cordes Jan 25 '21 at 08:52
  • There is a carried dep chain in this loop, isn't there? From `rax` at the bottom and then through `r8` mostly in the rest of the loop. Maybe that's the limiting factor? It would be good to see "cycles per iteration" timings. Also, it is strange to report the port pressure metrics "per cycle" - since one is faster than the other, it skews things. You'd rather something which does vary with perf, like per iteration, per instruction, whatever. – BeeOnRope Jan 25 '21 at 09:23
  • @BeeOnRope correct, `rax` is a dependency between loop iterations. Can't get rid of it, it's a large integer we're multiplying so we need the carry to get a correct result. I updated the post to include per-loop-iteration metrics, and also added a link to a single "asm" file executable, hopefully you can compile it now @PeterCordes. – stepan Jan 25 '21 at 10:29
  • I guess the answer is as @Peter alluded to above: "scheduling". The way uops are scheduled on Intel CPUs is [opaque](https://stackoverflow.com/questions/40681331/how-are-x86-uops-scheduled-exactly) and not particularly sophisticated (based on my observations). Any time a critical (or "near" critical) dependency chain exists, and there are other uops outside the chain competing for the same ports, it is common to see less than ideal throughput as non-chain ops steal ports from the critical chain. – BeeOnRope Jan 30 '21 at 22:18
  • In that scenario, any kind of random change like swapping `add` for `adc` can improve or worsen the performance in random(ish) ways (because scheduling is opaque). In this case, it is possible that the `add/adc` is on the critical path, and restricting it to fewer ports actually happens to help its scheduling, or maybe it is off the critical path and the port restriction prevents it from stealing from the critical path, or perhaps some other "butterfly flaps its wings" effect. – BeeOnRope Jan 30 '21 at 22:20
  • BTW, I had missed the fact that this was one long dep chain, not independent 256-bit multiplies started by an add, in my first few comments. I deleted a couple which weren't relevant, but the first one about port scheduling still has that misconception of more ILP. Yeah, when you're right on the cusp of a latency bottleneck, as @Bee said any quirk of scheduling can easily steal cycles from it in a way that means it can't just catch up later. So it's basically black magic, no way I'm aware of to predict what will let the CPU settle into an optimal scheduling pattern vs. not. – Peter Cordes Jan 30 '21 at 22:46
  • @PeterCordes something seems fishy about the timings: isn't there at least a 9 cycle dependency chain through `rax` and `r8`, passing through the 1st and 3rd `mulx`? Then the updated timings of < 7 cycles seem "too good" unless the trip count for the loop is quite short and so multiple independent copies of the loop can be executing at once. – BeeOnRope Jan 30 '21 at 23:05
  • @BeeOnRope: All the `mulx` instructions are independent: they read RDX (implicit) and rsi->memory, and write the 2 explicit destination operands. I updated the question with comments and a paragraph about the dep chain so future readers don't have to sort through it. – Peter Cordes Jan 30 '21 at 23:29
  • @PeterCordes - oh lol, serves me right for not looking up the `mulx` syntax. I thought it was `dest, src1, src2`. – BeeOnRope Jan 31 '21 at 02:13
  • @BeeOnRope: Yeah, it's unique in having 2 explicit outputs unlike usual VEX, and only makes sense once you think about use-cases like this for why it's designed that way. I've had to look it up every time I've had a few months to forget about it. – Peter Cordes Jan 31 '21 at 02:43
  • On Ice Lake I was able to reorganize the code a tiny bit and get better performance with `add` than `adc`: https://onlinegdb.com/7yYbRIFci. The reorganization was totally meaningless. I'm buying @BeeOnRope's explanation that it's some opaque scheduling thing. – Noah Feb 14 '21 at 01:35
  • @Noah thanks! yep, your code is able to do 6.06 cycle/iter on my Coffee Lake (same as ADC). – stepan Feb 14 '21 at 12:46

1 Answer

I think this is caused by suboptimal scheduling, taking an extra hit from writeback conflict penalties.

Comparing uops.info measurements for 64-bit mulx against variants of imul, we can reasonably conclude that mulx consists of one port 1 uop with latency 3 producing the low 64 bits of the result, plus one port 5 uop producing the high 64 bits one cycle later.
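
As an illustration of that assumed decomposition (the port assignments and latencies are the ones inferred above from uops.info, not anything documented by Intel):

    mulx r9, r8, [rsi]    # one load uop (p2/p3), plus two ALU uops:
                          #   p1 uop: r8 = low 64 bits of the product, 3-cycle latency
                          #   p5 uop: r9 = high 64 bits, ready one cycle later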

Consider what happens when a 3-cycle uop is issued on port 1 at cycle i and a 1-cycle uop is pending. The 1-cycle uop cannot be issued at cycle i (the port cannot accept a second uop on the same cycle). It also cannot be issued at cycle i+2: that would cause a writeback conflict at cycle i+3 on that port, which cannot produce results for two uops on the same cycle.
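
A hypothetical port-1 timeline makes the constraint visible (cycle numbers are purely illustrative):

    # cycle i   : mulx low-half uop issues on p1; its result writes back at i+3
    # cycle i+1 : a pending 1-cycle uop may issue on p1 (writes back at i+2, no conflict)
    # cycle i+2 : the 1-cycle uop may NOT issue on p1: it would write back at i+3,
    #             colliding with the mulx result on the same port
    # cycle i+3 : p1's writeback slot is occupied by the mulx result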

By constructing a loop with forced writeback conflicts, Fabian Giesen demonstrated there's apparently an extra penalty when that happens.

Hence, if you have a mix of single-cycle and multi-cycle uops competing for the same port, it's a bit of a landmine. My initial attempt at solving that was to add a fifth mulx instruction to the loop (whose outputs would simply be discarded), so that there's constant pressure of multi-cycle uops on p1 and p5 and the adds would be scheduled elsewhere. That worked to some degree, but it's possible to do even better!
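
For concreteness, that first attempt would look roughly like the sketch below (the scratch registers r10/r11 and the exact placement of the dummy mulx are my own choices for illustration):

    .p2align 4
    L_mul_word_loop:
      mulx r11, r10, [rsi]     # dummy multiply: r10/r11 are never read; it only keeps
                               # a steady stream of multi-cycle uops on p1 and p5
      mulx r9, r8, [rsi]
      add r8, rax
      mov [rdi], r8
      # ... the rest of the loop body is unchanged from the question ...
      dec ecx
      jnz L_mul_word_loop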

The following measurements are from Skylake (i7-6700).

The baseline version runs at ~6.9 cycles per iteration:

    10,337,227,178      cycles:u
     4,283,434,673       uops_dispatched_port.port_0:u
     7,278,477,707       uops_dispatched_port.port_1:u
     6,849,168,521       uops_dispatched_port.port_5:u
     5,609,252,055       uops_dispatched_port.port_6:u

    10,384,026,044      cycles:u
     3,922,820,702       uops_dispatched_port.port_2:u
     4,024,756,546       uops_dispatched_port.port_3:u
     6,388,517,494       uops_dispatched_port.port_4:u
     4,087,151,242       uops_dispatched_port.port_7:u

Note that the number of store-data uops on port 4 is more than 5% higher than it should be. Some stores are being replayed, and this will have an increasingly confounding effect on the following measurements.

The variant with adc in place of the initial add runs at 6 cycles per iteration:

     8,877,983,794      cycles:u
     5,181,755,017       uops_dispatched_port.port_0:u
     6,525,640,349       uops_dispatched_port.port_1:u
     6,019,129,311       uops_dispatched_port.port_5:u
     6,295,528,774       uops_dispatched_port.port_6:u

     9,040,426,883      cycles:u
     3,762,317,354       uops_dispatched_port.port_2:u
     3,814,931,097       uops_dispatched_port.port_3:u
     7,292,924,631       uops_dispatched_port.port_4:u
     4,462,674,038       uops_dispatched_port.port_7:u

(note the even higher count on port 4)

Now let's swap the increments of rsi and rdi (rsi is needed earlier, by the next iteration's loads). This brings a very noticeable improvement, to 5.6 cycles per iteration.
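
The change affects only the tail of the loop; a sketch of what I believe it looks like (shown on top of the adc variant from the previous step):

    adc rax, 0
    add rsi, 32          # increment rsi first: the next iteration's loads depend on it
    add rdi, 32          # rdi isn't needed again until the first store
    dec ecx
    jnz L_mul_word_loop

The counters: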

     8,376,301,855      cycles:u
     5,129,834,339       uops_dispatched_port.port_0:u
     6,632,790,174       uops_dispatched_port.port_1:u
     6,088,383,045       uops_dispatched_port.port_5:u
     6,173,097,806       uops_dispatched_port.port_6:u

     8,404,972,940      cycles:u
     4,287,284,508       uops_dispatched_port.port_2:u
     4,317,891,165       uops_dispatched_port.port_3:u
     7,408,432,079       uops_dispatched_port.port_4:u
     3,429,913,047       uops_dispatched_port.port_7:u

(port 4 count goes higher yet again)

Now, if we think that single-cycle instructions issuing on ports 1 and 5 are causing grief to the scheduler, can we figure out a way to force them onto other ports somehow? Yes! A macro-fused, predicted-not-taken add/jcc pair can issue only on ports 0 and 6. So let's add jz .+2 after add rsi, 32 (note: the Skylake JCC erratum could bite me here, but luckily our loop starts at 16 bytes mod 32, so we avoid the problematic crossing).
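
In the loop, that should look roughly like this sketch (the jz simply falls through to the next instruction; it is there only so the add macro-fuses with it and the fused pair is restricted to ports 0 and 6):

    adc rax, 0
    add rsi, 32
    jz .+2               # predicted not-taken; macro-fuses with the add above
    add rdi, 32
    dec ecx
    jnz L_mul_word_loop

With that change, the counters are: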

     8,339,935,074      cycles:u
     4,749,784,934       uops_dispatched_port.port_0:u
     6,429,206,233       uops_dispatched_port.port_1:u
     6,045,479,355       uops_dispatched_port.port_5:u
     6,798,134,405       uops_dispatched_port.port_6:u

     8,330,518,755      cycles:u
     4,250,791,971       uops_dispatched_port.port_2:u
     4,284,637,776       uops_dispatched_port.port_3:u
     7,379,587,531       uops_dispatched_port.port_4:u
     3,503,715,123       uops_dispatched_port.port_7:u

Oh no, our cycles barely budged! What if we finally bite the bullet and nop out the first store (nop dword ptr [rdi] in place of mov [rdi], r8):

     7,912,840,444      cycles:u
     4,648,297,975       uops_dispatched_port.port_0:u
     6,334,248,230       uops_dispatched_port.port_1:u
     6,059,133,113       uops_dispatched_port.port_5:u
     6,982,987,722       uops_dispatched_port.port_6:u

     7,843,194,695      cycles:u
     3,810,009,654       uops_dispatched_port.port_2:u
     3,790,523,332       uops_dispatched_port.port_3:u
     6,094,690,751       uops_dispatched_port.port_4:u
     2,938,959,057       uops_dispatched_port.port_7:u

That's 5.25 cycles per iteration. Applying the same jz .+2 trick to add rdi, 32, we get:

     7,630,926,523      cycles:u
     5,831,880,767       uops_dispatched_port.port_0:u
     6,004,354,504       uops_dispatched_port.port_1:u
     6,002,011,214       uops_dispatched_port.port_5:u
     6,183,831,282       uops_dispatched_port.port_6:u

     7,638,112,528      cycles:u
     3,751,406,178       uops_dispatched_port.port_2:u
     3,695,775,382       uops_dispatched_port.port_3:u
     5,638,534,896       uops_dispatched_port.port_4:u
     3,091,934,468       uops_dispatched_port.port_7:u

And that's below 5.1 cycles per iteration, where the predicted throughput is 5 cycles per iteration. We see that ports 1 and 5 are occupied just by the mulx instructions. Restoring the nopped-out store, I get 5.36 cycles per iteration:

     8,041,908,278      cycles:u
     5,599,216,700       uops_dispatched_port.port_0:u
     6,010,109,267       uops_dispatched_port.port_1:u
     6,002,448,325       uops_dispatched_port.port_5:u
     6,417,804,937       uops_dispatched_port.port_6:u

     8,050,871,976      cycles:u
     4,183,403,858       uops_dispatched_port.port_2:u
     4,166,022,265       uops_dispatched_port.port_3:u
     7,179,871,612       uops_dispatched_port.port_4:u
     3,695,465,024       uops_dispatched_port.port_7:u

It's unclear to me what's causing the replays.

amonakov