
ADC on Haswell and earlier is normally 2 uops, with 2 cycle latency, because Intel uops traditionally could only have 2 inputs (https://agner.org/optimize/). Broadwell / Skylake and later have single-uop ADC/SBB/CMOV, after Haswell introduced 3-input uops for FMA and micro-fusion of indexed addressing modes in some cases.

(But BDW/SKL still uses 2 uops for the adc al, imm8 short-form encoding, or the other al/ax/eax/rax, imm8/16/32/32 short forms with no ModRM. More details in my answer.)
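(For concreteness, here's how the short forms compare to the ModRM forms; the bytes are what NASM emits by default, and the uop counts are the BDW/SKL behavior described above:)

    adc   al, 1        ; 14 01               short form, no ModRM: 2 uops
    adc   bl, 1        ; 80 D3 01            ModRM form: 1 uop
    adc   eax, 12345   ; 15 39 30 00 00      short form: 2 uops
    adc   ecx, 12345   ; 81 D1 39 30 00 00   ModRM form: 1 uop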

But adc with immediate 0 is special-cased on Haswell to decode as only a single uop. @BeeOnRope tested this, and included a check for this performance quirk in his uarch-bench: https://github.com/travisdowns/uarch-bench. Sample output from CI on a Haswell server shows a difference between adc reg,0 and adc reg,1 or adc reg,zeroed-reg.

(But only for 32 or 64-bit operand-size, not adc bl,0. So use 32-bit when using adc on a setcc result to combine 2 conditions into one branch.)
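(A sketch of that setcc trick, with made-up conditions just to show the pattern:)

    xor   eax, eax    ; zero EAX so the 8-bit setcc result is zero-extended
    cmp   edi, 5
    setl  al          ; AL = (EDI < 5), first condition
    cmp   esi, 10     ; CF = (ESI < 10) unsigned, second condition
    adc   eax, 0      ; EAX += CF; 32-bit operand-size so it's 1 uop on Haswell
    cmp   eax, 2
    je    .both_true  ; single branch on both conditions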

Same for SBB. As far as I've seen, there's never any difference between ADC and SBB performance on any CPU, for the equivalent encoding with the same immediate value.


When was this optimization for imm=0 introduced?

I tested on Core 2¹, and found that adc eax,0 latency is 2 cycles, same as adc eax,3. The cycle count is also identical for a few variations of throughput tests with 0 vs. 3, so first-gen Core 2 (Conroe/Merom) doesn't do this optimization.

The easiest way to answer this is probably to use my test program below on a Sandybridge system, and see if adc eax,0 is faster than adc eax,1. But answers based on reliable documentation would be fine, too.


Footnote 1: I used this test program on my Core 2 E6600 (Conroe / Merom), running Linux.

;; NASM / YASM
;; assemble / link this into a 32 or 64-bit static executable.

global _start
_start:
mov     ebp, 100000000

align 32
.loop:

    xor  ebx,ebx  ; avoid partial-flag stall but don't break the eax dependency
%rep 5
    adc    eax, 0   ; should decode in a 2+1+1+1 pattern
    add    eax, 0
    add    eax, 0
    add    eax, 0
%endrep

    dec ebp       ; I could have just used SUB here to avoid a partial-flag stall
    jg .loop


%ifidn __OUTPUT_FORMAT__, elf32
   ;; 32-bit sys_exit would work in 64-bit executables on most systems, but not all.  Some, notably Windows Subsystem for Linux, disable IA32 compat
    mov eax,1
    xor ebx,ebx
    int 0x80     ; sys_exit(0) 32-bit ABI
%else
    xor edi,edi
    mov eax,231   ; __NR_exit_group  from /usr/include/asm/unistd_64.h
    syscall       ; sys_exit_group(0)
%endif

Linux perf doesn't work very well on old CPUs like Core 2 (it doesn't know how to access all the events like uops), but it does know how to read the HW counters for cycles and instructions. That's sufficient.

I built and profiled this with

    yasm -felf64 -gdwarf2 testloop.asm
    ld -o testloop-adc+3xadd-eax,imm=0 testloop.o

    # optional: taskset pins it to core 1 to avoid CPU migrations
    taskset -c 1 perf stat -e task-clock,context-switches,cycles,instructions ./testloop-adc+3xadd-eax,imm=0

 Performance counter stats for './testloop-adc+3xadd-eax,imm=0':

       1061.697759      task-clock (msec)         #    0.992 CPUs utilized          
               100      context-switches          #    0.094 K/sec                  
     2,545,252,377      cycles                    #    2.397 GHz                    
     2,301,845,298      instructions              #    0.90  insns per cycle        

       1.069743469 seconds time elapsed

0.9 IPC is the interesting number here.

This is about what we'd expect from static analysis with a 2 uop / 2c latency adc: (5*(1+3) + 3) = 23 instructions in the loop, 5*(2+3) = 25 cycles of latency = cycles per loop iteration. 23/25 = 0.92.

It's 1.15 on Skylake. (5*(1+3) + 3) / (5*(1+3)) = 1.15, i.e. the extra .15 is from the xor-zero and dec/jg while the adc/add chain runs at exactly 1 uop per clock, bottlenecked on latency. We'd expect this 1.15 overall IPC on any other uarch with single-cycle latency adc, too, because the front-end isn't a bottleneck. (In-order Atom and P5 Pentium would be slightly lower, but xor and dec can pair with adc or add on P5.)

On SKL, uops_issued.any = instructions = 2.303G, confirming that adc is single uop (which it always is on SKL, regardless of what value the immediate has). By chance, jg is the first instruction in a new cache line so it doesn't macro-fuse with dec on SKL. With dec rbp or sub ebp,1 instead, uops_issued.any is the expected 2.2G.

This is extremely repeatable: perf stat -r5 (to run it 5 times and show average + variance), and multiple runs of that, showed the cycle count was repeatable to 1 part in 1000. 1c vs. 2c latency in adc would make a much bigger difference than that.
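(For reference, the repeat-run measurement is just:)

    taskset -c 1 perf stat -r5 -e task-clock,context-switches,cycles,instructions ./testloop-adc+3xadd-eax,imm=0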

Rebuilding the executable with an immediate other than 0 doesn't change the timing at all on Core 2, another strong sign that there's no special case. That's definitely worth testing.


I was initially looking at throughput (with xor eax,eax before each loop iteration, letting OoO exec overlap iterations), but it was hard to rule out front-end effects. I think I finally did avoid a front-end bottleneck by adding single-uop add instructions. The throughput-test version of the inner loop looks like this:

    xor  eax,eax  ; break the eax and CF dependency
%rep 5
    adc    eax, 0   ; should decode in a 2+1+1+1 pattern
    add    ebx, 0
    add    ecx, 0
    add    edx, 0
%endrep

That's why the latency-test version looks kinda weird. But anyway, remember that Core2 doesn't have a decoded-uop cache, and its loop buffer is in the pre-decode stage (after finding instruction boundaries). Only 1 of the 4 decoders can decode multi-uop instructions, so adc being multi-uop bottlenecks on the front-end. I guess I could have just let that happen, with times 5 adc eax, 0, since it's unlikely that some later stage of the pipeline would be able to throw out that uop without executing it.
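(A sketch of that simpler, decode-bound version:)

    align 32
    .loop:
        xor   eax, eax       ; break the EAX and CF dependency each iteration
        times 5 adc eax, 0   ; every instruction is 2 uops on Core 2, so only
                             ; the complex decoder can take them: decode
                             ; limits this to 1 adc (2 uops) per clock
        dec   ebp
        jg    .loop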

Nehalem's loop buffer recycles decoded uops, and would avoid that decode bottleneck for back-to-back multi-uop instructions.

Peter Cordes
  • How is this asking for a tool or documentation? This isn't documented anywhere, AFAIK. If you count Intel "publishing" the hardware itself, then any performance question is off topic. I *wish* this was documented in Agner Fog's microarch guide, but it isn't. That's why I'm asking. Would whoever downvoted be happier if I asked "how many uops in `adc eax,0` on Nehalem, SnB, and IvB?" Because that's the same question, and it's a request for a fact, not for documentation explaining it. – Peter Cordes Aug 03 '18 at 05:36
  • Hmm. I've got an Ivy Bridge (i7-3630QM). However, it's running that *other* operating system. Fiddling with your code, I was able to get it to run on Windows, and I saw a clear difference between `adc eax, 0` and `adc eax, 1` (the zero running much faster). However, running that same code on my Kaby Lake box (i7-7700K), I see no difference at all. I'm trying to figure out if that means the `adc eax, 0` got slower, the `adc eax, 1` got faster, or my code is just mucked up. Is this what I should expect to see? – David Wohlferd Aug 03 '18 at 08:08
  • @DavidWohlferd: Thanks! We already know that Broadwell / Skylake (including Kaby Lake which is the same uarch as SKL with physical improvements only) always run `adc r,imm` as a single uop, so no special case is needed. So it's definitely that the `adc eax,1` got faster, along with `adc eax,ebx` and `adc eax,[rsi]`. But not `adc [rdi], eax`; that's still a lot of uops because of [surprising microarchitectural reasons](https://stackoverflow.com/questions/17395557/observing-stale-instruction-fetching-on-x86-with-self-modifying-code#comment68191840_18388700): intra-instruction TLB consistency. – Peter Cordes Aug 03 '18 at 08:15
  • Turns out I've also got a Nehalem (i7-820QM). I'm not seeing any difference here either. – David Wohlferd Aug 03 '18 at 08:57
  • @PeterCordes congrats for hitting 100k reputation!! <3 – Tommylee2k Aug 03 '18 at 09:32
  • @DavidWohlferd: if you really want to confirm that `adc` on Nehalem is always slow rather than always fast, compare it against `add eax,1`. (Don't feel like you need to test that, though; if there's no difference between `adc eax,1` and `0` then we know it matches what Agner Fog found, and the well-known fact that `adc` in general is 2 uops on P6-family because of needing 3 inputs.) – Peter Cordes Aug 03 '18 at 13:14
  • sbb with imm=0 is special-cased in the same way. – Andreas Abel Oct 09 '18 at 16:22

2 Answers


According to my microbenchmarks, the results of which can be found on uops.info, this optimization was introduced with Sandy Bridge (https://www.uops.info/html-tp/SNB/ADC_R64_0-Measurements.html). Westmere does not do this optimization (https://uops.info/html-tp/WSM/ADC_R64_0-Measurements.html). The data was obtained using a Core i7-2600, and a Core i5-650.

Furthermore, the data on uops.info shows that the optimization is not performed if an 8-bit register is used (Sandy Bridge, Ivy Bridge, Haswell).
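(Concretely:)

    adc   bl, 0    ; 80 D3 00: 8-bit operand-size, still 2 uops on SnB/IvB/HSW
    adc   ebx, 0   ; 83 D3 00: 1 uop on SnB and later, thanks to the imm=0 special case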

Andreas Abel
  • Since you have access to a first-gen SnB, maybe you can clear up the mystery in [Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/q/39311872). A 4 uop loop can issue at 1 per clock, but I found on SnB that a 7 uop loop can only run at 1 per 2 clocks, not ~1.75, at least when there's unlamination. But I didn't do more detailed tests and no longer have access to a SnB, so we don't know if SnB's loop buffer "unrolls" 5 to 7 uop loops to run them faster than 1 per 2 clocks like HSW does. – Peter Cordes Oct 06 '18 at 14:11
  • @PeterCordes - I was thinking about this recently, and it occurs to me that the behavior for very low uops (< 10) might be explained by the rule where apparently the "normal" taken branch throughput is only 1 per 2 cycles, and only "very small" loops can access a special behavior which allows 1 per cycle. So 7 uops (instructions?) might just be the point where the "very small" condition is violated. Maybe the "very small" thing is not even measured in uops or instructions, in instruction size or uop cache placement or something else, but still stops working at 7 for that test. – BeeOnRope Oct 10 '18 at 03:45
  • @BeeOnRope: I actually had the same thought the other day when writing that comment, that maybe taken-branch throughput became an issue somehow. – Peter Cordes Oct 10 '18 at 03:53

It's not present on Nehalem, but is on IvyBridge. So it was new either in Sandybridge or IvB.

My guess is Sandybridge, because that was a major redesign of the decoders (producing up to 4 total uops, rather than patterns like 4+1+1+1 that were possible in Core2 / Nehalem), and it added hanging on to instructions that can macro-fuse (like add or sub) if they're the last in a group, in case the next instruction is a jcc.

Significantly for this, I think SnB decoders also look at the imm8 in immediate-count shifts to check if it's zero, instead of only doing that in the execution units².

Hard data so far:

  • Broadwell and later (and AMD, and Silvermont/KNL) don't need this optimization, adc r,imm and adc r,r are always 1 uop. Except for the AL/AX/EAX/RAX imm no-modrm short form¹ being 2 uops before Alder Lake.
  • Haswell does this optimization: adc reg,0 is 1 uop, adc reg,1 is 2. For 32 and 64-bit operand-size, not 8-bit.
  • IvyBridge i7-3630QM does this optimization (thanks @DavidWohlferd).
  • Sandybridge ???
  • Nehalem i7-820QM does not, adc is slower than add regardless of the imm.
  • Core 2 E6600 (Conroe/Merom) doesn't either.
  • Safe to assume Pentium M and earlier don't.

Footnote 1: On Skylake, the al/ax/eax/rax, imm8/16/32/32 short-form encodings with no ModR/M byte still decode to 2 uops, even when the immediate is zero. For example, adc eax, strict dword 0 (15 00 00 00 00) is twice as slow as 83 d0 00. Both uops are on the critical path for latency.

Looks like Intel forgot to update the decoding for the other immediate forms of adc and sbb! (This all applies equally to both ADC and SBB.) They finally fixed this in Alder Lake P-cores (Golden Cove); https://uops.info/ tests adc AL,0 and adc AL, I8 separately from adc R8l, 0 and adc R8l, I8; same for r32. Intel mainstream CPUs before Ice/Tiger/Rocket Lake (including P6 family and Sandybridge) run adc al, 0 as 2 uops. (Low-power CPUs like Silvermont family run it as 1 uop.)

Assemblers will use the short-form by default for immediates that don't fit in an imm8, so for example adc rax, 12345 assembles to 48 15 39 30 00 00 instead of the one-byte larger single-uop form that is the only option for registers other than the accumulator.

A loop that bottlenecks on adc rcx, 12345 instead of RAX latency runs twice as fast. But adc rax, 123 is unaffected, because it uses the adc r/m64, imm8 encoding which is single uop.
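(A sketch of a loop that demonstrates this, in the style of the test program from the question; swapping RAX for RCX should halve the cycle count on SKL:)

    mov     ebp, 100000000
    .loop:
    %rep 10
        adc   rax, 12345   ; 48 15 39 30 00 00: short form, 2 uops on SKL,
                           ; both on the critical path => 2c per adc.
                           ; (with RCX: 48 81 D1 39 30 00 00, 1 uop, 1c)
    %endrep
        sub   ebp, 1       ; loop counter (SUB writes all flags, including CF)
        jg    .loop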


Footnote 2: See INC instruction vs ADD 1: Does it matter? for quotes from Intel's optimization manual about Core2 stalling the front-end if a later instruction reads flags from a shl r/m32, imm8, in case the imm8 was 0. (As opposed to the implicit-1 opcode, which the decoder knows always writes flags.)

But SnB-family doesn't do that; the decoder apparently checks the imm8 to see whether the instruction writes flags unconditionally or whether it leaves them untouched. So checking an imm8 is something that SnB decoders already do, and could usefully do for adc to omit the uop that adds that input, leaving only adding CF to the destination.
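(To illustrate the three shift encodings the decoders have to tell apart:)

    shl   eax, 1    ; D1 E0: implicit-count opcode, always writes flags
    shl   eax, 2    ; C1 E0 02: imm8-count form with a non-zero count
    shl   eax, 0    ; C1 E0 00: count of 0 leaves all flags untouched.
                    ; Core 2 stalls the front-end if a later instruction
                    ; reads flags after any C1-form shift, just in case;
                    ; SnB's decoders look at the imm8 instead.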

Peter Cordes
  • "adc r,imm" actually is not always a 1-μop instr. on Broadwell and later: the "adc (AL|*AX), imm" special cases have two μops (see, e.g., http://uops.info/html-tp/SKL/ADC-2068-Measurements.html). IACA is also wrong about this: It claims that all "adc R8, imm" (not just the AL special case) have two μops (http://uops.info/html-tp/SKL/ADC-2043-IACA3.0.html). – Andreas Abel Oct 09 '18 at 15:34
  • @AndreasAbel: wow, that's weird. It applies to the EAX and RAX form, too, without a `66` prefix. Updated my answer. – Peter Cordes Oct 09 '18 at 15:53
  • It also applies to the same forms of sbb. – Andreas Abel Oct 09 '18 at 16:11
  • IACA also does not know about the 2 μops of the EAX and RAX cases: http://uops.info/html-tp/SKL/ADC-2069-IACA3.0.html, http://uops.info/html-tp/SKL/ADC-2070-IACA3.0.html – Andreas Abel Oct 09 '18 at 16:19
  • @AndreasAbel: good point, I should at least mention SBB. I assumed it was the same, but good to mention it explicitly. And yeah, IACA has lots of bugs like that. It also doesn't know about Haswell's un-lamination rules that allow keeping indexed addressing modes micro-fused in read-modify cases either, like `xor eax, [rsi + rdi*4]`. Maybe it will get updated at some point; Intel did update their optimization manual soon after I commented on https://bugs.llvm.org/show_bug.cgi?id=36180, so maybe that info will find its way into IACA. – Peter Cordes Oct 09 '18 at 16:25
  • My feeling about IACA was that Intel should open source it, because improvements are very slow in coming when they can only come from the "inside", and the combined knowledge of various interested parties seems larger than what is embedded in IACA; it seems that people would be willing to update it. Now, however, we have [OSACA](https://github.com/RRZE-HPC/OSACA) from the maker of likwid (so you know it will be quality software). I'm just going to use and recommend that going forward over IACA, assuming the authors are willing to accept PRs for stuff like this. – BeeOnRope Oct 10 '18 at 02:29
  • @BeeOnRope: Yeah, open-source IACA would be a big improvement. The only excuse for it being closed source would be if it really modelled the pipeline accurately enough for Intel to be worried about trade secrets or something. But it definitely doesn't do that. I'll have to check out OSACA sometime, thanks for the link. – Peter Cordes Oct 10 '18 at 02:34
  • @PeterCordes - I haven't used it a lot, but it seems basically like an open source IACA in terms of its key functionality. It supports the IACA byte markings even. – BeeOnRope Oct 10 '18 at 02:35
  • @AndreasAbel - really interesting find about eax forms of `adc` and `sbb`. I added it to my list of [Intel Perf Quirks](https://github.com/travisdowns/uarch-bench/wiki/Intel-Performance-Quirks#short-form-adc-and-sbb-using-the-accumulator-rax-eax-ax-al-are-two-uops-on-broadwell-and-skylake). BTW never saw `uops.info` until now. Looks awesome! I didn't fully understand why this 2-uop "bug" doesn't bite for the `imm8` immediates usually. Is the `eax` special case not shorter in that case? – BeeOnRope Oct 10 '18 at 02:40
  • @BeeOnRope: `adc eax, imm32` is 5 bytes. `adc r/m32, imm8` is 3 bytes, so `adc eax, -128..127` will use the latter encoding with any decent assembler. The short-form encodings only save the ModRM byte, not enough to make up for the 3-byte diff between imm8 and imm32. I knew Intel sometimes let the `rep movs` microcode get out of date (suboptimal) on new uarches, but forgetting to update the hardwired decoding for some forms of an insn on Broadwell/Skylake seems really weird. I did check and `add bl, 0` is single-uop on SKL, as is `adc ecx, 12345`. – Peter Cordes Oct 10 '18 at 02:47
  • I see, the short form encodings always have `imm32`, not any smaller immediate. – BeeOnRope Oct 10 '18 at 02:54
  • @BeeOnRope: They always have an immediate of the same width as the register (except for rax). That's why recent edits about this say "al/ax/eax/rax, imm8/16/32/32". Maybe I should add "respectively" to those already-cluttered sentences. – Peter Cordes Oct 10 '18 at 02:56