18

I was looking through the Intel software developer manual when I encountered the ADCX instruction, which was previously unknown to me; its encoding is 66 0F 38 F6. It seems to be almost identical to the ADC instruction, so why would you use ADCX when:

  • it is only supported in modern CPUs
  • instruction encoding takes more space (4 bytes versus 1 for ADC)

Is there some other side effect, or special case, where ADCX proves advantageous over ADC? There must have been some good reason why this was added to the instruction repertoire.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
John Källén
  • 7,551
  • 31
  • 64

3 Answers3

27

These instructions are used to speed up large integer arithmetic.

Before these instructions adding large numbers often resulted in a code-sequence that looked like this:

  add  
  adc  
  adc  
  adc  
  adc  

The important part here to note is, that if the result of an addition would not fit into a machine word, the carry flag gets set and is 'carried over' to the next higher machine word. All these instructions depend on each other because they take the previous addition carry flag into account and generate a new carry flag value after execution.

Since x86 processors are capable to execute several instructions at once this became a huge bottleneck. The dependency chain just made it impossible for the processor to execute any of the arithmetic in parallel. (to be correct in practice you'll find a loads and stores between the add/adc sequences, but the performance was still limited by the carry dependency).

To improve upon this Intel added a second carry chain by reinterpreting the overflow flag.

The adc instruction got two news variants: adcx and adox

adcx is the same as adc, except that it does not modify the OF (overflow) flag anymore.

adox is the same as adc, but it stores the carry information in the OF flag. It also does not modify the carry-flag anymore.

As you can see the two new adc variants don't influence each other with regards to the flag usage. This allows you to run two long integer additions in parallel by interleaving the instructions and use adcx for one sequence and adox for the other sequence.

Nils Pipenbrinck
  • 83,631
  • 31
  • 151
  • 221
  • 1
    I've read a lot about these instructions but never understood how they actually benefit performance or what considerations there for their use until this answer. Everything else appears to just say "oh, the old one changes flags; this one doesn't", which made me think it was `rorx` but for `adc` and somehow that was actually a good thing. – jeteon Nov 17 '17 at 19:37
  • 1
    I think you missed a point, naturally, this chain of operations is written in a loop and the increment of the loop variable changes the carry flag. Thus this requires saving the carry flag, incrementing loop counter, setting carry flag, perform `adc,adc,adc...`. – madhur4127 May 03 '20 at 12:57
  • What is missing from this answer is an example how to use `adcx`/`adox`. Specifically how do you start the chains? What do you have to change the initial `add` into? – Goswin von Brederlow Jun 12 '22 at 19:15
  • @madhur4127 Are you saying that you could use `adox` in a loop and avoid having to preserve flags across the loop iterations? – Goswin von Brederlow Jun 12 '22 at 19:17
  • "This allows you to run two long integer additions in parallel by interleaving the instructions and use adcx for one sequence and adox for the other sequence." So we can only speed up addition 2 times? – phqb Mar 28 '23 at 07:55
  • *The dependency chain just made it impossible for the processor to execute any of the arithmetic in parallel.* - out-of-order exec solve that problem for short dep chains like this (320-bit integers on a 64-bit machine with a chain of 5 additions). – Peter Cordes Jun 07 '23 at 13:10
  • 1
    @madhur4127: You have it backwards. `dec ecx / jnz` modifies all FLAGS *except* CF, making it perfect for use with `adc`. (Intel since Broadwell or Skylake doesn't have any partial-flag stalls or merging overhead. Intel before Sandybridge had significant stalls in that case, making it hard to write good `adc` loops. [Problems with ADC/SBB and INC/DEC in tight loops on some CPUs](https://stackoverflow.com/q/32084204) – Peter Cordes Jun 07 '23 at 13:12
12

To quote from the paper New Instructions Support Large Integer Arithmetic by Intel:

The adcx and adox instructions are extensions of the adc instruction, designed to support two separate carry chains. They are defined as:

adcx dest/src1, src2
adox dest/src1, src2

Both instructions compute the sum of src1 and src2 plus a carry-in and generate an output sum dest and a carry-out. The difference between these two instructions is that adcx uses the CF flag for the carry in and carry out (leaving the OF flag unchanged), whereas the adox instruction uses the OF flag for the carry in and carry out (leaving the CF flag unchanged).

The primary advantage of these instructions over adc is that they support two independent carry chains.

The paper shows examples of using it in combination with BMI2 mulx (no-flags widening multiply) for 512-bit BigInt multiply to handle the partial products being generated.

Part of the benefit is not just parallel dependency chains for out-of-order exec, it's being able to use results closer to where they're generated so you don't run out of registers (or need extra mov instructions) to save them for adding later. Or without ADX instructions, sometimes their examples will terminate a chain of add/adc early with adc reg, 0, to pick it back up later after doing a different add.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Jester
  • 56,577
  • 4
  • 81
  • 125
  • The question was about adc/adcx, not between adox/adcx. So far, adcx sounds exactly like adc. – Seva Alekseyev Apr 20 '15 at 12:31
  • 15
    @SevaAlekseyev the classic `adc` modifies all the flags, in particular **both `CF` and `OF`**. The new instructions only modify a single flag: `adcx` only `CF`, adox only `OF`. – Jester Apr 20 '15 at 12:34
2

Old question, but complementing Nils Pipenbrinck's answer and responding to Goswin von Brederlow's request for an example of usage:

Consider the following C function:

void sum(uint128 a[2], uint128 b[2], uint128 sum[3])
{
    uint128 a_sum = a[0] + a[1];
    uint128 b_sum = b[0] + b[1];

    sum[0] = a_sum;
    sum[1] = b_sum;
    sum[2] = a_sum + b_sum;
}

The compiler might compile this function as:

sum:
; Windows x64 calling convention:
; uint128 *a in RCX,  *b in RDX,  *sum in R8

  push    rsi  ; non-volatile registers
  push    rdi

; clear CF and OF flags, carry-in = 0 to start both chains
  xor     eax,eax    ;  before loading data into RAX
;  Or less efficient but doesn't destroy a register:  test eax, eax
;  sub rsp, 40   would also leave OF and CF=0

;; load all the data first for example purposes only,
;; into human-friendly pairs of registers like RDX:RCX
  mov     rax,r8   ; sum = rax

; r11:r10 = b[1]        ;; standard notation is hi:lo
  mov     r11,[rdx+24]  ; high half
  mov     r10,[rdx+16]  ; low half
; r9:r8 = b[0]
  mov     r9,[rdx+8]
  mov     r8,[rdx]
; rdi:rsi = a[1]
  mov     rdi,[rcx+24]
  mov     rsi,[rcx+16]
; rdx:rcx = a[0]        (overwriting the input pointers)
  mov     rdx,[rcx+8]
  mov     rcx,[rcx]

;;;;;; Now the ADCX/ADOX part
; compute a_sum in rdx:rcx
; compute b_sum in r9:r8
; Lower halves
  ; CF=0 and OF=0 thanks to an earlier instruction
  adcx    rcx, rsi     ; a_sum.lo = a[0].lo + a[1].lo + 0
  adox    r8, r10      ; b_sum.lo = b[0].lo + b[1].lo + 0
; Higher halves
  ; CF and OF have carry-out from low halves
  adcx    rdx,rdi      ; a_sum.hi = a[0].hi + a[1].hi + carry-in(CF)
  adox    r9,r11       ; b_sum.hi = b[0].hi + b[1].hi + carry-in(OF)
  ; nothing uses the CF and OF outputs, but that's fine.

; sum[0] = rdx:rcx
  mov     [rax],rcx
  mov     [rax+8],rdx
; sum[1] = r9:r8
  mov     [rax+16],r8
  mov     [rax+24],r9

;; Final addition the old fashioned way
; sum[2] = rdx:rcx (a_sum + b_sum)
  add     rcx,r8
  adc     rdx,r9
  mov     [rax+32],rcx
  mov     [rax+40],rdx

; clean up
  pop     rdi
  pop     rsi
  ret

This isn't an optimization by itself; out-of-order execution can overlap one add/adc with another add/adc since both chains are very short. With multiple very large integers to add, there could be some benefit to interleaving the work with longer dependency chains of 20 or 40 or more operations that are a significant fraction of the CPU's out-of-order scheduler size (number of un-executed uops it can look ahead into for independent work).

If you were optimizing, you'd want memory source operands like adcx r10, [rcx] / adox r11, [rdx] so you could use fewer registers (not needing to save/restore RSI and RDI), and fewer instructions (avoiding some separate mov loads). So you'd probably have a load (and maybe a store) mixed in between each ADCX and ADOX instruction. We avoid that here only for illustration purposes, so uint128 values are in "obvious" pairs like r9:r8 instead of rax:r9 or something.

Unless data was coming from other operations, notably mulx in a BigInt multiply, which is one of the intended use-cases for the ADX instructions. There, you're generating numbers that need to get added as you do the multiplies, and being able to interleave addition chains can sometimes avoid having to save more numbers for later, or do an adc reg, 0 to apply a carry and end a chain for now, or materializing a carry-out as a 0/1 integer with setc al. (Intel whitepaper: New Instructions Supporting Large Integer Arithmetic.)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 2
    `xor eax,eax` is a much easier and more efficient way to clear CF and OF. Also, out-of-order execution can already overlap short dependency chains like 2 instructions long (`add`/`adc` for adding 128-bit integers on a 64-bit machine). So there's no performance benefit for this short example on real CPUs. Starting with `add` means you don't have to clear FLAGS ahead of time, since it only writes them. But yes, if you wanted to start two addition chains with ADCX/ADOX, you couldn't start start the CF chain with ADD since it might also set OF. – Peter Cordes Jun 07 '23 at 13:07
  • @PeterCordes you're completely right about clearing the CF and OF flags with xor. As for the example itself, I just wanted to demonstrate their use in a very basic example; both answers are missing examples. – random-user-3985747 Jun 07 '23 at 14:12
  • Right, it's a good to have an example. But the comments say `The CPU is able to perform "a[0] + a[1]" and "b[0] + b[1]" in parallel` which incorrectly implies that wouldn't be true for trivially short `add/adc` chains of the same length. You'd only start to have a problem if your `adc` chains were maybe 40 instructions long, then being able to interleave them in the source can help. ([Understanding the impact of lfence on a loop with two long dependency chains, for increasing lengths](https://stackoverflow.com/q/51986046)). – Peter Cordes Jun 07 '23 at 14:22
  • Also, the large amount of loading+storing means the actual interesting part is somewhat buried. (You might want to put the XOR-zeroing right before the ADCX/ADOX instructions). Some calling conventions can pass `uint128_t` in a pair of registers and return in RDX:RAX, like https://godbolt.org/z/s1fePojG7 . GCC/clang choose not to use ADCX/ADOX because it wouldn't help, but you could as an example to do `a*=2` in parallel with `b-=3` before adding them together, or whatever ops you think are interesting. – Peter Cordes Jun 07 '23 at 14:24
  • @PeterCordes edited the answer to clear up the flags with xor and added a note to the comment you mentioned. The calling convention in the example is Windows x64 (mentioned in the first comment) and we are also passing arrays, it's going to be by reference anyway. – random-user-3985747 Jun 07 '23 at 15:01
  • Yes, the function you wrote takes 3 pointers, and 7 different `uint128_t` numbers, so there's a lot going on. I intentionally constructed a minimal example that had some parallel work to do on two inputs. (Although in hindsight I should have taken 3 inputs, since `adcx`/`adox` don't have an immediate form.) Your function could use memory source operands for adcx/adox so you'd only have to load half the inputs with separate `mov` instructions. – Peter Cordes Jun 07 '23 at 15:07
  • @PeterCordes I don't use Stackoverflow much. I came across this question a few days ago while reading on these two instructions; since there were no examples around, I decided to provide one, but by all means, if you believe the answer could be improved, edit it to your heart's content. – random-user-3985747 Jun 07 '23 at 15:08
  • Also, `mov rax, r8` isn't useful, you can just use stuff like `mov [r8],rcx` later, leaving the `uint128_t *sum` in R8. The return type is `void` so it doesn't matter what you leave in RAX. (Unlike returning a large struct, where I think Windows x64 handles it like most other conventions: the caller passes a pointer as the first arg, and the function stores the output there as well as returning the pointer in RAX.) – Peter Cordes Jun 07 '23 at 15:08
  • Ok, I'll make an edit to improve what I can without changing the C function. – Peter Cordes Jun 07 '23 at 15:09
  • `Your function could use memory source operands for adcx/adox so you'd only have to load half the inputs with separate mov instructions.` I did that on purpose to make the r64:r64 pairs more distinguishable to the reader (each uint128 is a r64:r64 pair), which is why I moved r8 to rax, even though we don't return anything. – random-user-3985747 Jun 07 '23 at 15:15
  • That makes sense. As soon as I started to edit the way I'd originally been thinking, I realized it would get messy, and we'd have pairs like R10:RAX, and that your way allowed overwriting the input pointers when loading both halves of one input, something that wouldn't work with memory source operands for all four ADCX/ADOX instructions. I did tighten up some instructions (like no point in doing anything with RBP; even MSVC in debug mode doesn't waste instructions on RBP). – Peter Cordes Jun 07 '23 at 16:18
  • Also, the standard notation for a register pair puts the high half first, like a RDX:RAX `mul` result, same place-value order as `0x1234` having the more significant digits first. Anyway, I made an edit, including adding text into the actual answer instead of just saying to see comments. (BTW, having many short dependency chains would *not* be a use-case for this. It's only long dependency chains that makes interleaving useful.) – Peter Cordes Jun 07 '23 at 16:20
  • BTW, actual compiler output for your function: https://godbolt.org/z/efE5bba8j - half the mov loads are gone, folded into source operands for `add`/`adc`. But less readable, even if disentangled and commented. – Peter Cordes Jun 07 '23 at 16:30
  • This one was a bad edit: `adcx rax, rsi`, rcx is the correct register here. And you're right about the notation for pairs, sometimes I write them as a little endian value (I find it easier to avoid mistakes under a debugger). – random-user-3985747 Jun 07 '23 at 16:44
  • Oops, thanks. I'd started to redo register allocation along with other edits, and missed that while reverting after seeing your comment and realizing it was a bad idea. – Peter Cordes Jun 07 '23 at 16:46