Old question, but complementing Nils Pipenbrinck's answer and responding to Goswin von Brederlow's request for an example of usage:
Consider the following C function:
// assuming e.g.  typedef unsigned __int128 uint128;  on GCC/Clang
void sum(uint128 a[2], uint128 b[2], uint128 sum[3])
{
    uint128 a_sum = a[0] + a[1];
    uint128 b_sum = b[0] + b[1];
    sum[0] = a_sum;
    sum[1] = b_sum;
    sum[2] = a_sum + b_sum;
}
The compiler might compile this function as:
sum:
; Windows x64 calling convention:
; uint128 *a in RCX, *b in RDX, *sum in R8
push rsi ; non-volatile registers
push rdi
; clear CF and OF flags, carry-in = 0 to start both chains
xor eax,eax ; before loading data into RAX
; Or less efficient but doesn't destroy a register: test eax, eax
; sub rsp, 40 would also leave OF and CF=0
;; load all the data first for example purposes only,
;; into human-friendly pairs of registers like RDX:RCX
mov rax,r8 ; sum = rax
; r11:r10 = b[1] ;; standard notation is hi:lo
mov r11,[rdx+24] ; high half
mov r10,[rdx+16] ; low half
; r9:r8 = b[0]
mov r9,[rdx+8]
mov r8,[rdx]
; rdi:rsi = a[1]
mov rdi,[rcx+24]
mov rsi,[rcx+16]
; rdx:rcx = a[0] (overwriting the input pointers)
mov rdx,[rcx+8]
mov rcx,[rcx]
;;;;;; Now the ADCX/ADOX part
; compute a_sum in rdx:rcx
; compute b_sum in r9:r8
; Lower halves
; CF=0 and OF=0 thanks to an earlier instruction
adcx rcx, rsi ; a_sum.lo = a[0].lo + a[1].lo + 0
adox r8, r10 ; b_sum.lo = b[0].lo + b[1].lo + 0
; Higher halves
; CF and OF have carry-out from low halves
adcx rdx,rdi ; a_sum.hi = a[0].hi + a[1].hi + carry-in(CF)
adox r9,r11 ; b_sum.hi = b[0].hi + b[1].hi + carry-in(OF)
; nothing uses the CF and OF outputs, but that's fine.
; sum[0] = rdx:rcx
mov [rax],rcx
mov [rax+8],rdx
; sum[1] = r9:r8
mov [rax+16],r8
mov [rax+24],r9
;; Final addition the old fashioned way
; sum[2] = rdx:rcx (a_sum + b_sum)
add rcx,r8
adc rdx,r9
mov [rax+32],rcx
mov [rax+40],rdx
; clean up
pop rdi
pop rsi
ret
This isn't an optimization by itself; out-of-order execution can overlap one add/adc chain with the other, since both chains here are very short. The benefit comes with multiple very large integers to add: interleaving the work helps when the dependency chains are 20 or 40 or more operations long, a significant fraction of the CPU's out-of-order scheduler size (the number of not-yet-executed uops it can look ahead into for independent work).
If you were optimizing, you'd want memory source operands like adcx r10, [rcx] / adox r11, [rdx] so you could use fewer registers (not needing to save/restore RSI and RDI) and fewer instructions (avoiding some separate mov loads). So you'd probably have a load (and maybe a store) mixed in between each ADCX and ADOX instruction. We avoid that here only for illustration purposes, so uint128 values are in "obvious" pairs like r9:r8 instead of rax:r9 or something.
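For example, the low-half step could use memory sources like this (a sketch, with arbitrary register choices; RCX = a and RDX = b as in the full listing):

```
; assumes CF=OF=0 on entry, as before; mov doesn't touch flags
mov  r10, [rcx]        ; a[0].lo
adcx r10, [rcx+16]     ; a_sum.lo = a[0].lo + a[1].lo  (CF chain)
mov  r11, [rdx]        ; b[0].lo
adox r11, [rdx+16]     ; b_sum.lo = b[0].lo + b[1].lo  (OF chain)
```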
Unless the data was coming from other operations, notably mulx in a BigInt multiply, which is one of the intended use-cases for the ADX instructions. There, you're generating numbers that need to get added as you do the multiplies, and being able to interleave two addition chains can sometimes avoid having to save more numbers for later, do an adc reg, 0 to apply a carry and end a chain for now, or materialize a carry-out as a 0/1 integer with setc al. (Intel whitepaper: New Instructions Supporting Large Integer Arithmetic.)
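The inner loop of such a multiply tends to have roughly this shape (a hypothetical sketch, not the whitepaper's exact code; here RDX holds the current limb of one operand, RSI points at the other operand, RDI at the partial product, and R10 carries the high half from the previous step):

```
; hypothetical sketch of one step of a BigInt multiply inner loop
mulx r9, r8, [rsi]     ; r9:r8 = rdx * [rsi]; MULX doesn't touch flags
adox r8, r10           ; add previous step's high half  (OF chain)
adcx r8, [rdi]         ; add into the partial product   (CF chain)
mov  [rdi], r8
mov  r10, r9           ; this high half feeds the next step
```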