You definitely do not need a 256-bit accumulator. The sum of N integers can produce at most N carry-outs, so a 64-bit overflow counter is sufficient for a dot product of vectors that are 2^64 elements long. i.e. the total width of your sum of 128-bit products only needs to be 192 = 64*3 bits (one extra register), or 160 = 128 + 32 bits if 2^32 elements is enough.
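Concretely: each 64x64-bit product is at most (2^64 - 1)^2 < 2^128, so a sum of up to 2^64 of them is strictly below 2^64 * 2^128 = 2^192, which fits in 128 + 64 = 192 bits.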
Yes, the optimal x86-64 asm for this is a mov load from one array, mul or mulx with a memory source from the other array, then add + adc into ans, and adc reg, 0 to accumulate the carry-out.
(Perhaps with some loop unrolling, possibly with 2 or more accumulators (sets of 3 registers). If unrolling for Intel CPUs, probably avoiding an indexed addressing mode for the mul memory source, so it can micro-fuse the load. GCC / clang unfortunately don't make loops that index one array relative to the other to minimize loop overhead (1 pointer increment) while also avoiding an indexed addressing mode for one of the arrays, so you aren't going to get optimal asm from a compiler. But clang is pretty good.)
clang9.0 generates that for your code with -O3 -march=broadwell -fno-unroll-loops. Clang recognizes the usual a<b idiom for carry-out of unsigned a+=b even with 128-bit integers, leading to add reg,reg / adc reg,reg / adc reg,0.
(Unfortunately clang's loop unrolling de-optimizes to mov / setc / movzx / add instead of adc reg,0, defeating the benefit of loop unrolling! That's a missed-optimization bug that should get reported.)
GCC actually branches on your if(), which you can fix by writing it branchlessly as overflow += sum<prod;. But that doesn't help with GCC's other major missed optimization: actually doing a 128-bit compare with cmp/sbb for sum < prod instead of just using the CF output of the last adc. GCC knows how to optimize that for uint64_t (Efficient 128-bit addition using carry flag) but apparently not for carry-out from __uint128_t.
It's probably not possible to hand-hold gcc into avoiding that missed optimization by changing your source; it's a bug that GCC developers will have to fix in GCC itself. (So you should report it as a missed-optimization bug on their bugzilla https://gcc.gnu.org/bugzilla/; include a link to this Q&A.) Trying to go full-manual with uint64_t chunks yourself would be even worse: the middle chunk has carry-in and carry-out. That's hard to write correctly in C, and GCC won't optimize it to adc.
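For illustration, here's a sketch of what that manual-chunk version looks like, with the 192-bit accumulator split into three uint64_t pieces (the function and variable names are made up for the example). The middle chunk is the part that's easy to get wrong, and per the above, GCC still won't turn it into a clean add/adc/adc chain anyway:

#include <stdint.h>

// Sketch only: 192-bit accumulator as three explicit uint64_t chunks.
// The middle chunk has both a carry-in (from the low chunk) and a carry-out
// (into the overflow counter).
static inline void add_prod_192(uint64_t *lo, uint64_t *hi, uint64_t *overflow,
                                unsigned __int128 prod)
{
    uint64_t p_lo = (uint64_t)prod;
    uint64_t p_hi = (uint64_t)(prod >> 64);
    uint64_t c  = __builtin_add_overflow(*lo, p_lo, lo);   // carry-out of the low chunk
    uint64_t c2 = __builtin_add_overflow(*hi, p_hi, hi);   // carry-out of the middle chunk
    c2 += __builtin_add_overflow(*hi, c, hi);              // plus the carry-in; at most one of these two can carry
    *overflow += c2;
}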
Even using overflow += __builtin_add_overflow(sum, prod, &sum); doesn't fully help, giving us the same cmp/sbb/adc GCC code-gen! https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html says "The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc." but apparently it just doesn't know how for 128-bit integers.
#include <stdint.h>

static const int N = 2048;
using u128 = unsigned __int128;

u128 dp(uint64_t a[], uint64_t b[], uint64_t *overflow_out)
{
    u128 sum = 0;
    uint64_t overflow = 0;
    for (int i = 0; i < N; ++i) {
        u128 prod = a[i] * (unsigned __int128)b[i];
        //sum += prod;
        //overflow += sum<prod;   // gcc uses adc after a 3x mov / cmp / sbb; clang is still good
        overflow += __builtin_add_overflow(sum, prod, &sum);  // gcc less bad, clang still good
    }
    *overflow_out = overflow;
    return sum;
}
Clang compiles this nicely (Godbolt):
# clang9.0 -O3 -march=broadwell -fno-unroll-loops -Wall -Wextra
    ... zero some regs, copy RDX to R8 (because MUL uses that implicitly)
.LBB0_1:                               # =>This Inner Loop Header: Depth=1
    mov     rax, qword ptr [rsi + 8*rcx]
    mul     qword ptr [rdi + 8*rcx]
    add     r10, rax
    adc     r9, rdx                    # sum += prod
    adc     r11, 0                     # overflow += carry_out
    inc     rcx
    cmp     rcx, 2048
    jne     .LBB0_1                    # }while(i < N);
    mov     qword ptr [r8], r11        # store the carry count
    mov     rax, r10
    mov     rdx, r9                    # return sum in RDX:RAX
    ret
Note that you don't need ADOX / ADCX or MULX to make this efficient. Out-of-order exec can interleave multiple short FLAGS dependency chains.
You could use another 3 registers for another 192-bit accumulator (sum and carryout), but that's probably unnecessary.
This looks like 9 uops for the front-end (assuming mul can't keep an indexed addressing mode micro-fused (unlamination), but that cmp/jcc will macro-fuse), so it will run at best 1 multiply per 2.25 clock cycles (9 uops / 4-wide issue). Haswell and earlier run adc reg,reg as 2 uops, but Broadwell improved that 3-input instruction (FLAGS + 2 regs) to 1 uop. adc reg,0 is special-cased as 1 uop on SnB-family.
The loop-carried dependency chains are only 1 cycle long each: add into r10, adc into r9, and adc into r11. The FLAGS input for those ADC instructions is part of a short non-loop-carried dependency chain that out-of-order execution will eat for breakfast.
Ice Lake with its 5-wide pipeline will run this somewhat faster, like maybe 1.8 cycles per iteration, assuming it still unlaminates mul's memory operand.
Zen has a pipeline that's 5 instructions wide, or 6 uops wide if it includes any multi-uop instructions. So it might run this at 2 cycles per iteration, bottlenecked on its 2c throughput for full multiply. (Normal imul r64, r64 non-widening multiply is 1/clock, 3c latency on Zen, same as Intel.)
Zen 2 speeds up mul m64 and mulx to 1/clock (https://www.uops.info/html-tp/ZEN2/MUL_M64-Measurements.html), and might be the fastest CPU for this loop on a clock-for-clock basis.
With some unrolling and indexing one array relative to the other, a hand-optimized version of this could approach a cost per multiply of 6 uops = mov-load (1 uop) + mul mem (2 uops) + add + 2x adc (3 uops), so on Broadwell about 1.5 cycles / multiply.
It's still going to bottleneck on the front-end, and minimum loop overhead of 2 uops (pointer increment, and cmp/jcc) means an unroll by 4 could give us 4 multiplies per 6.5 cycles instead of per 6 cycles, or 92% of the fully-unrolled theoretical max. (Assuming no stalls for memory or sub-optimal scheduling, or at least that out-of-order exec is sufficient to absorb that and not stall the front-end. The back-end should be able to keep up with the front-end here, on Haswell and later, so the ROB should have room to absorb some minor bubbles, but probably not L3 or TLB misses.)
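If you want to experiment with that from the C++ side, a manually unrolled version with two independent (sum, overflow) accumulator pairs looks something like this sketch (reusing N and u128 from the code above; it assumes N is even). It gives the compiler independent dependency chains, though as noted it won't get you the relative-indexing trick or guaranteed adc code-gen:

// Sketch: 2x unroll with two independent 192-bit accumulators, merged at the end.
u128 dp_unrolled2(uint64_t a[], uint64_t b[], uint64_t *overflow_out)
{
    u128 sum0 = 0, sum1 = 0;
    uint64_t ovf0 = 0, ovf1 = 0;
    for (int i = 0; i < N; i += 2) {
        u128 p0 = a[i]   * (u128)b[i];
        u128 p1 = a[i+1] * (u128)b[i+1];
        ovf0 += __builtin_add_overflow(sum0, p0, &sum0);
        ovf1 += __builtin_add_overflow(sum1, p1, &sum1);
    }
    u128 sum;
    uint64_t ovf = ovf0 + ovf1;
    ovf += __builtin_add_overflow(sum0, sum1, &sum);   // merging the two sums can carry too
    *overflow_out = ovf;
    return sum;
}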
I don't think there's any benefit to using SIMD here; the widest element size for adds is 64-bit, and even AVX512 only has 64x64 => 64-bit multiply. Unless you can convert to double, in which case AVX512 could run this very much faster.
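If rounded results are acceptable, the source side of that is just a plain FP dot product. A sketch (my example; assuming something like -O3 -march=skylake-avx512 plus -ffast-math so the compiler is allowed to reassociate the FP sum and vectorize the reduction):

// Sketch: dot product in double. Converting uint64_t to double loses precision
// for values above 2^53, and the FP sum is rounded, so this is only valid if
// approximate results are OK.
double dp_double(uint64_t a[], uint64_t b[])
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += (double)a[i] * (double)b[i];
    return sum;
}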