You definitely do not need a 256-bit accumulator. The sum of N integers can produce at most N carry-outs, so a 64-bit overflow counter is sufficient for a dot product of vectors that are 2^64 elements long. i.e. the total width of your sum of 128-bit products only needs to be 192 = 64*3 bits (one extra register), or 160 = 128 + 32 bits if 2^32 elements is enough.
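Concretely: each 64x64-bit product is at most (2^64 - 1)^2 < 2^128, so a sum of up to 2^64 of them is strictly below 2^64 * 2^128 = 2^192, which fits in 128 + 64 = 192 bits.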
Yes, the optimal x86-64 asm for this is a mov load from one array, mul or mulx with a memory source from the other array, then add + adc into ans, and adc reg, 0 to accumulate the carry-out.
(Perhaps with some loop unrolling, possibly with 2 or more accumulators (sets of 3 registers). If unrolling for Intel CPUs, probably avoiding an indexed addressing mode for the mul memory source, so it can micro-fuse the load. GCC / clang unfortunately don't make loops that index one array relative to the other to minimize loop overhead (1 pointer increment) while also avoiding an indexed addressing mode for one of the arrays, so you aren't going to get optimal asm from a compiler. But clang is pretty good.)
clang9.0 generates that for your code with -O3 -march=broadwell -fno-unroll-loops. Clang recognizes the usual a<b idiom for carry-out of unsigned a+=b even with 128-bit integers, leading to add reg,reg / adc reg,reg / adc reg,0.
(Unfortunately clang's loop unrolling de-optimizes to mov / setc / movzx / add instead of adc reg,0, defeating the benefit of loop unrolling! That's a missed-optimization bug that should get reported.)
GCC actually branches on your if(), which you can fix by writing it branchlessly as overflow += sum<prod;. But that doesn't help with GCC's other major missed optimization: actually doing a 128-bit compare with cmp/sbb for sum < prod instead of just using the CF output of the last adc. GCC knows how to optimize that for uint64_t (Efficient 128-bit addition using carry flag) but apparently not for carry-out from __uint128_t.
It's probably not possible to hand-hold gcc into avoiding that missed optimization by changing your source; it's a bug that GCC developers will have to fix in GCC itself. (So you should report it as a missed-optimization bug on their bugzilla https://gcc.gnu.org/bugzilla/; include a link to this Q&A.) Trying to go full-manual with uint64_t chunks yourself would be even worse: the middle chunk has carry-in and carry-out. That's hard to write correctly in C, and GCC won't optimize it to adc.
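For illustration, here's a sketch of what that manual-chunk version looks like, with the 192-bit accumulator split into three uint64_t pieces (the function and variable names are made up for the example). The middle chunk is the part that's easy to get wrong, and per the above, GCC still won't turn it into a clean add/adc/adc chain anyway:

#include <stdint.h>

// Sketch only: 192-bit accumulator as three explicit uint64_t chunks.
// The middle chunk has both a carry-in (from the low chunk) and a carry-out
// (into the overflow counter).
static inline void add_prod_192(uint64_t *lo, uint64_t *hi, uint64_t *overflow,
                                unsigned __int128 prod)
{
    uint64_t p_lo = (uint64_t)prod;
    uint64_t p_hi = (uint64_t)(prod >> 64);
    uint64_t c  = __builtin_add_overflow(*lo, p_lo, lo);   // carry-out of the low chunk
    uint64_t c2 = __builtin_add_overflow(*hi, p_hi, hi);   // carry-out of the middle chunk
    c2 += __builtin_add_overflow(*hi, c, hi);              // plus the carry-in; at most one of these two can carry
    *overflow += c2;
}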
Even using overflow += __builtin_add_overflow(sum, prod, &sum); doesn't fully help, giving us the same cmp/sbb/adc GCC code-gen! https://gcc.gnu.org/onlinedocs/gcc/Integer-Overflow-Builtins.html says "The compiler will attempt to use hardware instructions to implement these built-in functions where possible, like conditional jump on overflow after addition, conditional jump on carry etc." but apparently it just doesn't know how for 128-bit integers.
#include <stdint.h>

static const int N = 2048;
using u128 = unsigned __int128;

u128 dp(uint64_t a[], uint64_t b[], uint64_t *overflow_out)
{
    u128 sum = 0;
    uint64_t overflow = 0;
    for (int i = 0; i < N; ++i) {
        u128 prod = a[i] * (unsigned __int128)b[i];
        //sum += prod;
        //overflow += sum<prod;   // gcc uses adc after a 3x mov / cmp / sbb; clang is still good
        overflow += __builtin_add_overflow(sum, prod, &sum);  // gcc less bad, clang still good
    }
    *overflow_out = overflow;
    return sum;
}
Clang compiles this nicely (Godbolt):
# clang9.0 -O3 -march=broadwell -fno-unroll-loops -Wall -Wextra
    ... zero some regs, copy RDX to R8 (because MUL uses that implicitly)
.LBB0_1:                               # =>This Inner Loop Header: Depth=1
    mov     rax, qword ptr [rsi + 8*rcx]
    mul     qword ptr [rdi + 8*rcx]
    add     r10, rax
    adc     r9, rdx                    # sum += prod
    adc     r11, 0                     # overflow += carry_out
    inc     rcx
    cmp     rcx, 2048
    jne     .LBB0_1                    # }while(i < N);
    mov     qword ptr [r8], r11        # store the carry count
    mov     rax, r10
    mov     rdx, r9                    # return sum in RDX:RAX
    ret
Note that you don't need ADOX / ADCX or MULX to make this efficient. Out-of-order exec can interleave multiple short FLAGS dependency chains.
You could use another 3 registers for another 192-bit accumulator (sum and carryout), but that's probably unnecessary.
This looks like 9 uops for the front-end (assuming mul can't keep an indexed addressing mode micro-fused (unlamination), but that cmp/jcc will macro-fuse), so it will run at best 1 multiply per 2.25 clock cycles (9 uops / 4-wide issue). Haswell and earlier run adc reg,reg as 2 uops, but Broadwell improved that 3-input instruction (FLAGS + 2 regs) to 1 uop. adc reg,0 is special-cased as 1 uop on SnB-family.
The loop-carried dependency chains are only 1 cycle long each: add into r10, adc into r9, and adc into r11. The FLAGS input for those ADC instructions is part of a short non-loop-carried dependency chain that out-of-order execution will eat for breakfast.
Ice Lake with its 5-wide pipeline will run this somewhat faster, like maybe 1.8 cycles per iteration, assuming it still unlaminates mul's memory operand.
Zen has a pipeline that's 5 instructions wide, or 6 uops wide if it includes any multi-uop instructions. So it might run this at 2 cycles per iteration, bottlenecked on its 2c throughput for full multiply. (Normal imul r64, r64 non-widening multiply is 1/clock, 3c latency on Zen, same as Intel.)
Zen 2 speeds up mul m64 and mulx to 1/clock (https://www.uops.info/html-tp/ZEN2/MUL_M64-Measurements.html), and might be the fastest CPU for this loop on a clock-for-clock basis.
With some unrolling and indexing one array relative to the other, a hand-optimized version of this could approach a cost per multiply of 6 uops = mov-load (1 uop) + mul mem (2 uops) + add + 2x adc (3 uops), so on Broadwell about 1.5 cycles / multiply.
It's still going to bottleneck on the front-end, and minimum loop overhead of 2 uops (pointer increment, and cmp/jcc) means an unroll by 4 could give us 4 multiplies per 6.5 cycles instead of per 6 cycles, or 92% of the fully-unrolled theoretical max. (Assuming no stalls for memory or sub-optimal scheduling, or at least that out-of-order exec is sufficient to absorb that and not stall the front-end. The back-end should be able to keep up with the front-end here, on Haswell and later, so the ROB should have room to absorb some minor bubbles, but probably not L3 or TLB misses.)
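If you want to experiment with that from the C++ side, a manually unrolled version with two independent (sum, overflow) accumulator pairs looks something like this sketch (reusing N and u128 from the code above; it assumes N is even). It gives the compiler independent dependency chains, though as noted it won't get you the relative-indexing trick or guaranteed adc code-gen:

// Sketch: 2x unroll with two independent 192-bit accumulators, merged at the end.
u128 dp_unrolled2(uint64_t a[], uint64_t b[], uint64_t *overflow_out)
{
    u128 sum0 = 0, sum1 = 0;
    uint64_t ovf0 = 0, ovf1 = 0;
    for (int i = 0; i < N; i += 2) {
        u128 p0 = a[i]   * (u128)b[i];
        u128 p1 = a[i+1] * (u128)b[i+1];
        ovf0 += __builtin_add_overflow(sum0, p0, &sum0);
        ovf1 += __builtin_add_overflow(sum1, p1, &sum1);
    }
    u128 sum;
    uint64_t ovf = ovf0 + ovf1;
    ovf += __builtin_add_overflow(sum0, sum1, &sum);   // merging the two sums can carry too
    *overflow_out = ovf;
    return sum;
}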
I don't think there's any benefit to using SIMD here; the widest element size for adds is 64-bit, and even AVX512 only has 64x64 => 64-bit multiply. Unless you can convert to double, in which case AVX512 could run this very much faster.
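If rounded results are acceptable, the source side of that is just a plain FP dot product. A sketch (my example; assuming something like -O3 -march=skylake-avx512 plus -ffast-math so the compiler is allowed to reassociate the FP sum and vectorize the reduction):

// Sketch: dot product in double. Converting uint64_t to double loses precision
// for values above 2^53, and the FP sum is rounded, so this is only valid if
// approximate results are OK.
double dp_double(uint64_t a[], uint64_t b[])
{
    double sum = 0.0;
    for (int i = 0; i < N; ++i)
        sum += (double)a[i] * (double)b[i];
    return sum;
}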