7

I'm looking for an SSE instruction which takes two arguments of four 32-bit integers in __m128i, computes the sums of corresponding pairs, and returns the result as two 64-bit integers in __m128i.

Is there an instruction for this?

Deep Learner
  • [Here is a solution for 64bit to 128bit for SSE, SSE+XOP, AVX2, AVX512](http://stackoverflow.com/questions/27923192/practical-bignum-avx-sse-possible/27978043#27978043). – Z boson Nov 12 '15 at 08:02
  • Why do you want to do this? I understand why you would want 64b+64b+carry but not 32b+32b+carry. – Z boson Nov 12 '15 at 08:04

2 Answers

6

There are no SSE operations with carry. The way to do this is to first unpack the 32-bit integers (punpckldq/punpckhdq) against an all-zeroes helper vector, which widens them into four vectors of two 64-bit integers, and then use 64-bit addition (paddq).
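
For illustration, here's a minimal intrinsics sketch of that approach (the function name and signature are mine, not from the answer). It covers the unsigned case, since unpacking against zero is a zero-extension:

#include <emmintrin.h>  // SSE2

// Sketch: zero-extend the 32-bit elements of a and b to 64 bits, then add
// corresponding pairs.  lo/hi receive the sums from the low/high two lanes.
static inline void add_u32_pairs_to_u64(__m128i a, __m128i b,
                                        __m128i *lo, __m128i *hi)
{
    __m128i zero = _mm_setzero_si128();
    __m128i a_lo = _mm_unpacklo_epi32(a, zero);  // punpckldq: zero-extend low two elements
    __m128i a_hi = _mm_unpackhi_epi32(a, zero);  // punpckhdq: zero-extend high two elements
    __m128i b_lo = _mm_unpacklo_epi32(b, zero);
    __m128i b_hi = _mm_unpackhi_epi32(b, zero);
    *lo = _mm_add_epi64(a_lo, b_lo);             // paddq: 64-bit vertical add
    *hi = _mm_add_epi64(a_hi, b_hi);
}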

pmdj
  • SSE4.1 has some integer widening instructions that make this slightly easier and faster. – Mysticial Nov 12 '15 at 01:17
  • @Mysticial: For signed integers, it's actually a *lot* easier and faster with `pmovsx`. It's not as big a difference as I thought at first, since I had a pretty good idea while writing my answer for unpacking with a sign-mask, instead of unpacking and *then* blending a sign mask. But `pmovsx` is very nice if you're loading from memory, otherwise you have to work to get the upper half shifted over to prep for sign-extending it. – Peter Cordes Nov 12 '15 at 02:50
3

SSE only has this for byte->word and word->dword. (pmaddubsw (SSSE3) and pmaddwd (MMX/SSE2), which vertically multiply v1 * v2, then horizontally add neighbouring pairs.)
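
As an illustration (mine, not part of the original answer), the word->dword form can be used as a widening horizontal pair-add by multiplying against a vector of 1s:

#include <emmintrin.h>  // SSE2

// pmaddwd against all-ones horizontally adds neighbouring signed 16-bit pairs
// into 32-bit sums.  There is no equivalent dword->qword instruction.
__m128i hadd_pairs_i16_to_i32(__m128i v)
{
    return _mm_madd_epi16(v, _mm_set1_epi16(1));  // pmaddwd v, {1,1,1,1,1,1,1,1}
}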

I'm not clear on what you want the outputs to be. You have 8 input integers (two vectors of 4), and 2 output integers (one vector of two). Since there's no insn that does any kind of 32+32 -> 64b vector addition, let's just look at how to zero-extend or sign-extend the low two 32b elements of a vector to 64b. You can combine this into whatever you need, but keep in mind there's no add-horizontal-pairs phaddq, only vertical paddq.

phaddd is similar to what you want, but without the widening: low half of the result is the sum of horizontal pairs in the first operand, high half is the sum of horizontal pairs in the second operand. It's pretty much only worth using if you need all those results, and you're not going to combine them further. (i.e. it's usually faster to shuffle and vertical add instead of running phadd to horizontally sum a vector accumulator at the end of a reduction. And if you're going to sum everything down to one result, do normal vertical sums until you're down to one register.) phaddd could be implemented in hardware to be as fast as paddd (single cycle latency and throughput), but it isn't in any AMD or Intel CPU.
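
To make that concrete, a small example (mine, not from the answer) of phaddd's semantics:

#include <tmmintrin.h>  // SSSE3

// For a = {a0,a1,a2,a3} and b = {b0,b1,b2,b3}, the result is
// {a0+a1, a2+a3, b0+b1, b2+b3} -- still 32-bit elements, no widening.
__m128i hadd_demo(__m128i a, __m128i b)
{
    return _mm_hadd_epi32(a, b);  // phaddd
}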


Like Mysticial commented, SSE4.1 pmovzxdq / pmovsxdq are exactly what you need, and can even do it on the fly as part of a load from a 64b memory location (containing two 32b integers).
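
A possible intrinsics sketch for the signed case with those instructions (names are mine; with a register source, the high half needs a shuffle to feed pmovsxdq, as discussed further down):

#include <smmintrin.h>  // SSE4.1

// Sign-extend the low and high halves of a and b, then add pairwise.
static inline void add_i32_pairs_to_i64_sse41(__m128i a, __m128i b,
                                              __m128i *lo, __m128i *hi)
{
    __m128i a_lo = _mm_cvtepi32_epi64(a);                    // pmovsxdq
    __m128i b_lo = _mm_cvtepi32_epi64(b);
    __m128i a_hi = _mm_cvtepi32_epi64(_mm_srli_si128(a, 8)); // psrldq brings the high 64 down first
    __m128i b_hi = _mm_cvtepi32_epi64(_mm_srli_si128(b, 8));
    *lo = _mm_add_epi64(a_lo, b_lo);                         // paddq
    *hi = _mm_add_epi64(a_hi, b_hi);
}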

SSE4.1 was introduced with Intel Penryn, the 2nd-gen Core 2 (the 45nm die shrink of Core 2), the generation before Nehalem. Falling back to a non-vector code path on CPUs older than that might be ok, depending on how much you care about not being slow on CPUs that are already old and slow.


Without SSE4.1:

Unsigned zero-extension is easy. Like pmdj answered, just use punpck* lo and hi to unpack with zero.

If your integers are signed, you'll have to do the sign-extension manually.

There is no psraq, only psrad (Packed Shift Right Arithmetic Dword) and psraw. If it existed, you could unpack the vector with itself and then arithmetic-right-shift the 64-bit elements by 32.

Instead, we probably need to generate a vector where each element is turned into its sign bit. Then blend that with an unpacked vector (but pblendw is SSE4.1 too, so we'd have to use por).

Or better, unpack the original vector with a vector of sign-masks.

# input in xmm0
movdqa    xmm1, xmm0
movdqa    xmm2, xmm0
psrad     xmm0, 31     ; xmm0 = signmask (all-ones or all-zeros, depending on sign of each input element);  xmm1=orig ; xmm2=orig
punpckldq xmm1, xmm0   ; xmm1 = sign-extend(lo64(orig))
punpckhdq xmm2, xmm0   ; xmm2 = sign-extend(hi64(orig))
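
The same idea with intrinsics (a sketch; the compiler takes care of the register copies):

#include <emmintrin.h>  // SSE2

// Manual sign extension: unpack the original vector with a vector of its own sign masks.
static inline void sign_extend_lo_hi(__m128i v, __m128i *lo, __m128i *hi)
{
    __m128i sign = _mm_srai_epi32(v, 31);  // psrad: all-ones where the element is negative, else all-zeros
    *lo = _mm_unpacklo_epi32(v, sign);     // punpckldq: sign-extend(lo64(v))
    *hi = _mm_unpackhi_epi32(v, sign);     // punpckhdq: sign-extend(hi64(v))
}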

The asm version above should run with 2-cycle latency for both results on Intel SnB or IvB. Haswell and later have only one shuffle port (so they can't do both punpck insns in parallel), so xmm2 will be delayed by another cycle there. Pre-SnB Intel CPUs usually bottleneck on the frontend (decoders, etc.) with vector instructions, because they often average more than 4 bytes per insn.

Shifting the original instead of the copy shortens the dependency chain for whatever produces xmm0, on CPUs without move elimination (which handles mov instructions at the register-rename stage so they're zero latency; Intel-only, and only on IvB and later). With 3-operand AVX instructions you wouldn't need the movdqa or the 3rd register, but then you could just use vpmovsx for the low 64 anyway. To sign-extend the high 64, you'd probably psrldq byte-shift the high 64 down to the low 64.

Or movhlps or punpckhqdq self,self to use a shorter-to-encode instruction. (or AVX2 vpmovsx to a 256b reg, and then vextracti128 the upper 128, to get both 128b results with only two instructions.)
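
A sketch of that AVX2 variant (my code, not from the answer):

#include <immintrin.h>  // AVX2

// One vpmovsxdq widens all four elements into a 256-bit register;
// vextracti128 then gives the upper 128 bits, so two instructions total.
static inline void sign_extend_lo_hi_avx2(__m128i v, __m128i *lo, __m128i *hi)
{
    __m256i wide = _mm256_cvtepi32_epi64(v);  // vpmovsxdq xmm -> ymm
    *lo = _mm256_castsi256_si128(wide);       // low 128 bits (no instruction needed)
    *hi = _mm256_extracti128_si256(wide, 1);  // vextracti128
}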


Unlike GP-register shifts (e.g. sar eax, 31), vector shifts saturate the count instead of masking it. Leaving the original sign bit as the LSB (shifting by 31) instead of a copy of it (shifting by 32) works fine, too. It has the advantage of not requiring a big comment in the code explaining this for people who would worry when they saw psrad xmm0, 32.

Peter Cordes