SSE Sum of multiplication of 4 32-bit integers

Question

Thanks to this post I found out how to multiply 4 32-bit integers.

What I want to do now is sum up the results. How can I do this using intrinsics? I've got access to SSE, SSE2 and AVX. My initial thoughts were to unload res into an int array and sum the first and third elements but I want to know if there is a better way.

This is what my code looks like

__m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
__m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
__m128i res = _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE (0,0,2,0)));

Regarding the multiply: since you have AVX, you only need `__m128i res = _mm_mullo_epi32(a, b)`. – Z boson, May 18 '15 at 08:36
Can you clarify which CPU families you are limited to? SSE, SSE2 and AVX only seems like an unlikely combination. Are you sure you don't also have SSE3, SSSE3, SSE4, etc.? – Paul R, May 18 '15 at 08:42
SSE is pointless to mention since it does not support integer SIMD operations. – Z boson, May 18 '15 at 09:10
Strictly speaking SSE *does* have integer SIMD instructions, but only for 64-bit vectors, not 128-bit ones. "SSE" can also be a catch-all term for all the various SSE* instruction sets, so I think we can allow it here. ;-) – Paul R, May 18 '15 at 09:14

score 3 · Answer 1 · answered May 17 '15 at 17:07

If you just want a horizontal add, i.e. the sum of all four 32-bit int elements in the result vector, you can shift and add twice, then extract the low element, e.g.:

__m128i vsum = _mm_add_epi32(v, _mm_srli_si128(v, 8));   /* lanes 0,1 += lanes 2,3 */
vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));     /* lane 0 += lane 1 */
int32_t sum = _mm_cvtsi128_si32(vsum);                   /* extract lane 0 (needs <stdint.h>) */
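
For completeness, here is a minimal sketch that combines the SSE2 multiply from the question with the shift-and-add reduction above. The function names (`hsum_epi32_sse2`, `mullo_epi32_sse2`, `dot4_epi32_sse2`) are made up for illustration, and the unaligned loads are an assumption about how the inputs are stored. Both the products and the sum wrap modulo 2^32.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Horizontal sum of the four 32-bit lanes of v (wraps modulo 2^32). */
static int32_t hsum_epi32_sse2(__m128i v)
{
    __m128i vsum = _mm_add_epi32(v, _mm_srli_si128(v, 8));   /* lanes 0,1 += lanes 2,3 */
    vsum = _mm_add_epi32(vsum, _mm_srli_si128(vsum, 4));     /* lane 0 += lane 1 */
    return _mm_cvtsi128_si32(vsum);                          /* extract lane 0 */
}

/* Low 32 bits of each lane product, using the SSE2 method from the question. */
static __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b);                      /* products of lanes 0 and 2 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4),
                                 _mm_srli_si128(b, 4));      /* products of lanes 1 and 3 */
    return _mm_unpacklo_epi32(
        _mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
        _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}

/* Sum of x[i] * y[i] for i = 0..3 (hypothetical helper, unaligned loads). */
static int32_t dot4_epi32_sse2(const int32_t *x, const int32_t *y)
{
    __m128i a = _mm_loadu_si128((const __m128i *)x);
    __m128i b = _mm_loadu_si128((const __m128i *)y);
    return hsum_epi32_sse2(mullo_epi32_sse2(a, b));
}
```

Calling `dot4_epi32_sse2` on `{1, 2, 3, 4}` and `{5, 6, 7, 8}` should give 5 + 12 + 21 + 32 = 70.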

answered May 17 '15 at 17:07

Paul R


You could use `_mm_hadd_epi32` twice, but either method will wrap around on overflow. To handle overflow you need to sign-extend, which is more complicated. – Z boson May 18 '15 at 08:35
True, but `_mm_hadd_epi32` is SSSE3 and the OP claims that they only have SSE, SSE2 and AVX (which seems like an unlikely combination, admittedly, but who knows, maybe an AMD CPU without SSSE3?). Note also that if the OP had the full range of SSE instructions available they would presumably be using `_mm_mullo_epi32` rather than the above SSE2 multiplication method. – Paul R May 18 '15 at 08:38
Oh, I read his question differently. I thought he had AVX and everything below it, not that he wanted a solution for each. SSE does not support integer operations anyway. – Z boson May 18 '15 at 09:09
You may be right. He evidently doesn't have SSE4 though, for example, since he ignored the SSE4 implementation in the linked answer and used the SSE2 version. I'm not too familiar with AMD CPUs but I know that some of them lack SSSE3 and SSE4, so that would be my best guess (although I have no idea if any AMD CPUs support AVX). Anyway, I've asked the OP for clarification (see comment above). – Paul R May 18 '15 at 09:12
As I understand it, the AVX CPUID feature bit implies support for the VEX-encoded 128b version of every SSE, SSE2, SSE3, SSSE3, SSE4.1, and SSE4.2 instruction. Note how the Intel manual says `VAESDEC` requires `AVX & AES` feature flags, but `VPHADDD xmm1, xmm2, xmm3/m128` just says you need `AVX`, not `AVX & SSSE3`, for example. – Peter Cordes Jul 08 '15 at 21:18
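
On the overflow point raised in the comments: one possible way to avoid wrap-around in the horizontal add with SSE2 only is to sign-extend each 32-bit lane to 64 bits before summing. The sketch below is an illustration, not something from the answer; the function name `hsum_epi32_widen_sse2` is made up. It only widens the final sum, so products already truncated to 32 bits by the multiply step stay truncated.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stdint.h>

/* Sum the four signed 32-bit lanes of v into a 64-bit result (no wrap-around). */
static int64_t hsum_epi32_widen_sse2(__m128i v)
{
    __m128i sign = _mm_srai_epi32(v, 31);        /* 0 or -1 per lane: the sign bits */
    __m128i lo   = _mm_unpacklo_epi32(v, sign);  /* lanes 0,1 sign-extended to 64 bits */
    __m128i hi   = _mm_unpackhi_epi32(v, sign);  /* lanes 2,3 sign-extended to 64 bits */
    __m128i sum  = _mm_add_epi64(lo, hi);        /* (v0 + v2), (v1 + v3) */
    sum = _mm_add_epi64(sum, _mm_srli_si128(sum, 8));  /* low half holds the total */

    int64_t out[2];
    _mm_storeu_si128((__m128i *)out, sum);       /* store instead of _mm_cvtsi128_si64,
                                                    which is only available on x86-64 */
    return out[0];
}
```

With all four lanes set to `INT32_MAX`, the 32-bit version would wrap, while this one should return 8589934588.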
