Thanks to this post I found out how to multiply 4 32-bit integers.
What I want to do now is sum up the results. How can I do this using intrinsics? I've got access to SSE, SSE2 and AVX. My initial thoughts were to unload res
into an int array and sum the first and third elements but I want to know if there is a better way.
This is what my code looks like
__m128i tmp1 = _mm_mul_epu32(a,b); /* mul 2,0*/
__m128i tmp2 = _mm_mul_epu32( _mm_srli_si128(a,4), _mm_srli_si128(b,4)); /* mul 3,1 */
__m128i res = _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE (0,0,2,0)), _mm_shuffle_epi32(tmp2, _MM_SHUFFLE (0,0,2,0)));