Using the set intrinsics such as _mm_set_epi32 to fill a register element by element is inefficient. It's better to use the load intrinsics; see this discussion for more on that: Where does the SSE instructions outperform normal instructions. If the arrays are 16-byte aligned you can use either _mm_load_si128 or _mm_loadu_si128 (on aligned memory they have nearly the same efficiency); otherwise use _mm_loadu_si128. But aligned memory is much more efficient, so to get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so you can use normal free.
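For example, here is a minimal sketch of allocating 16-byte-aligned storage with aligned_alloc and loading it with _mm_load_si128 (the size and names are just for illustration):

#include <emmintrin.h>   // SSE2 intrinsics
#include <stdlib.h>      // aligned_alloc, free

int main(void) {
    size_t n = 1024;                               // element count, a multiple of 4
    int *x = aligned_alloc(16, n * sizeof(int));   // 16-byte aligned; size is a multiple of 16
    for (size_t i = 0; i < n; i++) x[i] = (int)i;  // fill with some data
    __m128i v = _mm_load_si128((__m128i*)&x[0]);   // aligned load of x[0..3]
    (void)v;
    free(x);                                       // memory from aligned_alloc is freed with normal free
    return 0;
}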
To answer the rest of your question, let's assume you have your two vectors loaded into SSE registers __m128i a and __m128i b.
For SSE4.1 or later, use
_mm_mullo_epi32(a, b);
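As a rough sketch of how this fits into a loop over two int arrays (assuming n is a multiple of 4 and SSE4.1 is enabled, e.g. with -msse4.1; the function and array names here are mine, not from the question):

#include <smmintrin.h>   // SSE4.1, for _mm_mullo_epi32
#include <stddef.h>      // size_t

void mul_arrays(const int *a, const int *b, int *c, size_t n) {
    for (size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128((const __m128i*)&a[i]);  // load a[i..i+3]
        __m128i vb = _mm_loadu_si128((const __m128i*)&b[i]);  // load b[i..i+3]
        __m128i vc = _mm_mullo_epi32(va, vb);                 // keep the low 32 bits of each product
        _mm_storeu_si128((__m128i*)&c[i], vc);                // store c[i..i+3]
    }
}

If the pointers are 16-byte aligned as discussed above, you can use _mm_load_si128 and _mm_store_si128 instead.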
Without SSE4.1:
This code is copied from Agner Fog's Vector Class Library (and was plagiarized by the original author of this answer):
// From Vec4i operator * (Vec4i const & a, Vec4i const & b); this is the
// branch the library's instruction-set #ifdef takes when SSE4.1 is not available:
__m128i a13 = _mm_shuffle_epi32(a, 0xF5); // (-,a3,-,a1)
__m128i b13 = _mm_shuffle_epi32(b, 0xF5); // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b); // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13); // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02,prod13); // (-,-,a1*b1,a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02,prod13); // (-,-,a3*b3,a2*b2)
__m128i prod = _mm_unpacklo_epi64(prod01,prod23); // (ab3,ab2,ab1,ab0)
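Either way you end up with the four products packed in prod, and since only the low 32 bits of each product are kept, the result matches what _mm_mullo_epi32 gives for signed or unsigned inputs. A quick sketch of storing the result (the output array c is just illustrative):

int c[4];
_mm_storeu_si128((__m128i*)c, prod);   // c[0]=a0*b0, c[1]=a1*b1, c[2]=a2*b2, c[3]=a3*b3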