
I have two unsigned vectors, both of size 4:

vector<unsigned> v1 = {2, 4, 6, 8};
vector<unsigned> v2 = {1, 10, 11, 13};

Now I want to multiply these two vectors element-wise and get a new one:

vector<unsigned> v_result = {2*1, 4*10, 6*11, 8*13};

What is the SSE operation to use? Is it cross-platform or only available on certain platforms?

Addendum: if my goal were addition rather than multiplication, I could do this super fast:

__m128i a = _mm_set_epi32(1,2,3,4);
__m128i b = _mm_set_epi32(1,2,3,4);
__m128i c;
c = _mm_add_epi32(a,b);
Peter Cordes
WhatABeautifulWorld

5 Answers


Using the set intrinsics such as _mm_set_epi32 for all elements is inefficient. It's better to use the load intrinsics; see this discussion for more on that: Where do SSE instructions outperform normal instructions? If the arrays are 16-byte aligned you can use either _mm_load_si128 or _mm_loadu_si128 (for aligned memory they have nearly the same efficiency); otherwise use _mm_loadu_si128. Aligned memory is much more efficient, though. To get aligned memory I recommend _mm_malloc and _mm_free, or C11 aligned_alloc so that you can use plain free.
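For illustration only, here is a small sketch of getting 16-byte-aligned storage with _mm_malloc and reading it back with the aligned load; the function name aligned_load_demo and the values are just placeholders:

#include <immintrin.h>

// Sketch: allocate 16-byte-aligned memory, fill it, and load it with the aligned intrinsic.
void aligned_load_demo()
{
    unsigned *p = static_cast<unsigned*>(_mm_malloc(4 * sizeof(unsigned), 16));
    p[0] = 2; p[1] = 4; p[2] = 6; p[3] = 8;

    __m128i a = _mm_load_si128(reinterpret_cast<const __m128i*>(p));  // aligned load
    (void)a;  // ... use a ...

    _mm_free(p);  // memory from _mm_malloc must be released with _mm_free
}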


To answer the rest of your question, let's assume you have your two vectors loaded into the SSE registers __m128i a and __m128i b.

For SSE4.1 or later, use

_mm_mullo_epi32(a, b);
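For context, a minimal end-to-end sketch with the vectors from the question might look like the following (unaligned loads are used because std::vector makes no 16-byte alignment promise, and the file must be compiled with SSE4.1 enabled, e.g. -msse4.1 on gcc/clang):

#include <immintrin.h>   // SSE4.1 intrinsics
#include <vector>

int main()
{
    std::vector<unsigned> v1 = {2, 4, 6, 8};
    std::vector<unsigned> v2 = {1, 10, 11, 13};
    std::vector<unsigned> v_result(4);

    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v1.data()));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v2.data()));
    __m128i c = _mm_mullo_epi32(a, b);   // four 32-bit products in one instruction
    _mm_storeu_si128(reinterpret_cast<__m128i*>(v_result.data()), c);

    // v_result now holds {2, 40, 66, 104}
}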

Without SSE4.1:

This code is copied from Agner Fog's Vector Class Library (the original author of this answer posted it without attribution):

// From Vec4i operator * (Vec4i const & a, Vec4i const & b),
// taking the code path used when SSE4.1 is not available:
__m128i a13    = _mm_shuffle_epi32(a, 0xF5);          // (-,a3,-,a1)
__m128i b13    = _mm_shuffle_epi32(b, 0xF5);          // (-,b3,-,b1)
__m128i prod02 = _mm_mul_epu32(a, b);                 // (-,a2*b2,-,a0*b0)
__m128i prod13 = _mm_mul_epu32(a13, b13);             // (-,a3*b3,-,a1*b1)
__m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);  // (-,-,a1*b1,a0*b0)
__m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);  // (-,-,a3*b3,a2*b2)
__m128i prod   = _mm_unpacklo_epi64(prod01, prod23);  // (a3*b3,a2*b2,a1*b1,a0*b0)
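If it helps, the fragment can be wrapped into a reusable SSE2-only helper, sketched below (the name mullo_epi32_sse2 is made up here); it can then be used exactly like _mm_mullo_epi32 in the SSE4.1 case above.

#include <emmintrin.h>   // SSE2 only

// Drop-in substitute for _mm_mullo_epi32 on SSE2-only targets (hypothetical name).
static inline __m128i mullo_epi32_sse2(__m128i a, __m128i b)
{
    __m128i a13    = _mm_shuffle_epi32(a, 0xF5);
    __m128i b13    = _mm_shuffle_epi32(b, 0xF5);
    __m128i prod02 = _mm_mul_epu32(a, b);
    __m128i prod13 = _mm_mul_epu32(a13, b13);
    __m128i prod01 = _mm_unpacklo_epi32(prod02, prod13);
    __m128i prod23 = _mm_unpackhi_epi32(prod02, prod13);
    return _mm_unpacklo_epi64(prod01, prod23);
}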
Community

There is _mm_mul_epu32, which requires only SSE2 and uses the pmuludq instruction. Since it's an SSE2 instruction, 99.9% of all CPUs support it (I think the most modern CPU that doesn't is the AMD Athlon XP).

It has a significant downside: it only multiplies two integers at a time, because it returns 64-bit results and you can only fit two of those in a register. This means you'll probably need to do a bunch of shuffling, which adds to the cost.
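To make the lane behaviour concrete, here is a small sketch (the function name mul_epu32_demo is invented) showing that _mm_mul_epu32 multiplies only lanes 0 and 2 and widens each product to 64 bits:

#include <emmintrin.h>   // SSE2
#include <cstdint>

// Demo: pmuludq multiplies lanes 0 and 2 only, widening each product to 64 bits.
void mul_epu32_demo()
{
    __m128i a = _mm_setr_epi32(2, 4, 6, 8);        // lanes 0..3 in memory order
    __m128i b = _mm_setr_epi32(1, 10, 11, 13);
    __m128i p = _mm_mul_epu32(a, b);               // 64-bit products of lanes 0 and 2

    uint64_t out[2];
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), p);
    // out[0] == 2 (2*1), out[1] == 66 (6*11); lanes 1 and 3 need a separate shuffle + multiply
}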

Adam

You can (if SSE 4.1 is available) use

__m128i _mm_mullo_epi32 (__m128i a, __m128i b);

to multiply packed 32-bit integers. Otherwise you'd have to shuffle both packs in order to use _mm_mul_epu32 twice. See @user2088790's answer for explicit code.

Note that you could also use _mm_mul_epi32, but that is SSE4.1 as well, so you'd rather use _mm_mullo_epi32 anyway.

Pixelchemist
  • Right, I am asking about the platforms for this _mm_mul_epi32. Is it available everywhere or only in a handful of places? – WhatABeautifulWorld Jun 23 '13 at 19:58
  • See [Wikipedia/SSE4](http://en.wikipedia.org/wiki/SSE4) for info on which architectures it is present. AMD has had it since K10 and Intel since the Core 2 days. – Pixelchemist Jun 23 '13 at 20:00
  • The OP seems to be asking for 32bit results, not a widening / full multiplication (32x32->64). Since SSE4.1 also adds `_mm_mullo_epi32` (`pmulld`), which gives four 32bit results, this is the wrong answer, and @user2088790's answer is the right answer. SSE4.1's `_mm_mul_epi32` is the signed version of SSE2 `_mm_mul_epu32`. – Peter Cordes Jun 09 '16 at 06:22
  • @PeterCordes: The answer says that it is a multiplication of the low halves into 64bit. It also says you'd need one more call to `_mm_mul_epi32` as well as some shuffling. This is essentially what the answer of user2088790 does with the unsigned version `_mm_mul_epu32`. I admit, however, that I suggested the signed intrinsic where the OP indicated unsigned values. – Pixelchemist Jun 09 '16 at 06:51
  • @Pixelchemist: My main point is that since your suggestion requires SSE4.1 anyway, you should use `_mm_mullo_epi32` to get four 32bit results. Notice the SSE4.1 section of user2088790's answer. `_mm_mullo_epi32` is slower than `_mm_mul_epu32` [on some CPUs](http://agner.org/optimize/), but it's still faster than two multiplies + shuffling, so it's the only valid answer for SSE4.1. (Also, if you're throwing away the high half, signed vs. unsigned doesn't matter). – Peter Cordes Jun 09 '16 at 07:35
  • @PeterCordes: And my main point is that while I agree that it is inferior to other solutions since 1. you'd use `_mm_mullo_epi32` if having SSE4 anyway and 2. you'd have to use `_mm_mul_epu32` if not having SSE4, it is technically valid to use shuffling with `_mm_mul_epi32` for signed values. Thus, the answer was suboptimal but not completely wrong. – Pixelchemist Jun 09 '16 at 11:00
  • The whole point of using SIMD intrinsics is high performance. Shuffling `_mm_mul_epi32` output would work, but absolutely deserved a downvote (which I've removed since you fixed your answer). – Peter Cordes Jun 09 '16 at 11:17

Probably _mm_mullo_epi32 is what you need, although its intended use is for signed integers. This should not cause problems as long as the values in v1 and v2 are small enough that their most significant bits are 0. It's SSE4.1. As an alternative, you might want to consider _mm_mul_epu32.

wim
  • Signedness is irrelevant for the low word of multiplication. It shouldn't have been documented as a signed multiplication - it isn't, it's a sign-oblivious multiplication. It's as silly as documenting `add` as "signed addition". Of course, they made the same mistake with `imul`. (A small scalar check of this is sketched after these comments.) – harold Jun 23 '13 at 20:37
  • @harold: I agree. Good point. Table 2.1 in the Intel [SSE4 programming reference](http://software.intel.com/sites/default/files/m/9/4/2/d/5/17971-intel_20sse4_20programming_20reference.pdf), is quite confusing to me. – wim Jun 24 '13 at 20:59
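For illustration, a quick scalar check of the point in the comment above (plain C++, no intrinsics; the values are arbitrary):

#include <cassert>
#include <cstdint>

// The low 32 bits of a 32x32 product are identical whether the inputs are read
// as signed or unsigned, which is why pmulld works fine for unsigned data.
void low_word_demo()
{
    uint32_t ua = 0xFFFFFFFEu, ub = 3u;     // bit patterns of -2 and 3
    int64_t  sa = -2,          sb = 3;      // the same values read as signed
    uint32_t unsigned_low = ua * ub;                          // 0xFFFFFFFA
    uint32_t signed_low   = static_cast<uint32_t>(sa * sb);   // also 0xFFFFFFFA
    assert(unsigned_low == signed_low);
}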

std::transform applies the given operation to pairs of elements from two ranges and stores the results in a destination range:

std::vector<unsigned> result(v1.size());

std::transform( v1.begin(), v1.end(), v2.begin(), result.begin(), std::multiplies<unsigned>() );
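A complete, compilable version of this approach, with the sizes and values taken from the question, could look roughly like this:

#include <algorithm>
#include <cstdio>
#include <functional>
#include <vector>

int main()
{
    std::vector<unsigned> v1 = {2, 4, 6, 8};
    std::vector<unsigned> v2 = {1, 10, 11, 13};
    std::vector<unsigned> result(v1.size());   // destination must already have room

    std::transform(v1.begin(), v1.end(), v2.begin(), result.begin(),
                   std::multiplies<unsigned>());

    for (unsigned x : result)
        std::printf("%u ", x);                 // prints: 2 40 66 104
}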
fatihk