SSE multiplication of 4 32-bit integers

Question

How can I multiply four 32-bit integers by another four 32-bit integers using SSE? I couldn't find an instruction that does it.

Paul R · Accepted Answer · 2018-02-01T16:37:08.227

25

If you need signed 32x32 bit integer multiplication then the following example at software.intel.com looks like it should do what you want:

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b); /* 64-bit products of elements 2 and 0 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* shift right by 4 bytes, then 64-bit products of elements 3 and 1 */
    /* keep the low 32 bits of each product and interleave them back into element order */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}
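
As a rough sketch of how the SSE2 version could be called and sanity-checked, the snippet below loads two arrays of four integers, multiplies them, and compares each lane with a scalar product. The names `muly_sse2`, `mul4` and `check_against_scalar` are hypothetical and not part of the original answer; the multiply itself is repeated from `muly()` above only so the snippet compiles on its own. It assumes an x86 target with SSE2 enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Same SSE2-only technique as muly() above, repeated so this sketch is self-contained.
static inline __m128i muly_sse2(const __m128i &a, const __m128i &b)
{
    __m128i tmp1 = _mm_mul_epu32(a, b);                                       // products of elements 0 and 2
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); // products of elements 1 and 3
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
}

// Hypothetical helper: element-wise product of two arrays of four int32_t.
// Unaligned loads/stores are used so the arrays need no special alignment.
static void mul4(const int32_t *x, const int32_t *y, int32_t *out)
{
    __m128i a = _mm_loadu_si128(reinterpret_cast<const __m128i *>(x));
    __m128i b = _mm_loadu_si128(reinterpret_cast<const __m128i *>(y));
    _mm_storeu_si128(reinterpret_cast<__m128i *>(out), muly_sse2(a, b));
}

// Hypothetical check: each lane should equal the low 32 bits of the scalar product.
// The scalar product is computed in unsigned arithmetic to avoid signed overflow.
static void check_against_scalar(const int32_t *x, const int32_t *y)
{
    int32_t out[4];
    mul4(x, y, out);
    for (int i = 0; i < 4; ++i) {
        uint32_t expected = static_cast<uint32_t>(x[i]) * static_cast<uint32_t>(y[i]);
        assert(static_cast<uint32_t>(out[i]) == expected);
    }
}
```

Calling `check_against_scalar` with a mix of positive, negative and overflowing inputs is a cheap way to confirm the fallback path agrees with plain scalar multiplication before relying on it.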

You might want to have two builds - one for old CPUs and one for recent CPUs, in which case you could do the following:

static inline __m128i muly(const __m128i &a, const __m128i &b)
{
#ifdef __SSE4_1__  // modern CPU - use SSE 4.1
    return _mm_mullo_epi32(a, b);
#else               // old CPU - use SSE 2
    __m128i tmp1 = _mm_mul_epu32(a, b); /* 64-bit products of elements 2 and 0 */
    __m128i tmp2 = _mm_mul_epu32(_mm_srli_si128(a, 4), _mm_srli_si128(b, 4)); /* shift right by 4 bytes, then 64-bit products of elements 3 and 1 */
    /* keep the low 32 bits of each product and interleave them back into element order */
    return _mm_unpacklo_epi32(_mm_shuffle_epi32(tmp1, _MM_SHUFFLE(0, 0, 2, 0)),
                              _mm_shuffle_epi32(tmp2, _MM_SHUFFLE(0, 0, 2, 0)));
#endif
}

edited Feb 01 '18 at 16:37

answered May 08 '12 at 15:19

Paul R


Good answer. Funny you made exactly the same typo I once had in my code: it should be `__SSE4_1__` (no underscore between E and 4). Annoying, because you don't notice it easily: the program runs perfectly as long as the alternate code path is OK – Gunther Piez May 08 '12 at 16:00
@drhirsch: thanks for fixing that - in real code I tend to use `__MNI__`, `__SNI__`, etc - mainly for historical reasons, but it's also less prone to simple errors such as the above. – Paul R May 08 '12 at 16:45
Awesome, thanks! Now, if only there was a similar trick for replacing `_mm_insert_epi32` on a CPU with SSE2 only... – Mikhail T. Jul 14 '16 at 21:17
Note that the "old CPU path" may be actually slower than an explicit `for` loop over the 4 integers - definitely not faster on my machine. – Ivan Ivanov Jan 12 '17 at 18:11

score 8 · Answer 2 · answered May 08 '12 at 14:42

PMULLD, from SSE 4.1, does that.

The description is slightly misleading: it talks about signed multiplication, but since it only stores the lower 32 bits of each product, it's really a sign-oblivious instruction that you can use for both signed and unsigned operands, just like IMUL.
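
To illustrate the sign-oblivious point in plain scalar code (no SSE needed), the sketch below computes the low 32 bits of a product twice, once treating the inputs as signed and once treating the same bit patterns as unsigned. The helper names `low32_signed` and `low32_unsigned` are hypothetical and only serve the illustration.

```cpp
#include <cstdint>

// Low 32 bits of the product when both inputs are treated as signed.
// The 64-bit intermediate cannot overflow for 32-bit inputs.
static uint32_t low32_signed(int32_t a, int32_t b)
{
    int64_t wide = static_cast<int64_t>(a) * static_cast<int64_t>(b);
    return static_cast<uint32_t>(wide);
}

// Low 32 bits of the product when the same bit patterns are treated as unsigned.
static uint32_t low32_unsigned(int32_t a, int32_t b)
{
    uint64_t wide = static_cast<uint64_t>(static_cast<uint32_t>(a)) *
                    static_cast<uint64_t>(static_cast<uint32_t>(b));
    return static_cast<uint32_t>(wide);
}
```

For any pair of inputs the two functions should return the same value; the signed and unsigned products differ only in the upper 32 bits, which PMULLD discards.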

harold


Thanks. But is there a way to use only SSE 2 instructions? – Yury May 08 '12 at 14:45
`_mm_mullo_epi32` if you'd rather use intrinsics than raw assembly – Paul R May 08 '12 at 14:45
@Leviathan Yes, but you need several instructions. Depending on the architecture, four `imul` are possibly faster and simpler – Gunther Piez May 08 '12 at 14:49
Why only SSE2 ? Do you really need to support > 10 year old CPUs ? – Paul R May 08 '12 at 14:51
@PaulR There are still a lot of AMDs out there. Only Bulldozer supports SSSE3 and above. – Gunther Piez May 08 '12 at 14:52
Opteron *et al* support SSE3 though (not that this helps much in this particular case). – Paul R May 08 '12 at 14:53
