I want to multiply a __m128i holding 16 unsigned 8-bit integers with SSE4, but I could only find an intrinsic for multiplying 16-bit integers. Is there nothing such as _mm_mult_epi8?
-
Could you clarify your question a bit? Do you want to multiply a 128-bit integer with 16 8-bit integers each, or 16 8-bit integers with 16 8-bit integers, or the 16 8-bit integers in a single register with each other? The former case would be a bit strange. – Christian Rau Nov 19 '11 at 11:20
-
Just a thought, but why not pad the 8-bit values to 16 bits? And if you want to test for overflow you can just AND the high byte (AH) and see if there is a match. A bit messy and just a stab in the dark. It would also surprise me if there was direct support for an 8-bit mul, as the SIMD instruction set was written for post-8-bit processors. – Paul Sullivan Nov 19 '11 at 11:25
-
@Paul: 8-bit values are still used in graphics. AltiVec has 8-bit multiply, although only 8 at a time with 16-bit results. – Potatoswatter Nov 19 '11 at 11:36
3 Answers
A (potentially) faster way than Marat's solution, based on Agner Fog's approach:
Instead of splitting into high/low halves, split into odd/even elements. This has the added benefit that it works with pure SSE2 instead of requiring SSE4.1 (of no use to the OP, but a nice added bonus for some). I also added an optimization for when you have AVX2. Technically the AVX2 path uses only SSE2 intrinsics, but without AVX2 it's slower than the shift-left-then-right version.
__m128i mullo_epi8(__m128i a, __m128i b)
{
    // unpack and multiply: even-indexed bytes are multiplied in place,
    // odd-indexed bytes are shifted down into the low half of each 16-bit lane first
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    // repack
#ifdef __AVX2__
    // only faster if you have access to VPBROADCASTW
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_and_si128(dst_even, _mm_set1_epi16(0xFF)));
#else
    return _mm_or_si128(_mm_slli_epi16(dst_odd, 8),
                        _mm_srli_epi16(_mm_slli_epi16(dst_even, 8), 8));
#endif
}
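If it helps to see the semantics, here is a minimal test harness (my own sketch, not part of the answer above) that checks the result against a scalar multiply reduced modulo 256:
#include <stdio.h>
#include <immintrin.h>

// assumes mullo_epi8() from above is defined before this point
int main(void)
{
    unsigned char a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) {
        a[i] = (unsigned char)(i * 17 + 3);
        b[i] = (unsigned char)(250 - i * 9);
    }

    __m128i va = _mm_loadu_si128((const __m128i *)a);
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)c, mullo_epi8(va, vb));

    for (int i = 0; i < 16; i++)
        printf("%3u * %3u = %3u (expect %3u)\n",
               a[i], b[i], c[i], (unsigned char)(a[i] * b[i]));  // low 8 bits of the product
    return 0;
}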
Agner uses the blendv_epi8 intrinsic for the repack, which requires SSE4.1.
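For reference, that repack can be written roughly like this (a paraphrase of the idea, not a verbatim copy of Agner Fog's code; the function name is made up):
// Requires SSE4.1 for _mm_blendv_epi8: keep the even-index bytes from the
// even products and the odd-index bytes from the shifted odd products.
__m128i mullo_epi8_blendv(__m128i a, __m128i b)
{
    __m128i dst_even = _mm_mullo_epi16(a, b);
    __m128i dst_odd  = _mm_mullo_epi16(_mm_srli_epi16(a, 8), _mm_srli_epi16(b, 8));
    __m128i mask     = _mm_set1_epi16(0x00FF);  // 0xFF in the even byte positions
    // blendv takes each byte from the second operand where the mask byte's high bit is set
    return _mm_blendv_epi8(_mm_slli_epi16(dst_odd, 8), dst_even, mask);
}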
Edit:
Interestingly, after doing more disassembly work (with optimized builds), at least my two implementations get compiled to exactly the same thing. Example disassembly targeting "ivy-bridge" (AVX).
vpmullw xmm2,xmm0,xmm1
vpsrlw xmm0,xmm0,0x8
vpsrlw xmm1,xmm1,0x8
vpmullw xmm0,xmm0,xmm1
vpsllw xmm0,xmm0,0x8
vpand xmm1,xmm2,XMMWORD PTR [rip+0x281]
vpor xmm0,xmm0,xmm1
It uses the "AVX2-optimized" version with a pre-compiled 128-bit xmm constant. Compiling with only SSE2 support produces similar results (though using SSE2 instructions). I suspect Agner Fog's original solution might get optimized to the same thing (it would be crazy if it didn't). No idea how Marat's original solution compares in an optimized build, though for me having a single method for all x86 SIMD extensions newer than and including SSE2 is quite nice.

-
This is really nice. It takes advantage of the fact that signed vs. unsigned only affects the high half of an N x N -> 2N bit multiply, and [that garbage in the high bits doesn't affect the result you want in the low bits](http://stackoverflow.com/questions/34377711/which-2s-complement-integer-operations-can-be-used-without-zeroing-high-bits-in). If cache-misses when loading the mask are a problem, you can generate it on the fly with 2 insns: `pcmpeqw xmm7,xmm7` / `psrlw xmm7, 8`. (See http://stackoverflow.com/q/35085059/224132 for other const-generation sequences). – Peter Cordes Feb 01 '16 at 05:38
-
That's neat, I see [clang optimizes the shift-left / shift-right to a `vpand` with a constant mask](http://goo.gl/GmFc9H)! It's probably better code, unless the mask tends to miss in cache. gcc doesn't do that optimization. The choice between shift and mask doesn't depend on AVX2 at all; it depends instead on whether a big constant from memory is what you want. (I notice that without AVX, clang wastes a movdqa at the end: it could have used `pmullw xmm0, xmm1` for the 2nd pmul and built up the final result in `xmm0`, the return-value register.) – Peter Cordes Feb 01 '16 at 06:04
-
Your comment about `vpbroadcastw` is totally wrong: most compilers don't compile `set1` into a run-time broadcast for constants, because it's expensive. `mov eax,0xff` / `movd xmm0,eax` / `vpbroadcastw xmm0,xmm0` is 3 uops on Haswell. `vpbroadcastw xmm0, [mem16]` is also 3 uops. Generating on the fly is cheaper than either (but compilers tend to just throw them in memory). However, `vpbroadcastd` from memory is only 1 uop, even unfused: it only needs a load port, not an ALU. So you don't need to waste 32B of memory on a constant that's going to be loaded outside the loop. – Peter Cordes Feb 01 '16 at 06:07
-
So anyway, in a loop where the mask can be kept around in a register, it's prob. best to generate it on the fly rather than load it (with or without broadcast). If not in a loop, it might be best to save uop-cache space and just use a mask directly from memory (esp. if it's only 128b, not 256b or 512b). – Peter Cordes Feb 01 '16 at 06:09
-
The mask can be avoided completely if one of the low halves is shifted left by 8 before the multiply. That puts the desired byte in the high 8 bits where it can be shifted right by 8. That's two shifts instead of materializing the mask (2 instructions) and the pand. – bbudge Jul 10 '18 at 23:44
-
Update on broadcast constants: the smart option which some compilers are getting better at would be `vpbroadcastd xmm0, [mem32]` - repeat the word twice in a dword, and broadcast-load at runtime. Dword broadcasts from memory are free on Intel CPUs since at least Haswell, and recent AMD (https://uops.info/), except for code-size vs. `vmovdqa` being 1 byte smaller. But of course much larger total code+rodata. – Peter Cordes May 11 '23 at 17:51
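The constant-generation trick from the comments above, spelled with intrinsics (a minimal sketch; the helper name is made up):
// Comparing equal values gives all-ones per lane (pcmpeqw); a logical right shift
// by 8 then leaves 0x00FF in every 16-bit lane, so no constant is loaded from memory.
static inline __m128i make_mask_00ff(void)
{
    __m128i ones = _mm_cmpeq_epi16(_mm_setzero_si128(), _mm_setzero_si128());
    return _mm_srli_epi16(ones, 8);
}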
There is no 8-bit multiplication in MMX/SSE/AVX. However, you can emulate an 8-bit multiplication intrinsic using 16-bit multiplication as follows:
inline __m128i _mm_mullo_epi8(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    __m128i Alo = _mm_cvtepu8_epi16(a);          // zero-extend the low 8 bytes of a
    __m128i Ahi = _mm_unpackhi_epi8(a, zero);    // zero-extend the high 8 bytes of a
    __m128i Blo = _mm_cvtepu8_epi16(b);
    __m128i Bhi = _mm_unpackhi_epi8(b, zero);
    __m128i Clo = _mm_mullo_epi16(Alo, Blo);     // 16-bit products of the low halves
    __m128i Chi = _mm_mullo_epi16(Ahi, Bhi);     // 16-bit products of the high halves
    // pick the low byte of each 16-bit product and repack them into one vector
    __m128i maskLo = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 14, 12, 10, 8, 6, 4, 2, 0);
    __m128i maskHi = _mm_set_epi8(14, 12, 10, 8, 6, 4, 2, 0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
    __m128i C = _mm_or_si128(_mm_shuffle_epi8(Clo, maskLo), _mm_shuffle_epi8(Chi, maskHi));
    return C;
}
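Note that _mm_cvtepu8_epi16 needs SSE4.1 and _mm_shuffle_epi8 needs SSSE3. If you want to drop the SSE4.1 requirement, the low halves can be zero-extended with the same SSE2 unpack trick already used for the high halves; a sketch of that variant (the function name is made up):
inline __m128i mullo_epu8_ssse3(__m128i a, __m128i b)
{
    __m128i zero = _mm_setzero_si128();
    __m128i Alo = _mm_unpacklo_epi8(a, zero);    // SSE2 zero-extension of the low 8 bytes
    __m128i Ahi = _mm_unpackhi_epi8(a, zero);
    __m128i Blo = _mm_unpacklo_epi8(b, zero);
    __m128i Bhi = _mm_unpackhi_epi8(b, zero);
    __m128i Clo = _mm_mullo_epi16(Alo, Blo);
    __m128i Chi = _mm_mullo_epi16(Ahi, Bhi);
    __m128i maskLo = _mm_set_epi8(0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 14, 12, 10, 8, 6, 4, 2, 0);
    __m128i maskHi = _mm_set_epi8(14, 12, 10, 8, 6, 4, 2, 0, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80, 0x80);
    return _mm_or_si128(_mm_shuffle_epi8(Clo, maskLo), _mm_shuffle_epi8(Chi, maskHi));
}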

The only 8-bit SSE multiply instruction is PMADDUBSW (SSSE3 and later; C/C++ intrinsic: _mm_maddubs_epi16). This multiplies 16 x 8-bit unsigned values by 16 x 8-bit signed values and then sums adjacent pairs to give 8 x 16-bit signed results. If you can't use this rather specialised instruction, then you'll need to unpack to pairs of 16-bit vectors and use regular 16-bit multiply instructions. Obviously this implies at least a 2x throughput hit, so use the 8-bit multiply if you possibly can.
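To make the pairwise-sum behaviour concrete, here is a small self-contained sketch (the example data and names are mine) that applies a +1/-1 tap pattern to 16 unsigned bytes:
#include <stdio.h>
#include <tmmintrin.h>   // SSSE3, for _mm_maddubs_epi16

int main(void)
{
    // 16 unsigned pixel bytes and 16 signed 8-bit taps (+1, -1 repeated)
    __m128i pixels = _mm_setr_epi8(10, 15, 20, 25, 30, 35, 40, 45,
                                   50, 55, 60, 65, 70, 75, 80, 85);
    __m128i taps   = _mm_setr_epi8(1, -1, 1, -1, 1, -1, 1, -1,
                                   1, -1, 1, -1, 1, -1, 1, -1);
    // each 16-bit lane i gets pixel[2i]*1 + pixel[2i+1]*(-1), i.e. -5 for this data
    __m128i sums = _mm_maddubs_epi16(pixels, taps);

    short out[8];
    _mm_storeu_si128((__m128i *)out, sums);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);   // prints: -5 -5 -5 -5 -5 -5 -5 -5
    printf("\n");
    return 0;
}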
