Fastest way to horizontally sum SSE unsigned byte vector

Question

I need to horizontally add a __m128i that is 16 x epi8 values. The XOP instructions would make this trivial, but I don't have those available.

Current method is:

hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
hd = _mm_hadd_epi16(hd, hd);
hd = _mm_hadd_epi16(hd, hd);

Is there a better way with up to SSE4.1?

Related: [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129) sums up `_mm256_cmpeq_epi8` results, needing this operation as one of the steps in the outer loop. — Peter Cordes, May 16 '21 at 13:59
Related: [How to horizontally sum signed bytes in XMM](https://stackoverflow.com/q/70370454) shows how to extend this for signed bytes. (And optionally, how to only sum 9 bytes instead of a full 16.) — Peter Cordes, Dec 17 '21 at 05:25

score 13 · Accepted Answer · edited May 16 '21 at 13:55

You can do it with SSE2's _mm_sad_epu8 (psadbw), e.g.:

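#include <emmintrin.h>  // SSE2
#include <stdint.h>

inline uint32_t _mm_sum_epu8(const __m128i v)
{
    // psadbw against zero: each 64-bit half of vsum holds the sum of its 8 bytes (at most 2040)
    __m128i vsum = _mm_sad_epu8(v, _mm_setzero_si128());
    // low half via movd, high half via pextrw (16-bit element 4)
    return _mm_cvtsi128_si32(vsum) + _mm_extract_epi16(vsum, 4);
}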

If you're summing more than one vector of bytes, accumulate the vsum results with _mm_add_epi32 (or _mm_add_epi64), and do the final horizontal sum of the two 32-bit (or 64-bit) halves to scalar only once at the end.
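A minimal sketch of that multi-vector pattern, assuming the input is a plain byte buffer. The name `sum_bytes_sse2`, the unaligned loads, and the scalar tail loop are illustrative assumptions, not part of the original answer.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n unsigned bytes starting at data. */
static uint64_t sum_bytes_sse2(const uint8_t *data, size_t n)
{
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();  /* two 64-bit partial sums */
    size_t i = 0;

    for (; i + 16 <= n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        /* psadbw against zero: each 64-bit half holds the sum of its 8 bytes */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }

    /* Single horizontal step, done once: add the high half to the low half. */
    __m128i hi = _mm_unpackhi_epi64(acc, acc);
    __m128i total = _mm_add_epi64(acc, hi);

    uint64_t sum;
    _mm_storel_epi64((__m128i *)&sum, total);  /* store the low 64 bits */

    for (; i < n; ++i)  /* scalar tail for the last n % 16 bytes */
        sum += data[i];
    return sum;
}
```

This sketch uses 64-bit accumulation so overflow is not a practical concern. With _mm_add_epi32 each step adds at most 2040 per lane, so the accumulator would only be safe for a bounded number of vectors.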

answered May 03 '16 at 08:04 by Paul R · edited May 16 '21 at 13:55 by Peter Cordes

Compilers fail to optimize `_mm_extract_epi16(vsum, 0)` into `movd` - they don't realize that the upper 2 bytes of the low dword will be 0 so they actually use `pextrw eax, xmm0, 0`. https://godbolt.org/z/TMb8rc1j4. Use `_mm_cvtsi128_si32(vsum)` instead to save a shuffle uop. I fixed that for you. – Peter Cordes May 16 '21 at 13:51
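To make the comment concrete, here are the two ways of reading the low half side by side, as a rough sketch. The function names are hypothetical. Both return the same value because each 64-bit half of the psadbw result is at most 8 * 255 = 2040, so everything above bit 15 of the low dword is zero. Which instructions a given compiler emits for each variant is not something this sketch checks.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Variant A (hypothetical name): low half read as 16-bit element 0.
   Per the comment above, compilers emit pextrw for this. */
static inline uint32_t hsum_epu8_pextrw(__m128i v)
{
    __m128i vsum = _mm_sad_epu8(v, _mm_setzero_si128());
    return (uint32_t)_mm_extract_epi16(vsum, 0)
         + (uint32_t)_mm_extract_epi16(vsum, 4);
}

/* Variant B (hypothetical name): low half read as the low 32 bits (movd). */
static inline uint32_t hsum_epu8_movd(__m128i v)
{
    __m128i vsum = _mm_sad_epu8(v, _mm_setzero_si128());
    return (uint32_t)_mm_cvtsi128_si32(vsum)
         + (uint32_t)_mm_extract_epi16(vsum, 4);
}
```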
