
I need to horizontally sum a __m128i that holds 16 epi8 values. The XOP instructions would make this trivial, but I don't have them available.

My current method is:

hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
hd = _mm_hadd_epi16(hd, hd);
hd = _mm_hadd_epi16(hd, hd);

Is there a better way using instruction sets up to SSE4.1?

Paul R
Chase R Lewis
  • Related: [How to count character occurrences using SIMD](https://stackoverflow.com/q/54541129) sums up `_mm256_cmpeq_epi8` results, needing this operation as one of the steps in the outer loop. – Peter Cordes May 16 '21 at 13:59
  • Related: [How to horizontally sum signed bytes in XMM](https://stackoverflow.com/q/70370454) shows how to extend this for signed bytes. (And optionally, how to only sum 9 bytes instead of a full 16.) – Peter Cordes Dec 17 '21 at 05:25

1 Answer


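You can do it with SSE2's _mm_sad_epu8 (psadbw) against a zero vector: the sum of absolute differences from zero is just the sum of each group of 8 unsigned bytes, left in the low 16 bits of each 64-bit half, e.g.: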

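#include <emmintrin.h>  // SSE2
#include <stdint.h>

static inline uint32_t _mm_sum_epu8(const __m128i v)
{
    // Each 64-bit half of vsum holds the sum of its 8 unsigned bytes (at most 2040).
    __m128i vsum = _mm_sad_epu8(v, _mm_setzero_si128());
    // Low half via movd, high half via pextrw from 16-bit element 4.
    return _mm_cvtsi128_si32(vsum) + _mm_extract_epi16(vsum, 4);
}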
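
The function above treats the bytes as unsigned, while the code in the question sign-extends them with _mm_cvtepi8_epi16. If signed sums are what you need, one common approach is to flip the sign bit of every byte (which adds 128 to each value as seen by psadbw) and subtract the total bias of 16 * 128 at the end. A minimal sketch of that idea follows; the name `sum_epi8_sse2` is made up for illustration and is not part of any library.

```c
#include <emmintrin.h>  /* SSE2 */
#include <stdint.h>

/* Hypothetical helper: horizontal sum of 16 SIGNED bytes using only SSE2. */
static inline int32_t sum_epi8_sse2(__m128i v)
{
    /* XOR with 0x80 maps signed s to unsigned s + 128 in every byte. */
    __m128i biased = _mm_xor_si128(v, _mm_set1_epi8(-128));
    /* psadbw against zero: each 64-bit half holds the sum of its 8 biased bytes. */
    __m128i vsum = _mm_sad_epu8(biased, _mm_setzero_si128());
    int32_t lo = _mm_cvtsi128_si32(vsum);     /* sum of bytes 0..7  */
    int32_t hi = _mm_extract_epi16(vsum, 4);  /* sum of bytes 8..15 */
    /* Remove the bias that was added to all 16 bytes. */
    return lo + hi - 16 * 128;
}
```

The result range is -2048 to 2032, so it fits comfortably in an int32_t.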

If you're summing more than one vector of bytes, accumulate the vsum results with _mm_add_epi32 (or _mm_add_epi64), and do the final horizontal sum of the two 32-bit (or 64-bit) halves to scalar only once at the end.
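
As a rough sketch of that accumulation pattern, assuming an input buffer of unsigned bytes whose length is a multiple of 16 (the function name `sum_bytes_u8` and that length restriction are choices made for this example only):

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: sums n unsigned bytes; n must be a multiple of 16. */
static uint64_t sum_bytes_u8(const uint8_t *p, size_t n)
{
    assert(n % 16 == 0);
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_setzero_si128();  /* two independent 64-bit partial sums */

    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128((const __m128i *)(p + i));
        /* psadbw yields one small sum per 64-bit half; add them lane-wise. */
        acc = _mm_add_epi64(acc, _mm_sad_epu8(v, zero));
    }

    /* Single horizontal step at the end: high qword + low qword. */
    acc = _mm_add_epi64(acc, _mm_unpackhi_epi64(acc, acc));

    uint64_t total;
    _mm_storel_epi64((__m128i *)&total, acc);  /* store the low 64 bits */
    return total;
}
```

Using 64-bit lanes here avoids any overflow concern for realistic buffer sizes; with _mm_add_epi32 the same structure works as long as each partial sum stays below 2^32.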

Peter Cordes
Paul R
  • Compilers fail to optimize `_mm_extract_epi16(vsum, 0)` into `movd` - they don't realize that the upper 2 bytes of the low dword will be 0 so they actually use `pextrw eax, xmm0, 0`. https://godbolt.org/z/TMb8rc1j4. Use `_mm_cvtsi128_si32(vsum)` instead to save a shuffle uop. I fixed that for you. – Peter Cordes May 16 '21 at 13:51