You can abuse PSADBW
to calculate horizontal sums of bytes without overflow.
For example:
pxor xmm0, xmm0
psadbw xmm0, [a + 0] ; two 16-bit sums, one per 64-bit half
pxor xmm1, xmm1
psadbw xmm1, [a + 16]
paddw xmm0, xmm1 ; accumulate vertically
pshufd xmm1, xmm0, 2 ; bring down the high half
paddw xmm0, xmm1 ; low word in xmm0 is the total sum
; movd eax, xmm0 ; higher bytes are zero so efficient dword extract is fine
Intrinsics version:
#include <immintrin.h>
#include <stdint.h>
// use loadu instead of load if 16-byte alignment of a[] isn't guaranteed
unsigned sum_32x8(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();
    __m128i sum0 = _mm_sad_epu8( zero,
                       _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8( zero,
                       _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi32(sum0, sum1);
    __m128i totalsum = _mm_add_epi32(sum2, _mm_shuffle_epi32(sum2, 2));
    return _mm_cvtsi128_si32(totalsum);
}
This portably compiles back to the same asm, as you can see on Godbolt.
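As a quick sanity check, a minimal test harness (my addition, not part of the answer above) could look like this; it assumes sum_32x8 is in scope and uses C++11 alignas so the aligned loads are legal:
#include <stdio.h>
#include <stdint.h>

int main()
{
    alignas(16) uint8_t a[32];      // 16-byte aligned: sum_32x8 uses _mm_load_si128
    unsigned expected = 0;
    for (unsigned i = 0; i < 32; i++) {
        a[i] = (uint8_t)i;          // bytes 0..31
        expected += i;              // scalar reference sum = 496
    }
    unsigned got = sum_32x8(a);
    printf("sum = %u, expected %u\n", got, expected);
    return got == expected ? 0 : 1;
}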
The reinterpret_cast<const __m128i*> is necessary because before AVX-512, Intel's intrinsics for integer vector load/store take __m128i* pointer args instead of a more convenient void*. Some prefer more compact C-style casts like _mm_loadu_si128( (const __m128i*) &a[16] ) as a style choice.
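For example, the unaligned version of the two loads above, written in the compact C-style cast form, might look like this (just a style illustration; zero and a are the same variables as in sum_32x8):
__m128i sum0 = _mm_sad_epu8( zero, _mm_loadu_si128( (const __m128i*) &a[0] ) );
__m128i sum1 = _mm_sad_epu8( zero, _mm_loadu_si128( (const __m128i*) &a[16] ) );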
16 vs. 32 vs. 64-bit SIMD element size doesn't matter much; 16 and 32 are equally efficient on all machines, and 32-bit will avoid overflow even if you use this for summing much larger arrays. (paddq is slower on some old CPUs like Core 2; see https://agner.org/optimize/ and https://uops.info/.) Extracting as 32-bit is definitely more efficient than _mm_extract_epi16 (pextrw).
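If you do use this for much larger arrays, the same trick works in a loop: psadbw each 16-byte chunk and paddd the results into one accumulator vector, only reducing across halves once at the end. Each iteration adds at most 8*255 = 2040 per 32-bit lane, so the accumulators won't overflow until you've summed tens of MiB. A sketch (assuming the length is a multiple of 16; sum_bytes is just a made-up name, not from any library):
#include <immintrin.h>
#include <stdint.h>
#include <stddef.h>

unsigned sum_bytes(const uint8_t *p, size_t n)   // n assumed to be a multiple of 16
{
    __m128i zero = _mm_setzero_si128();
    __m128i acc  = zero;
    for (size_t i = 0; i < n; i += 16) {
        __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(p + i));
        // psadbw against zero: two 16-bit sums, one per 64-bit half
        acc = _mm_add_epi32(acc, _mm_sad_epu8(v, zero));
    }
    // bring the high half down and add; the total ends up in the low dword
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, 2));
    return _mm_cvtsi128_si32(acc);
}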