
I am trying to use SIMD instructions to speed up the sum of the elements in an array of uint8_t (i.e., a sum reduction). For that purpose, I am replicating the top-voted answer to this question:

[Sum reduction of unsigned bytes without overflow, using SSE2 on Intel](https://stackoverflow.com/questions/10932550/sum-reduction-of-unsigned-bytes-without-overflow-using-sse2-on-intel)

The procedure for the sum reduction shown in that answer is this:

uint16_t sum_32(const uint8_t a[32])
{
    __m128i zero = _mm_xor_si128(zero, zero);
    __m128i sum0 = _mm_sad_epu8(
                        zero,
                        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
                        zero,
                        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    return totalsum.m128i_u16[0];
}

My problem is that the return expression (totalsum.m128i_u16[0]) seems to be available only with Microsoft's compiler, but I am working on UNIX-based platforms.

I reviewed the list of SIMD intrinsics, and _mm_storeu_ps(a, t) seems to do something similar to what I need, but t has to be a __m128 and a a float*. I tried to use it by casting my result from __m128i to __m128, but it didn't work.
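
To illustrate the kind of store-and-read-back workaround I have in mind (using the integer store _mm_store_si128 instead of _mm_storeu_ps; buf is just a scratch array for this example, and totalsum is the vector from the function above):

alignas(16) uint16_t buf[8];
_mm_store_si128(reinterpret_cast<__m128i*>(buf), totalsum);  // dump the whole vector to memory
uint16_t result = buf[0];                                     // read back the low 16 bits

Going through memory just to read one element seems clunky, though.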

Is there another way to retrieve the first 16 bits of a __m128i variable and store them in a uint16_t variable? I am very new to SIMD programming.

BTW, is there a better way to implement the sum reduction? That answer is from 9 years ago, so I imagine there are better alternatives now.

AAA
  • I tidied up harold's answer on [Sum reduction of unsigned bytes without overflow, using SSE2 on Intel](https://stackoverflow.com/a/10933578) - thanks for pointing out that it was old and weird, using MSVC-specific stuff. The asm was correct, and I'd seen many good answers from harold, but that one was apparently from his early days with intrinsics, and I didn't notice the problems when linking it in other places (like my canonical answer on [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026)). – Peter Cordes Mar 25 '22 at 15:36
  • Fixed one last thing I missed before: the manual xor-zeroing that reads the variable as part of its own initializer. Updated the linked answer. – Peter Cordes Mar 25 '22 at 18:35
  • N.B.: Instead of `_mm_xor_si128(zero, zero)` just use `_mm_setzero_si128()`. – chtz Mar 25 '22 at 19:01
  • @chtz: That was one of multiple things I fixed in [the old answer that code was copied from](https://stackoverflow.com/questions/10932550/sum-reduction-of-unsigned-bytes-without-overflow-using-sse2-on-intel/10933578#10933578). – Peter Cordes Mar 27 '22 at 16:33

1 Answer


Use _mm_extract_epi16 for a compile-time-constant index.
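
For example, with the totalsum vector from the question's code (the index must be a compile-time constant):

uint16_t low = static_cast<uint16_t>(_mm_extract_epi16(totalsum, 0));  // pextrw: reads the 16-bit element at index 0, zero-extended into a GPR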

For the first element, _mm_cvtsi128_si32 gives more efficient instructions (a single movd). It works here because:

  • _mm_sad_epu8 sets bits 16 through 63 to zero
  • the uint16_t return type truncates the result to 16 bits anyway

Compilers may be able to do this optimization on their own based on either of these facts, but not all of them do, so it is better to use _mm_cvtsi128_si32 explicitly; a sketch follows below.
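
As a sketch, here is the question's sum_32 with a portable return (same body as in the question, with _mm_setzero_si128 as suggested in the comments, and only the final extraction changed):

#include <emmintrin.h>   // SSE2 intrinsics
#include <cstdint>

uint16_t sum_32(const uint8_t a[32])
{
    __m128i zero = _mm_setzero_si128();   // avoids the self-referencing xor from the original
    __m128i sum0 = _mm_sad_epu8(
                        zero,
                        _mm_load_si128(reinterpret_cast<const __m128i*>(a)));
    __m128i sum1 = _mm_sad_epu8(
                        zero,
                        _mm_load_si128(reinterpret_cast<const __m128i*>(&a[16])));
    __m128i sum2 = _mm_add_epi16(sum0, sum1);
    __m128i totalsum = _mm_add_epi16(sum2, _mm_shuffle_epi32(sum2, 2));
    return static_cast<uint16_t>(_mm_cvtsi128_si32(totalsum));  // movd to a GPR, then truncate to 16 bits
}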

Alex Guteniev
  • @AAA: And if you want *all* the elements separately, often storing to an array and looping is good: [print a \_\_m128i variable](https://stackoverflow.com/a/46752535) (If you want to sum them, see [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) - `psadbw` is still optimal.) – Peter Cordes Mar 25 '22 at 15:17
  • @Alex: compilers couldn't optimize `_mm_extract_epi16(v, 0)` into `movd` unless they could prove that the low dword of the vector was a zero-extended word; maybe you're thinking of `_mm_extract_epi32(v, 0)`. Since this is doing a `paddw` on `psadbw` results, that will be the case (and for only a couple vectors, `paddw` is wide enough not to truncate the result). – Peter Cordes Mar 25 '22 at 15:20
  • @PeterCordes, actually clang is able to take advantage of truncation and uses `movd` https://godbolt.org/z/e58arzn4f – Alex Guteniev Mar 25 '22 at 15:26
  • Yup, figured clang might be able to. The right answer here is still to use `movd`, since this is a psadbw result, so your edit unfortunately made your answer more correct in general but less helpful for this case. – Peter Cordes Mar 25 '22 at 15:30
  • @PeterCordes, got it, fixed. – Alex Guteniev Mar 25 '22 at 15:35
  • Much better. BTW, I updated harold's answer on [Sum reduction of unsigned bytes without overflow, using SSE2 on Intel](https://stackoverflow.com/a/10933578) to be good, using `paddd` intrinsics instead of `paddw` (so it's extensible to larger arrays), and using a movd intrinsic. And an `unsigned` return value to avoid making compilers waste instructions truncating / re-widening the `movd` result; looks like we both noticed that problem, too. – Peter Cordes Mar 25 '22 at 15:38