
I have an `__m256i` vector holding 16-bit short values, for example (lo -> hi):

2140 4635 5716 4331 1863 0 0 0 0 0 0 0 0 0 0 0

I need to get the sum of these values (18685) using AVX intrinsics, but my assignment spec says I must not use high-latency, low-throughput instructions such as hadd, and all the parallel parts must be fully vectorized.

I've tried following these answers and converting them to work with `__m256i`, but to no avail: "Fastest way to do horizontal vector sum with AVX instructions [duplicate]" and "Fastest way to do horizontal SSE vector sum (or other reduction)".


Example of what I've tried so far

original

double hsum_double_avx(__m256d v) {
    __m128d vlow = _mm256_castpd256_pd128(v);
    __m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
    vlow = _mm_add_pd(vlow, vhigh);     // reduce down to 128

    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return  _mm_cvtsd_f64(_mm_add_sd(vlow, high64));  // reduce to scalar
}

mine

int hsum_int_avx(__m256i v) {

    __m128i vLow = _mm256_castsi256_si128(v);
    __m128i vHigh = _mm256_extracti128_si256(v, 0);

    vLow = _mm_add_epi32(vLow, vHigh);
    __m128i high64 = _mm_unpackhi_epi32(vLow, vLow);

    return _mm_cvtsi128_si32(_mm_add_epi64(vLow, high64));
}

Output from mine

vLow: 2140 4635 5716 4331 1863 0 0 0

vHigh: 0 0 0 0 0 0 0 0

vLow: 2140 4635 5716 4331 1863 0 0 0

high64: 1863 0 1863 0 0 0 0 0

return value: 4003?
  • You *definitely* don't want to use `_mm_add_pd`; that treats the 128-bit vector as 2x `double` in IEEE binary64 format, not 8x int16_t. That's what the `pd` means: you want to be using `epi16` additions. (Or if you need to avoid overflow, widen to 32-bit with `pmaddwd` against `set1_epi16(1)` then follow the answers you linked for 32-bit hsums.) – Peter Cordes Jul 26 '20 at 22:15
  • Do you need to avoid overflow, i.e. produce a 32-bit result, or do you want to truncate the sum to 16 bits? – Peter Cordes Jul 26 '20 at 22:18
  • [Fastest way to do horizontal SSE vector sum (or other reduction)](https://stackoverflow.com/a/35270026) which you linked *does* already describe how to do the int16_t case. Since this is homework, I'm not going to do it for you. Start with [SIMD: Accumulate Adjacent Pairs](https://stackoverflow.com/q/55057933) to get 8x 32-bit elements from your 16x 16-bit `__m256i`. Then do a normal 32-bit hsum as in `hsum_8x32` from [Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2](https://stackoverflow.com/q/60108658) – Peter Cordes Jul 26 '20 at 22:25
  • @PeterCordes Thank you I've got it working now! – Venis Kenis Jul 26 '20 at 23:05
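
For readers who land here later, the steps described in the comments can be sketched roughly as follows. This is only one possible reading of that advice, not the asker's final code: `pmaddwd` against a vector of ones widens adjacent 16-bit pairs into 32-bit sums, and a shuffle-and-add reduction finishes the job without `hadd`. The names `hsum_16x16`, `sum16_i16`, and the `TARGET_AVX2` macro are made up for this illustration, and the `target` attribute assumes GCC or Clang (with another compiler, drop the macro and enable AVX2 through its own options).

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. Enables AVX2 per function so no -mavx2 flag is needed. */
#define TARGET_AVX2 __attribute__((target("avx2")))

/* Hypothetical helper: sum of 16 signed 16-bit elements as a 32-bit result. */
TARGET_AVX2
static inline int32_t hsum_16x16(__m256i v)
{
    /* vpmaddwd: multiply each int16 by 1 and add adjacent pairs -> 8 x int32 */
    __m256i sum32 = _mm256_madd_epi16(v, _mm256_set1_epi16(1));

    /* add the high 128-bit lane to the low lane -> 4 x int32 */
    __m128i lo = _mm256_castsi256_si128(sum32);
    __m128i hi = _mm256_extracti128_si256(sum32, 1);   /* note: index 1, not 0 */
    __m128i s  = _mm_add_epi32(lo, hi);

    /* add the upper 64 bits to the lower 64 bits */
    s = _mm_add_epi32(s, _mm_unpackhi_epi64(s, s));

    /* swap elements 0 and 1 (and 2 and 3), then add: element 0 holds the total */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));

    return _mm_cvtsi128_si32(s);
}

/* Hypothetical wrapper: load 16 int16_t values (unaligned) and sum them. */
TARGET_AVX2
int32_t sum16_i16(const int16_t *p)
{
    return hsum_16x16(_mm256_loadu_si256((const __m256i *)p));
}
```

Calling `sum16_i16` on an array holding the example values 2140, 4635, 5716, 4331, 1863 followed by eleven zeros should return 18685. Because the pairs are widened to 32 bits before any further addition, the result cannot overflow for 16 inputs. If a sum truncated to 16 bits is acceptable instead, plain `epi16` adds with shuffles would also work.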
