I have an __m256i vector holding 16-bit short values, for example (lo -> hi):
2140 4635 5716 4331 1863 0 0 0 0 0 0 0 0 0 0 0
I need to get the sum of these values (18685) using AVX intrinsics, but my assignment spec says I must not use high-latency, low-throughput instructions like hadd, and all the parallel parts must be fully vectorized.
I've tried following the answers to "Fastest way to do horizontal vector sum with AVX instructions [duplicate]" and "Fastest way to do horizontal SSE vector sum (or other reduction)" and converting them to work with __m256i, but to no avail.
Example of what I've tried so far:
Original (from the linked answer):
double hsum_double_avx(__m256d v) {
    __m128d vlow = _mm256_castpd256_pd128(v);
    __m128d vhigh = _mm256_extractf128_pd(v, 1); // high 128
    vlow = _mm_add_pd(vlow, vhigh); // reduce down to 128
    __m128d high64 = _mm_unpackhi_pd(vlow, vlow);
    return _mm_cvtsd_f64(_mm_add_sd(vlow, high64)); // reduce to scalar
}
Mine:
int hsum_int_avx(__m256i v) {
    __m128i vLow = _mm256_castsi256_si128(v);
    __m128i vHigh = _mm256_extracti128_si256(v, 0);
    vLow = _mm_add_epi32(vLow, vHigh);
    __m128i high64 = _mm_unpackhi_epi32(vLow, vLow);
    return _mm_cvtsi128_si32(_mm_add_epi64(vLow, high64));
}
Output from mine (each vector printed as 16-bit values):
vLow: 2140 4635 5716 4331 1863 0 0 0
vHigh: 0 0 0 0 0 0 0 0
vLow: 2140 4635 5716 4331 1863 0 0 0
high64: 1863 0 1863 0 0 0 0 0
return value: 4003?
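For comparison, here is a minimal sketch of one way the reduction could be written for 16-bit elements. It is not taken from either linked answer, so treat it as an assumption about what a working version might look like. The attempt above uses 32-bit and 64-bit operations on data that is packed as 16-bit lanes, which would explain why only the first element and 1863 end up in the result. The sketch first widens the lanes with _mm256_madd_epi16 against a vector of ones (each signed 16-bit value is multiplied by 1 and adjacent pairs are added into 32-bit lanes), then reduces the 32-bit lanes with shuffles and adds, without hadd. The names hsum_epi16_avx2, hsum_epi16_array, and TARGET_AVX2 are hypothetical. The sketch assumes AVX2 is available and that the values are signed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Assumption: GCC/Clang. The target attribute lets this compile without
   -mavx2; on other compilers the macro expands to nothing. */
#if defined(__GNUC__) || defined(__clang__)
#define TARGET_AVX2 __attribute__((target("avx2")))
#else
#define TARGET_AVX2
#endif

/* Horizontal sum of 16 signed 16-bit lanes, returned as a 32-bit int. */
TARGET_AVX2
static inline int hsum_epi16_avx2(__m256i v) {
    /* Multiply every lane by 1 and add adjacent pairs: 16 x int16 -> 8 x int32. */
    __m256i sum32 = _mm256_madd_epi16(v, _mm256_set1_epi16(1));

    /* Add the high 128 bits to the low 128 bits: 8 x int32 -> 4 x int32. */
    __m128i lo = _mm256_castsi256_si128(sum32);
    __m128i hi = _mm256_extracti128_si256(sum32, 1);
    __m128i s = _mm_add_epi32(lo, hi);

    /* Add the high 64 bits to the low 64 bits: 2 partial sums remain. */
    __m128i hi64 = _mm_unpackhi_epi64(s, s);
    s = _mm_add_epi32(s, hi64);

    /* Swap adjacent 32-bit lanes and add: lane 0 now holds the total. */
    __m128i swapped = _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1));
    s = _mm_add_epi32(s, swapped);

    return _mm_cvtsi128_si32(s);
}

/* Convenience wrapper: sums 16 int16_t values read from (unaligned) memory. */
TARGET_AVX2
int hsum_epi16_array(const int16_t *p) {
    __m256i v = _mm256_loadu_si256((const __m256i *)p);
    return hsum_epi16_avx2(v);
}
```

Widening to 32 bits before adding also means the total cannot overflow a 16-bit lane; with the example values the pairwise sums are 6775, 10047, and 1863, which add up to 18685.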