I need to horizontally add a __m128i that holds 16 epi8 values. The XOP instructions would make this trivial, but I don't have them available.
My current method is:
hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum), _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
hd = _mm_hadd_epi16(hd, hd);
hd = _mm_hadd_epi16(hd, hd);
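For reference, here is the snippet above as a self-contained sketch that can be checked against a scalar sum. Several pieces are assumptions that are not in the snippet itself: the function names `hsum_epi8_hadd` and `hsum_epi8_selfcheck` are hypothetical, `swap` is assumed to be a mask that moves bytes 8..15 into the low half, and the final scalar addition of the two remaining 16-bit partial sums is only a guess at how the result is extracted. The `target` attribute is GCC/Clang specific and just avoids needing `-msse4.1` on the command line.

```c
#include <assert.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1 header; also provides the SSSE3/SSE2 intrinsics used here */

/* GCC/Clang only: enable SSE4.1 per function instead of via -msse4.1. */
#define TARGET_SSE41 __attribute__((target("sse4.1")))

/* Hypothetical wrapper: returns the sum of the 16 signed bytes in `sum`. */
TARGET_SSE41
static int hsum_epi8_hadd(__m128i sum)
{
    /* Assumed definition of `swap`: bring bytes 8..15 down to bytes 0..7. */
    const __m128i swap = _mm_setr_epi8(8, 9, 10, 11, 12, 13, 14, 15,
                                       0, 1, 2, 3, 4, 5, 6, 7);

    /* Sign-extend each half to 16 bits, then add adjacent pairs. */
    __m128i hd = _mm_hadd_epi16(_mm_cvtepi8_epi16(sum),
                                _mm_cvtepi8_epi16(_mm_shuffle_epi8(sum, swap)));
    hd = _mm_hadd_epi16(hd, hd);
    hd = _mm_hadd_epi16(hd, hd);

    /* After three hadd steps, lane 0 holds the sum of bytes 0..7 and
       lane 1 the sum of bytes 8..15. Assumed final step: add them in
       scalar code, casting to int16_t to restore the sign. */
    return (int16_t)_mm_extract_epi16(hd, 0) +
           (int16_t)_mm_extract_epi16(hd, 1);
}

/* Compares the SIMD result with a plain scalar loop on one test vector. */
TARGET_SSE41
static void hsum_epi8_selfcheck(void)
{
    int8_t v[16];
    int expected = 0;
    for (int i = 0; i < 16; i++) {
        v[i] = (int8_t)(i * 17 - 128);  /* covers -128 .. 127 */
        expected += v[i];
    }
    assert(hsum_epi8_hadd(_mm_loadu_si128((const __m128i *)v)) == expected);
}
```

Written this way, the reduction costs one shuffle, two sign extensions, three `phaddw` and two extracts per vector, which is the baseline any alternative would need to beat.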
Is there a better way using only instructions up to SSE4.1?