
I am not familiar with x86_64 intrinsics. I'd like to implement the following operation using 256-bit vector registers. I was using _mm256_maddubs_epi16(a, b); however, this instruction seems to have an overflow issue, since the sum of two adjacent char*char products can exceed the 16-bit maximum value. I also have trouble understanding _mm256_unpackhi_epi32 and related instructions.

Can anyone explain this and point me in the right direction? Thank you!

int sumup_char_arrays(char *A, char *B, int size) {
    assert (size % 32 == 0);
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += A[i]*B[i];
    }
    return sum;
}
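To make the overflow concern concrete, here is a minimal sketch using the 128-bit form, `_mm_maddubs_epi16`. That instruction treats its first operand as unsigned bytes and its second as signed bytes, then adds each adjacent pair of products with signed 16-bit saturation. The helper name `maddubs_first_lane` is hypothetical, and the `target` attribute is a GCC/Clang-specific assumption so the snippet builds without `-mssse3`.

```c
#include <tmmintrin.h>  /* SSSE3: _mm_maddubs_epi16 */

/* Hypothetical demo helper (not from the original post).
   Fills every byte of both operands with the same value, so each
   16-bit result lane holds saturate16(a*b + a*b). Returns lane 0. */
__attribute__((target("ssse3")))  /* GCC/Clang-specific assumption */
int maddubs_first_lane(unsigned char a, signed char b) {
    __m128i va = _mm_set1_epi8((char)a);   /* interpreted as unsigned bytes */
    __m128i vb = _mm_set1_epi8(b);         /* interpreted as signed bytes */
    __m128i r  = _mm_maddubs_epi16(va, vb);
    /* Take the low 16 bits and reinterpret them as a signed value. */
    return (short)_mm_cvtsi128_si32(r);
}
```

With small inputs the result is exact, for example `maddubs_first_lane(10, 3)` gives 60. With `a = 255` and `b = 127` the true pair sum is 64770, which does not fit in a signed 16-bit lane, so the instruction saturates to 32767.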
Alex Xie
    I suggest you write an SSE version first, e.g. unpacking to 16 bits and then using `_mm_madd_epi16` to do the heavy lifting. That's probably enough of a challenge for a beginner, without all the fiddly split lane issues on AVX. You can always go from SSE to AVX later if you feel you need to. – Paul R Oct 17 '16 at 14:26
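The SSE approach suggested in the comment above might look roughly like the following. This is a sketch under assumptions, not Paul R's code: the name `sumup_char_arrays_sse2` is hypothetical, `char` is assumed to be signed (as it is on typical x86-64 compilers), and the size is assumed to be a multiple of 16. It sign-extends bytes to 16 bits with plain SSE2 (unpack a register with itself, then arithmetic shift right by 8) and lets `_mm_madd_epi16` do the multiply and pairwise add into 32-bit lanes.

```c
#include <assert.h>
#include <emmintrin.h>  /* SSE2 */

/* Hypothetical SSE2 version of the scalar loop; assumes signed char. */
int sumup_char_arrays_sse2(const char *A, const char *B, int size) {
    assert(size % 16 == 0);
    __m128i acc = _mm_setzero_si128();  /* four 32-bit partial sums */
    for (int i = 0; i < size; i += 16) {
        /* Unaligned loads: no alignment is assumed for A or B. */
        __m128i a = _mm_loadu_si128((const __m128i *)(A + i));
        __m128i b = _mm_loadu_si128((const __m128i *)(B + i));
        /* Sign-extend bytes to 16 bits: duplicate each byte into both
           halves of a 16-bit lane, then shift right arithmetically by 8. */
        __m128i a_lo = _mm_srai_epi16(_mm_unpacklo_epi8(a, a), 8);
        __m128i a_hi = _mm_srai_epi16(_mm_unpackhi_epi8(a, a), 8);
        __m128i b_lo = _mm_srai_epi16(_mm_unpacklo_epi8(b, b), 8);
        __m128i b_hi = _mm_srai_epi16(_mm_unpackhi_epi8(b, b), 8);
        /* Multiply 16-bit lanes and add adjacent products into 32 bits. */
        acc = _mm_add_epi32(acc, _mm_madd_epi16(a_lo, b_lo));
        acc = _mm_add_epi32(acc, _mm_madd_epi16(a_hi, b_hi));
    }
    /* Horizontal sum of the four 32-bit lanes. */
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(1, 0, 3, 2)));
    acc = _mm_add_epi32(acc, _mm_shuffle_epi32(acc, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(acc);
}
```

Because each pair sum is at most 2 * 128 * 128 = 32768, the per-iteration results fit comfortably in 32 bits; only very large arrays could overflow the 32-bit accumulators.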

1 Answer


I've figured out a solution. Any ideas for improving it, especially the final reduction stage?

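#include <assert.h>
#include <immintrin.h>

int sumup_char_arrays(char *A, char *B, int size) {
    assert (size % 32 == 0);
    int sum = 0;
    // Eight 32-bit partial sums; must start at zero.
    __m256i sum_tmp = _mm256_setzero_si256();
    for (int i = 0; i < size; i += 32) {
        // Sign-extend 16 chars at a time to 16-bit lanes, starting at offset i.
        // Unaligned loads, since A and B are not guaranteed to be 16-byte aligned.
        __m256i ma_l = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(A+i)));
        __m256i ma_h = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(A+i+16)));
        __m256i mb_l = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(B+i)));
        __m256i mb_h = _mm256_cvtepi8_epi16(_mm_loadu_si128((__m128i*)(B+i+16)));
        // Multiply 16-bit lanes and add adjacent products into 32-bit lanes;
        // this is the vector form of sum += A[i]*B[i].
        __m256i mc = _mm256_madd_epi16(ma_l, mb_l);
        mc = _mm256_add_epi32(mc, _mm256_madd_epi16(ma_h, mb_h));
        sum_tmp = _mm256_add_epi32(mc, sum_tmp);
    }
    // Horizontal reduction: fold the high 128-bit lane onto the low one,
    // then fold within the lane (the byte shifts operate per 128-bit lane).
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_permute2x128_si256(sum_tmp, sum_tmp, 0x81));
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_srli_si256(sum_tmp, 8));
    sum_tmp = _mm256_add_epi32(sum_tmp, _mm256_srli_si256(sum_tmp, 4));
    sum = _mm256_extract_epi32(sum_tmp, 0);
    return sum;
}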
Alex Xie
    Looks good to me, unless one of your char arrays can be treated as unsigned so you can use [PMADDUBSW](http://www.felixcloutier.com/x86/PMADDUBSW.html). The horizontal reduction doesn't need permute, only extract and cast down to 128. See [this answer](http://stackoverflow.com/questions/6996764/fastest-way-to-do-horizontal-float-vector-sum-on-x86) for probably-optimal patterns for horizontal sums that might save a couple code bytes vs. yours. – Peter Cordes Oct 18 '16 at 08:01
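The reduction described in the comment above (extract the high half, cast the low half down to 128 bits, then finish in a 128-bit register) could be sketched as follows. This is one reading of that suggestion, not Peter Cordes's exact code. The names `hsum_epi32_avx2` and `hsum_i32x8` are hypothetical, and the `target` attribute is a GCC/Clang-specific assumption so the snippet builds without `-mavx2`; it still needs an AVX2-capable CPU at run time.

```c
#include <immintrin.h>

/* GCC/Clang-specific assumption: enable AVX2 per function. */
#define AVX2_FN __attribute__((target("avx2")))

/* Hypothetical helper: sum the eight 32-bit lanes of v. */
AVX2_FN static inline int hsum_epi32_avx2(__m256i v) {
    __m128i lo = _mm256_castsi256_si128(v);       /* low half, no instruction needed */
    __m128i hi = _mm256_extracti128_si256(v, 1);  /* high half */
    __m128i s  = _mm_add_epi32(lo, hi);           /* 8 lanes -> 4 lanes */
    /* Swap 64-bit halves and add: 4 lanes -> 2 distinct sums. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    /* Swap adjacent 32-bit lanes and add: every lane now holds the total. */
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}

/* Hypothetical wrapper so the helper can be called from plain C code. */
AVX2_FN int hsum_i32x8(const int *p) {
    return hsum_epi32_avx2(_mm256_loadu_si256((const __m256i *)p));
}
```

In the answer's function, the three reduction lines and the final extract would become a single call such as `sum = hsum_epi32_avx2(sum_tmp);`, provided the calling function is also compiled with AVX2 enabled.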