I am not familiar with x86_64 intrinsics. I'd like to implement the function below using 256-bit vector registers. I was using _mm256_maddubs_epi16(a, b); however, it seems this instruction has an overflow issue: it treats its first operand as unsigned bytes, and the sum of two adjacent products can exceed the signed 16-bit range, in which case the result saturates. I also have trouble understanding _mm256_unpackhi_epi32 and related instructions.
Could anyone explain how these work and point me toward the right approach? Thank you!
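#include <assert.h>

/* Scalar reference: sum of element-wise products of two char arrays. */
int sumup_char_arrays(char *A, char *B, int size) {
    assert(size % 32 == 0);
    int sum = 0;
    for (int i = 0; i < size; i++) {
        sum += A[i] * B[i];  /* each char is promoted to int before the multiply */
    }
    return sum;
}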
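
For reference, here is a rough sketch of one direction that might avoid the 16-bit saturation. It is not tested beyond a couple of small cases, and it rests on several assumptions: the CPU supports AVX2, the compiler is GCC or Clang, and plain char is signed (as it usually is on x86-64). The idea is to sign-extend the bytes to 16 bits with _mm256_cvtepi8_epi16 and then use _mm256_madd_epi16, which adds adjacent products into 32-bit lanes. The names sumup_char_arrays_simd, dot_i8_avx2, dot_i8_scalar and HAVE_AVX2_SKETCH are made up for this illustration.

```c
#include <assert.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_AVX2_SKETCH 1
#endif

/* Scalar fallback; bytes are treated as signed so it matches the SIMD path. */
static int dot_i8_scalar(const char *A, const char *B, int size) {
    int sum = 0;
    for (int i = 0; i < size; i++)
        sum += (signed char)A[i] * (signed char)B[i];
    return sum;
}

#ifdef HAVE_AVX2_SKETCH
/* The target attribute lets this compile without passing -mavx2. */
__attribute__((target("avx2")))
static int dot_i8_avx2(const char *A, const char *B, int size) {
    __m256i acc = _mm256_setzero_si256();  /* eight 32-bit partial sums */
    for (int i = 0; i < size; i += 16) {
        /* Load 16 bytes from each array and sign-extend them to 16-bit lanes. */
        __m256i a = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(A + i)));
        __m256i b = _mm256_cvtepi8_epi16(_mm_loadu_si128((const __m128i *)(B + i)));
        /* Multiply 16-bit lanes, add adjacent pairs into 32-bit lanes. */
        acc = _mm256_add_epi32(acc, _mm256_madd_epi16(a, b));
    }
    /* Horizontal sum of the eight 32-bit lanes. */
    __m128i s = _mm_add_epi32(_mm256_castsi256_si128(acc),
                              _mm256_extracti128_si256(acc, 1));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(1, 0, 3, 2)));
    s = _mm_add_epi32(s, _mm_shuffle_epi32(s, _MM_SHUFFLE(2, 3, 0, 1)));
    return _mm_cvtsi128_si32(s);
}
#endif

/* Uses the AVX2 path when the CPU reports support, else the scalar loop. */
int sumup_char_arrays_simd(const char *A, const char *B, int size) {
    assert(size % 32 == 0);
#ifdef HAVE_AVX2_SKETCH
    if (__builtin_cpu_supports("avx2"))
        return dot_i8_avx2(A, B, size);
#endif
    return dot_i8_scalar(A, B, size);
}
```

Each signed byte product is at most 16384 in magnitude, so a pair sum fits comfortably in a 32-bit lane. The 32-bit accumulators can still wrap for very long arrays, much like the int sum in the scalar loop can overflow.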