I'm trying to vectorize the following program:
for (i = 0; i < N; i++)
{
    a = arr[i];
    // arithmetic on a here
    count[a]++;
}
Using AVX-512 intrinsics, this becomes something like:
const __m512i one = _mm512_set1_epi64(1); // eight 64-bit lanes, each holding 1
for (i = 0; i < N; i += 8) // N is assumed to be a multiple of 8
{
    __m512i a = _mm512_loadu_epi64(arr + i);
    // arithmetic on a here
    __m512i gather_a = _mm512_i64gather_epi64(a, count, 8); // load count[a] for all 8 lanes
    __m512i added = _mm512_add_epi64(gather_a, one);        // count[a] + 1
    _mm512_i64scatter_epi64(count, a, added, 8);            // store back to count[a]
}
The problem is that the counts produced by the vectorized version are slightly off here and there compared with the scalar loop. Is this related to the atomicity of the AVX-512 gather/scatter intrinsics, or is it some other problem related to aliasing on the array?
Thanks