I see an important performance difference between a SIMD-based sum reduction and its scalar counterpart across different CPU architectures.
The function in question is simple: it receives a 16-byte-aligned vector B of uint8_t elements and a range B[l,r], where l and r are multiples of 16, and it returns the sum of the elements within B[l,r].
This is my code:
// SIMD version of the reduction: processes the range in 16-byte chunks,
// adding up 16 bytes at a time with _mm_sad_epu8 against zero.
inline int simd_sum_red(size_t l, size_t r, const uint8_t* B) {
    __m128i zero = _mm_setzero_si128();
    // Sum of the first chunk B[l..l+15]: _mm_sad_epu8 leaves one partial
    // sum per 64-bit half (32-bit lanes 0 and 2).
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
    l += 16;
    // Accumulate the remaining chunks; the last one starts at l == r.
    while (l <= r) {
        __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
        sum0 = _mm_add_epi32(sum0, sum1);
        l += 16;
    }
    // Combine the two partial sums: bring lane 2 down to lane 0 and add.
    __m128i totalsum = _mm_add_epi32(sum0, _mm_shuffle_epi32(sum0, 2));
    return _mm_cvtsi128_si32(totalsum);
}
// Regular (scalar) reduction over B[l..r].
inline size_t reg_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t acc = 0;
    for (size_t i = l; i <= r; i++) {
        acc += B[i];
    }
    return acc;
}
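As a side note, the way I understand the combine step in simd_sum_red: _mm_sad_epu8 against zero leaves one partial sum per 64-bit half (in the low 16 bits of each half, i.e. 32-bit lanes 0 and 2), and the shuffle with immediate 2 moves lane 2 down to lane 0 before the final add. A minimal self-contained check of that, with example byte values that are purely illustrative and not part of the benchmark:

#include <immintrin.h>
#include <cstdint>
#include <cstdio>

int main() {
    // 16 example bytes: the low half sums to 36, the high half to 100.
    alignas(16) uint8_t bytes[16] = { 1,  2,  3,  4,  5,  6,  7,  8,
                                      9, 10, 11, 12, 13, 14, 15, 16};
    __m128i v   = _mm_load_si128(reinterpret_cast<const __m128i*>(bytes));
    __m128i sad = _mm_sad_epu8(_mm_setzero_si128(), v);          // epi32 lanes: {36, 0, 100, 0}
    __m128i tot = _mm_add_epi32(sad, _mm_shuffle_epi32(sad, 2)); // lane 0: 36 + 100
    std::printf("total = %d (expected 136)\n", _mm_cvtsi128_si32(tot));
    return 0;
}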
It is worth mentioning that I built my SIMD function using the answers from another question I asked a couple of days ago:
Accessing the fields of a __m128i variable in a portable way
For the experiments, I took random ranges of B of at most 256 elements (16 SIMD registers) and measured the average number of nanoseconds each function spent per symbol of B[l,r]. I compared two CPU architectures: an Apple M1 and an Intel(R) Xeon(R) Silver 4110. I used the same source code in both cases and the same compiler (g++) with the flags -std=c++17 -msse4.2 -O3 -funroll-loops -fomit-frame-pointer -ffast-math. The only difference is that on the Apple M1 I had to include an extra header, sse2neon.h, which maps Intel intrinsics to NEON intrinsics (the SIMD extension for ARM-based architectures), and I omitted the -msse4.2 flag in that case.
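For reference, the per-symbol time is measured roughly as in the sketch below. It is a simplified version: the repetition count and the single fixed range are illustrative, and the real harness averages over many random ranges of at most 256 elements.

#include <chrono>
#include <cstddef>
#include <cstdint>

// Simplified sketch of the measurement (illustrative, not the actual harness).
template <class F>
double nanosecs_per_sym(F f, size_t l, size_t r, const uint8_t* B) {
    const int reps = 1000000;     // illustrative repetition count
    volatile uint64_t sink = 0;   // keep the result alive so the loop is not optimized away
    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < reps; ++i) sink = sink + f(l, r, B);
    auto t1 = std::chrono::steady_clock::now();
    double ns = std::chrono::duration<double, std::nano>(t1 - t0).count();
    return ns / (double(reps) * double(r - l + 1));   // nanoseconds per symbol of B[l,r]
}

The reported numbers are then just nanosecs_per_sym(reg_sum_red, l, r, B) and nanosecs_per_sym(simd_sum_red, l, r, B), averaged over the sampled ranges.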
These are the results I obtained with the Apple M1 processor:
nanosecs/sym for reg_sum_red : 1.16952
nanosecs/sym for simd_sum_red : 0.383278
As you can see, there is a substantial difference between using SIMD instructions and not using them.
These are the results with the Intel(R) Xeon(R) Silver 4110 processor:
nanosecs/sym for reg_sum_red : 6.01793
nanosecs/sym for simd_sum_red : 5.94958
In this case, there is not a big difference.
I suppose the reason is the compilers I am using: GNU GCC on Intel versus Apple's g++. So what compiler flags should I pass to GNU GCC (on Intel) to see a performance difference between the SIMD reduction and the regular reduction as large as the one I see on the Apple M1?
Update:
I realized that g++ on macOS is an alias for Clang (as also pointed out by @CodyGray), so I was using different compilers in my previous experiments. I then tried Clang on the Intel architecture and indeed obtained a SIMD-vs-scalar improvement similar to the one on the Apple M1. However, the question remains: is there any modification I can make to either the source code or the compiler flags so that my GCC-compiled code is as efficient as the Clang-compiled one?