I am seeing a significant performance difference between a SIMD-based sum reduction and its scalar counterpart across different CPU architectures.

The function in question is simple: it receives a 16-byte-aligned vector B of uint8_t elements and a range B[l, r], where l and r are multiples of 16, and it returns the sum of the elements within B[l, r].

This is my code:

// SIMD version of the reduction
inline int simd_sum_red(size_t l, size_t r, const uint8_t* B) {

    __m128i zero = _mm_setzero_si128();
    // _mm_sad_epu8 against zero sums the 16 bytes of each block, leaving
    // two partial sums in the low and high 64-bit halves of the register.
    __m128i sum0 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
    l += 16;
    while (l <= r) {
        __m128i sum1 = _mm_sad_epu8(zero, _mm_load_si128(reinterpret_cast<const __m128i*>(B + l)));
        sum0 = _mm_add_epi32(sum0, sum1);
        l += 16;
    }

    // Fold the high 64-bit half onto the low one and extract the total.
    __m128i totalsum = _mm_add_epi32(sum0, _mm_shuffle_epi32(sum0, 2));
    return _mm_cvtsi128_si32(totalsum);
}

// Regular (scalar) reduction
inline size_t reg_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t acc = 0;
    for (size_t i = l; i <= r; i++) {
        acc += B[i];
    }
    return acc;
}

It is worth mentioning that I built my SIMD function using the answers to another question I asked a couple of days ago:

Accessing the fields of a __m128i variable in a portable way

For the experiments, I took random ranges of B of at most 256 elements (16 SIMD registers) and measured the average number of nanoseconds each function spends per symbol of B[l, r]. I compared two CPU architectures: Apple M1 and Intel(R) Xeon(R) Silver 4110. I used the same source code and the same compiler (g++) in both cases, with the compiler flags -std=c++17 -msse4.2 -O3 -funroll-loops -fomit-frame-pointer -ffast-math. The only difference is that on the Apple M1 I had to include an extra header called sse2neon.h, which translates Intel intrinsics to NEON intrinsics (the SIMD instruction set of ARM-based architectures), and I omitted the -msse4.2 flag in that case.
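For reference, this is roughly the shape of the timing loop (an illustrative sketch, not my exact harness; the name ns_per_symbol and the fixed 256-byte ranges are simplifications):

// Illustrative per-symbol timing loop; assumes the two functions above and
// a buffer B of n bytes, with n a multiple of 16 and n >= 256.
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <random>

double ns_per_symbol(size_t n_queries, const uint8_t* B, size_t n) {
    std::mt19937_64 rng(42);
    // Random 16-aligned starting block, chosen so that 256 bytes fit in B.
    std::uniform_int_distribution<size_t> block(0, n / 16 - 16);
    volatile size_t sink = 0;  // keeps the sums from being optimized away
    size_t total_syms = 0;
    auto t0 = std::chrono::steady_clock::now();
    for (size_t q = 0; q < n_queries; ++q) {
        size_t l = block(rng) * 16;
        size_t r = l + 240;            // 16 blocks of 16 bytes = 256 symbols
        sink += simd_sum_red(l, r, B); // or reg_sum_red(l, r, B)
        total_syms += 256;
    }
    auto t1 = std::chrono::steady_clock::now();
    return std::chrono::duration<double, std::nano>(t1 - t0).count() / total_syms;
}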

These are the results I obtained with the Apple M1 processor:

nanosecs/sym for reg_sum_red : 1.16952
nanosecs/sym for simd_sum_red : 0.383278

As you can see, there is a substantial difference between using SIMD instructions and not using them.

These are the results with the Intel(R) Xeon(R) Silver 4110 processor:

nanosecs/sym for reg_sum_red : 6.01793
nanosecs/sym for simd_sum_red : 5.94958

In this case, the difference is negligible.

I suppose the reason is the compilers I am using: gnu-gcc for Intel versus the Apple gcc. So what compiler flags should I pass to gnu-gcc (Intel) to see a performance difference between the SIMD reduction and the regular reduction as large as the one I see on the Apple M1?

Update:

I realized that g++ on macOS is an alias for Clang (as also pointed out by @CodyGray), so I used different compilers in my previous experiments. I have now tried Clang on the Intel architecture, and indeed I obtained speedups similar to those on the Apple M1. However, the question remains: is there any modification I can make, either to the source code or to the compiler flags, to make my GCC-compiled code as efficient as Clang's?

    *"Apple gcc"* Are you sure it's actually GCC? As far as I understand, Apple aliases "gcc" to Clang, so you may well be using the Clang compiler to compile the M1 code. Clang has, in my experience, a significantly better optimizer, especially with intrinsics, than GCC does, so you may not be comparing apples to apples. If that's the case, then there are no flags that you can pass that will fix this. You can't pass flags to GCC to make it optimize intrinsics as well as Clang. – Cody Gray - on strike Mar 27 '22 at 13:08
  • @CodyGray, I realized that gcc is an alias for clang (as you said) on macOS just as I wrote this post. But in any case, I thought there must be a trick to make GCC produce more optimized code (at least for this simple function). – AAA Mar 27 '22 at 15:54
  • `gcc -funroll-loops` (or `-fprofile-generate` / test run / `-fprofile-use`) can help (with `-O3 -march=native` of course), although GCC is still less smart about actually using multiple accumulators to hide latency. That matters most for FP loops; most CPUs have 1c latency for integer stuff. (Although you might get better than 1/clock `vpaddd` throughput out of this loop with AVX1 for your manual vectorized version. Or go 2x faster with AVX2 if cache can keep up.) (See the first sketch below.) – Peter Cordes Mar 27 '22 at 16:14
  • If you look at the disassembly of the scalar code (e.g. on [godbolt](https://godbolt.org/z/YnnfeTfqq)), you can see that the main problem is that the accumulation must extend the bytes to `std::size_t` first. However, if you batch the accumulation into 256-byte blocks, you know that `uint16_t` will be able to hold the sum of a batch, so you only need to extend each byte to `uint16_t`, then extend the sum to `size_t` after every batch. This gives you portable code that will run fast on every processor architecture with a SIMD unit, on every reasonably modern compiler. (See the second sketch below.) – EOF Mar 28 '22 at 15:36
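To make the multiple-accumulator idea from the comments concrete, here is a rough, untested sketch (the name simd_sum_red2 is mine). Two independent accumulators split the _mm_add_epi32 dependency chain so consecutive additions can overlap instead of serializing on a single register:

// Same reduction with two independent accumulators; all intrinsics are SSE2.
#include <emmintrin.h>
#include <cstddef>
#include <cstdint>

inline int simd_sum_red2(size_t l, size_t r, const uint8_t* B) {
    const __m128i zero = _mm_setzero_si128();
    __m128i acc0 = zero, acc1 = zero;
    while (l + 16 <= r) {  // two 16-byte blocks per iteration
        acc0 = _mm_add_epi32(acc0, _mm_sad_epu8(zero,
                   _mm_load_si128(reinterpret_cast<const __m128i*>(B + l))));
        acc1 = _mm_add_epi32(acc1, _mm_sad_epu8(zero,
                   _mm_load_si128(reinterpret_cast<const __m128i*>(B + l + 16))));
        l += 32;
    }
    if (l <= r) {          // odd number of blocks: handle the last one
        acc0 = _mm_add_epi32(acc0, _mm_sad_epu8(zero,
                   _mm_load_si128(reinterpret_cast<const __m128i*>(B + l))));
    }
    __m128i sum = _mm_add_epi32(acc0, acc1);
    sum = _mm_add_epi32(sum, _mm_shuffle_epi32(sum, 2));
    return _mm_cvtsi128_si32(sum);
}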
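And here is a sketch of the portable batched accumulation from the last comment (again untested; the name batched_sum_red is mine). A uint16_t cannot overflow when summing at most 256 bytes (256 × 255 = 65280), so the compiler is free to vectorize the inner loop with 16-bit lanes and widen to size_t only once per batch:

// Portable batched reduction: sum each 256-byte batch into a uint16_t,
// then widen the batch sum into the size_t total.
#include <cstddef>
#include <cstdint>

inline size_t batched_sum_red(size_t l, size_t r, const uint8_t* B) {
    size_t total = 0;
    size_t i = l;
    while (i <= r) {
        size_t batch_end = (r - i + 1 < 256) ? r + 1 : i + 256;
        uint16_t partial = 0;  // safe: at most 256 bytes per batch
        for (; i < batch_end; ++i)
            partial += B[i];
        total += partial;      // one widening per batch instead of per byte
    }
    return total;
}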
