
I created a simple test that uses SIMD to sum an array 4 elements at a time, versus a scalar loop that accumulates into 4 separate sum variables and adds them together at the end. Here's my test case code:

#include <stdio.h> 
#include <stdlib.h> 
#include <time.h>
#include <omp.h>

#include <immintrin.h>
#include <x86intrin.h>

int main() 
{ 
    double time1, time2;
    time1 = time2 = 0;
    int n = 50000000;
    int runs = 5;
    double * test = _mm_malloc(sizeof(double) * n, 32);  /* 32-byte aligned for _mm256_load_pd */

    for(int i = 0; i < n; i++){
        test[i] = i;
    }

    time1 = omp_get_wtime();
    double overalla;
    for(int a = 0; a < runs; a++){
        __m256d accumulate = _mm256_setzero_pd();
        overalla = 0;
        for(int i = 0; i < n; i += 4){  /* one vector accumulator, 4 doubles per add */
            accumulate = _mm256_add_pd(_mm256_load_pd(test + i), accumulate);
        }
        /* horizontal sum: spill the 4 lanes and add them up */
        double result[4] __attribute__ ((aligned (32)));
        _mm256_store_pd(result, accumulate);
        overalla = result[0] + result[1] + result[2] + result[3];
    }
    time1 = omp_get_wtime() - time1;

    double overall;
    time2 = omp_get_wtime();
    for(int a = 0; a < runs; a++){
        double sum1, sum2, sum3, sum4;
        sum1 = sum2 = sum3 = sum4 = 0;
        overall = 0;
        for(int i = 0; i < n; i += 4){  /* 4 scalar accumulators, 4 elements per iteration */
            sum1 += test[i];
            sum2 += test[i+1];
            sum3 += test[i+2];
            sum4 += test[i+3];
        }
        overall = sum1 + sum2 + sum3 + sum4;
    }
    time2 = omp_get_wtime() - time2;

    printf("A: %f, B: %f\n", overalla, overall);
    printf("Time 1: %f, Time 2: %f\n", time1, time2);
    printf("Unroll %f times faster\n", time1/time2);
    _mm_free(test);
    return 0;
}

I expected the SIMD version to be significantly faster (4 adds at once), but that is not the case. Could anyone point me to why? The result I get from running the code is:

A: 1249999975000000.000000, B: 1249999975000000.000000

Time 1: 0.317978, Time 2: 0.207965

Unroll 1.528996 times faster

I am compiling without optimizations; the gcc options are `gcc -fopenmp -mavx -mfma -pthread`.

  • Unroll *and* use SIMD, with multiple vector accumulators, [as shown in this Q&A](https://stackoverflow.com/questions/45113527/why-does-mulss-take-only-3-cycles-on-haswell-different-from-agners-instruction). Skylake has `vaddpd` latency of 4 cycles but 2/clock load and add throughput (scalar or SIMD), so assuming no stalls for cache misses you'd expect both versions to be equal. (A concrete sketch of the multiple-accumulator idea follows after these comments.) Also, you're going to want to use `-march=native` so GCC doesn't split unaligned loads ([Why doesn't gcc resolve \_mm256\_loadu\_pd as single vmovupd?](https://stackoverflow.com/q/52626726)) – Peter Cordes Nov 24 '20 at 08:25
  • Except that you disabled optimization, completely defeating the point of benchmarking. – Peter Cordes Nov 24 '20 at 08:26
  • After you've enabled optimisation you might also want to check whether your compiler is auto-vectorizing your scalar loop (I'd be surprised if it doesn't). – Paul R Nov 24 '20 at 08:28
  • You can also use OpenMP SIMD mode to specifically request that the loop be vectorized; a sketch follows after these comments. gcc, at least, will only vectorize loops at -O3 unless you use it (or the right additional optimization options). – Shawn Nov 24 '20 at 09:07
  • Given that your data doesn't fit in any level of a typical processor's cache hierarchy, but your accesses are nice and sequential, you should be memory bandwidth limited. If you're not, you need to enable compiler optimizations. – EOF Nov 24 '20 at 12:45
  • @PeterCordes Why would disabling optimizations defeat the point of this? I should still expect them to perform equally in this case without optimizations, right? I'm unsure why there is such a significant difference here; the result seems to be extremely consistent. – AS425 Nov 24 '20 at 20:17
  • *I should still expect them to perform equally in this case without optimizations right?* No, that doesn't follow *at all*. Anti-optimized debug mode code has *different* bottlenecks than optimized code, typically on store-reload latency from spilling every variable to memory between C statements (including at least the loop counter in your case). **Debug builds are not a uniform slowdown.** See [C loop optimization help for final assignment (with compiler optimization disabled)](https://stackoverflow.com/a/32001196) for example. – Peter Cordes Nov 24 '20 at 23:59
  • [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) is probably related to the speedup you're seeing. Also the fact that 256-bit (32-byte) store/reload has an extra cycle of store-forwarding latency vs. 8-byte `vmovsd` on Intel CPUs, IIRC, and your debug build will be keeping `accumulate` and `sum0..3` in stack memory, not registers. – Peter Cordes Nov 25 '20 at 00:01
  • @PeterCordes That makes sense thanks, I didn’t realize that gcc included a lot of debugging stuff with optimizations turned off. If it’s so bad for performance, why is it the default option? – AS425 Nov 25 '20 at 00:06
  • Because it compiles fast, and gives consistent debugging. When you're developing a program and trying to get it correct, it's a fairly good choice, that or `-Og`. [Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?](https://stackoverflow.com/q/53366394) explains exactly why `-O0` codegen is like that. When you are doing performance tuning, it's a horrible choice. – Peter Cordes Nov 25 '20 at 01:13
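
Putting the first comment's suggestion concretely, here is a minimal sketch of summing with multiple vector accumulators (the function name `sum_avx_unrolled` is illustrative, not from the thread; it assumes the array is 32-byte aligned and `n` is a multiple of 16, which holds for n = 50000000). Four independent `__m256d` sums let the adds overlap instead of each `vaddpd` waiting on the previous result:

static double sum_avx_unrolled(const double *a, int n)
{
    __m256d acc0 = _mm256_setzero_pd();
    __m256d acc1 = _mm256_setzero_pd();
    __m256d acc2 = _mm256_setzero_pd();
    __m256d acc3 = _mm256_setzero_pd();
    for (int i = 0; i < n; i += 16) {
        acc0 = _mm256_add_pd(acc0, _mm256_load_pd(a + i));
        acc1 = _mm256_add_pd(acc1, _mm256_load_pd(a + i + 4));
        acc2 = _mm256_add_pd(acc2, _mm256_load_pd(a + i + 8));
        acc3 = _mm256_add_pd(acc3, _mm256_load_pd(a + i + 12));
    }
    /* combine the four accumulators, then one horizontal sum at the end */
    __m256d acc = _mm256_add_pd(_mm256_add_pd(acc0, acc1),
                                _mm256_add_pd(acc2, acc3));
    double lane[4] __attribute__ ((aligned (32)));
    _mm256_store_pd(lane, acc);
    return lane[0] + lane[1] + lane[2] + lane[3];
}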

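And here is a sketch of the OpenMP SIMD variant Shawn mentions (again, the function name is illustrative): a plain scalar loop with a reduction clause, which asks the compiler to vectorize the loop and allows it to reassociate the floating-point adds. Build with `-fopenmp` (or `-fopenmp-simd`):

static double sum_omp_simd(const double *a, int n)
{
    double sum = 0.0;
    #pragma omp simd reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

Either way, per the comments above, any timing of these should be done with optimization enabled, e.g. `gcc -O2 -march=native -fopenmp`.
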
0 Answers