
I was trying to test how fast SSE addition is, but something is not right. I have created two arrays for inputs and one array for output on the stack and am performing the additions both ways. It's slower than the regular + operator. What am I doing wrong here:

#include <iostream>
#include <cstring>
#include <nmmintrin.h>
#include <chrono>

using namespace std;

#define USE_SSE

typedef chrono::steady_clock::time_point TimeStamp;
typedef chrono::steady_clock Clock;
int main()
{
    const int MAX = 100000 * 4;
    float in1[MAX];
    float in2[MAX];
    float out[MAX];

    memset(out,0,sizeof(float) * MAX);

    for(int i = 0 ; i < MAX ; ++i)
    {
        in1[i] = 1.0f;
        in2[i] = 1.0f;
    }

    TimeStamp start,end;
    start = Clock::now();

    for(int i = 0 ; i < MAX ; i+=4)
    {
#ifdef USE_SSE

        __m128 a = _mm_load_ps(&in1[i]);
        __m128 b = _mm_load_ps(&in2[i]);
        __m128 result = _mm_add_ps(a,b);
        _mm_store_ps(&out[i],result);
#else
        out[0] = in1[0] + in2[0];
        out[1] = in1[1] + in2[1];
        out[2] = in1[2] + in2[2];
        out[3] = in1[3] + in2[3];
#endif
    }


    end = Clock::now();
    double dt = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
    cout<<dt<<endl;

    return 0;
}

Is there a memory alignment issue here?

  • Making the SSE version traverse the arrays but not the scalar version doesn't look fair. – harold Nov 15 '18 at 05:09
  • _Any_ decent optimising compiler would've converted the non-SSE path to actually use SSE under the hood anyway. In fact it would not only _vectorise_ your loop, it would very likely unroll it as well. In that case the naive loop would very likely be faster. Compilers really are very clever and it's hard (but not impossible) to beat them. The only way to be sure of what's going on is to examine the generated code (asm). Also remember to turn optimisations up whenever timing code - timing non-optimised code is generally pointless. – Mike Vine Nov 15 '18 at 08:41

3 Answers


You have a bug in your code; the non-SSE part should read:

    out[i+0] = in1[i+0] + in2[i+0];
    out[i+1] = in1[i+1] + in2[i+1];
    out[i+2] = in1[i+2] + in2[i+2];
    out[i+3] = in1[i+3] + in2[i+3];

You should consider making your benchmark run a little longer, as measuring short time periods is unreliable. And perhaps you'll need to do something to prevent the compiler from optimizing away your code (like marking out as volatile). Always check the assembly code to be sure of what you are measuring.
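
As an illustration of that last point, one way to keep the stores from being eliminated (a minimal sketch, assuming GCC or Clang; MSVC would need a different mechanism) is an empty inline-asm barrier that tells the compiler the buffer may be read:

    // Sketch (GCC/Clang only): force the compiler to assume "p" is read,
    // so stores into the array it points to cannot be removed as dead code.
    static void keep_alive(const float *p)
    {
        asm volatile("" : : "g"(p) : "memory");
    }

    // ... call keep_alive(out); after the timed loop.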

geza

Here is a somewhat improved version of your benchmark, with bug fixes, improvements to timing, and compiler vectorization disabled for the scalar code (for gcc and clang at least):

#include <iostream>
#include <xmmintrin.h>
#include <chrono>

using namespace std;

typedef chrono::steady_clock::time_point TimeStamp;
typedef chrono::steady_clock Clock;

typedef void (*add_func)(const float *in1, const float *in2, volatile float *out, const size_t n);

// Keep the scalar version scalar: disable auto-vectorization for this
// function (gcc attribute here; the clang pragma is inside the loop below).
#ifndef __clang__
__attribute__((optimize("no-tree-vectorize")))
#endif
static void add_scalar(const float *in1, const float *in2, volatile float *out, const size_t n)
{
#ifdef __clang__
    #pragma clang loop vectorize(disable)
#endif
    for (size_t i = 0 ; i < n ; i += 4)
    {
        out[i + 0] = in1[i + 0] + in2[i + 0];
        out[i + 1] = in1[i + 1] + in2[i + 1];
        out[i + 2] = in1[i + 2] + in2[i + 2];
        out[i + 3] = in1[i + 3] + in2[i + 3];
    }
}

static void add_SIMD(const float *in1, const float *in2, volatile float *out, const size_t n)
{
    for (size_t i = 0 ; i < n ; i += 4)
    {
        __m128 a = _mm_loadu_ps(&in1[i]);
        __m128 b = _mm_loadu_ps(&in2[i]);
        __m128 result = _mm_add_ps(a, b);
        _mm_storeu_ps((float *)&out[i], result);
    }
}

static double time_func(const float *in1, const float *in2, volatile float *out, const size_t n, add_func f)
{
    const size_t kLoops = 10000;

    TimeStamp start,end;
    start = Clock::now();

    for (size_t k = 0; k < kLoops; ++k)
    {
        f(in1, in2, out, n);
    }

    end = Clock::now();

    return chrono::duration_cast<chrono::nanoseconds>(end - start).count() / ((double)kLoops * (double)n);
}

int main()
{
    const size_t n = 100000 * 4;
    float *in1 = new float[n];
    float *in2 = new float[n];
    volatile float *out = new float[n]();   // volatile: keep the stores from being optimized away

    for (size_t i = 0; i < n; ++i)
    {
        in1[i] = (float)i;
        in2[i] = 1.0f;
    }

    double t_scalar = time_func(in1, in2, out, n, add_scalar);
    double t_SIMD = time_func(in1, in2, out, n, add_SIMD);

    cout << "t_scalar = " << t_scalar << " ns / point" << endl;
    cout << "t_SIMD   = " << t_SIMD << " ns / point" << endl;
    cout << "speed-up = " << t_scalar / t_SIMD << "x" << endl;

    delete [] in1;
    delete [] in2;
    delete [] out;

    return 0;
}

I get around 1.5x to 1.6x improvement for SSE on a Haswell CPU. This is obviously less than the theoretical 4x improvement which might be possible, but the test is most likely bandwidth-constrained, since you are only performing 1 x arithmetic operation per iteration, but 2 x loads and 1 x store:

t_scalar = 0.529723 ns / point
t_SIMD   = 0.329758 ns / point
speed-up = 1.6064x
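
As a rough sanity check (back-of-envelope, assuming ~12 bytes of traffic per point, i.e. two 4-byte loads plus one 4-byte store, and ignoring any write-allocate fills on the output), the SIMD figure corresponds to roughly 12 / 0.33 ≈ 36 GB/s of sustained traffic - plausible for a ~4.8 MB working set that spills out of L1/L2. Something along these lines could be added to main():

// Hypothetical extra output for main(): convert ns/point into an approximate
// bandwidth figure, assuming 12 bytes of memory traffic per point.
double bytes_per_point = 12.0;
cout << "approx. bandwidth = " << bytes_per_point / t_SIMD << " GB/s" << endl;
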
Paul R
  • I thought Haswell had more than enough memory bandwidth for single-core AVX. Part of the problem might be latency, not bandwidth. – MSalters Nov 15 '18 at 10:13
  • @MSalters: hard to say, but it's definitely memory-related - if you drop n down to 100 * 4 so that the test is running purely out of L1 then you get the expected 4x improvement. – Paul R Nov 15 '18 at 10:27
  • @MSalters: 64 bytes loaded + 32 bytes stored per core clock cycle is a *huge* amount of bandwidth. L1d can do that in a burst, but apparently can't quite sustain it. Even L2 is a significant bottleneck for `a[i] = b[i]+c[i]`. Another problem is that a single core can't saturate DRAM, especially in a many-core Xeon with many DRAM channels, where single-core bandwidth is more limited by `max_concurrency * latency` with higher offcore latency. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) – Peter Cordes Nov 15 '18 at 11:32

Attempting to 'optimise' C++ code by adding timing loops is often pretty silly, and this is one of those cases :(

Your code LITERALLY boils down to nothing more than this:

int main()
{
    TimeStamp start = Clock::now();
    TimeStamp end = Clock::now();

    double dt = chrono::duration_cast<chrono::nanoseconds>(end-start).count();
    cout<<dt<<endl;

    return 0;
}

The compiler isn't stupid, and so it has decided to remove your inner loop (since the output is unused, and so the loop is redundant).

Even if the compiler decided to keep your loop around, you're issuing 3 memory operations for each addition. If your RAM runs at 1600MHz and your CPU at 3200MHz, then your test is simply proving that you are memory bandwidth limited. Profiling loops like this isn't useful; you'll always be better off testing a real-world situation in a profiler....

Anyhow, back to the loop in question. Let's throw the code into compiler explorer and play around with some options...

https://godbolt.org/z/5SJQHb

F0: Just a basic, boring C loop.

for(int i = 0 ; i < MAX ; i++)
{
    out[i] = in1[i] + in2[i];
}

The compiler outputs this inner loop:

vmovups ymm0,YMMWORD PTR [rsi+r8*4]
vmovups ymm1,YMMWORD PTR [rsi+r8*4+0x20]
vmovups ymm2,YMMWORD PTR [rsi+r8*4+0x40]
vmovups ymm3,YMMWORD PTR [rsi+r8*4+0x60]
vaddps ymm0,ymm0,YMMWORD PTR [rdx+r8*4]
vaddps ymm1,ymm1,YMMWORD PTR [rdx+r8*4+0x20]
vaddps ymm2,ymm2,YMMWORD PTR [rdx+r8*4+0x40]
vaddps ymm3,ymm3,YMMWORD PTR [rdx+r8*4+0x60]
vmovups YMMWORD PTR [rdi+r8*4],ymm0
vmovups YMMWORD PTR [rdi+r8*4+0x20],ymm1
vmovups YMMWORD PTR [rdi+r8*4+0x40],ymm2
vmovups YMMWORD PTR [rdi+r8*4+0x60],ymm3

Unrolled, dealing with 32 floats per iteration (in AVX2) [+ extra code to handle up to 31 leftover elements at the end of the loop]
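
Conceptually (this is a sketch of the shape of the generated code, not the literal output), the compiler splits the loop into a wide main body and a scalar epilogue:

// Sketch only: vectorised main loop plus scalar clean-up for the remainder.
int i = 0;
for (; i + 32 <= MAX; i += 32)
{
    // four 8-float AVX additions, as in the ymm code above
}
for (; i < MAX; ++i)        // handles up to 31 leftover elements
    out[i] = in1[i] + in2[i];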

F1: Your SSE 'optimised' loop above. (Obviously this code doesn't handle up to 3 elements at the end of the loop)

for(int i = 0 ; i < MAX ; i+=4)
{
    __m128 a = _mm_load_ps(&in1[i]);
    __m128 b = _mm_load_ps(&in2[i]);
    __m128 result = _mm_add_ps(a,b);
    _mm_store_ps(&out[i],result);
}

This outputs:

vmovaps xmm0,XMMWORD PTR [rsi+rcx*4]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4]
vmovaps XMMWORD PTR [rdi+rcx*4],xmm0
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4+0x10]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x10]
vmovaps XMMWORD PTR [rdi+rcx*4+0x10],xmm0
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4+0x20]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x20]
vmovaps XMMWORD PTR [rdi+rcx*4+0x20],xmm0
vmovaps xmm0,XMMWORD PTR [rsi+rcx*4+0x30]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x30]
vmovaps XMMWORD PTR [rdi+rcx*4+0x30],xmm0

So the compiler has unrolled the loop, but it has fallen back to SSE (as requested), so it's now half the throughput of the original loop (not quite true - memory bandwidth will be the limiting factor here).
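
For comparison, the hand-written equivalent at AVX width would look something like this (a sketch, assuming AVX is available via <immintrin.h> and using unaligned loads/stores; it would most likely still be memory bound):

// Sketch: the same addition loop, but 8 floats per iteration with AVX intrinsics.
for (int i = 0; i < MAX; i += 8)
{
    __m256 a = _mm256_loadu_ps(&in1[i]);
    __m256 b = _mm256_loadu_ps(&in2[i]);
    _mm256_storeu_ps(&out[i], _mm256_add_ps(a, b));
}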

F2: Your manually unrolled C++ loop (with the indices corrected; it still fails to handle the last 3 elements)

for(int i = 0 ; i < MAX ; i += 4)
{
    out[i + 0] = in1[i + 0] + in2[i + 0];
    out[i + 1] = in1[i + 1] + in2[i + 1];
    out[i + 2] = in1[i + 2] + in2[i + 2];
    out[i + 3] = in1[i + 3] + in2[i + 3];
}

And the output:

vmovss xmm0,DWORD PTR [rsi+rax*4]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4]
vmovss DWORD PTR [rdi+rax*4],xmm0
vmovss xmm0,DWORD PTR [rsi+rax*4+0x4]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4+0x4]
vmovss DWORD PTR [rdi+rax*4+0x4],xmm0
vmovss xmm0,DWORD PTR [rsi+rax*4+0x8]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4+0x8]
vmovss DWORD PTR [rdi+rax*4+0x8],xmm0
vmovss xmm0,DWORD PTR [rsi+rax*4+0xc]
vaddss xmm0,xmm0,DWORD PTR [rdx+rax*4+0xc]
vmovss DWORD PTR [rdi+rax*4+0xc],xmm0

Well, this has completely failed to vectorise! It's simply processing one addition at a time. This is usually down to pointer aliasing, so I'll change the function prototype from this:

void func(float* out, const float* in1, const float* in2, int MAX);

to this: (F4)

void func(
    float* __restrict out, 
    const float* __restrict in1, 
    const float* __restrict in2, 
    int MAX);

and now the compiler will output something that is vectorised:

vmovups xmm0,XMMWORD PTR [rsi+rcx*4]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4]
vmovups xmm1,XMMWORD PTR [rsi+rcx*4+0x10]
vaddps xmm1,xmm1,XMMWORD PTR [rdx+rcx*4+0x10]
vmovups XMMWORD PTR [rdi+rcx*4],xmm0
vmovups xmm0,XMMWORD PTR [rsi+rcx*4+0x20]
vaddps xmm0,xmm0,XMMWORD PTR [rdx+rcx*4+0x20]
vmovups XMMWORD PTR [rdi+rcx*4+0x10],xmm1
vmovups xmm1,XMMWORD PTR [rsi+rcx*4+0x30]
vaddps xmm1,xmm1,XMMWORD PTR [rdx+rcx*4+0x30]
vmovups XMMWORD PTR [rdi+rcx*4+0x20],xmm0
vmovups XMMWORD PTR [rdi+rcx*4+0x30],xmm1

HOWEVER this code is still half the performance of the first version....

robthebloke
  • DRAM clock speed is one potential bottleneck for memory bandwidth, but limited memory-parallelism from one core is another: a single core can't even saturate dual-channel memory on a modern desktop CPU, and it's far worse on a many-core Xeon: higher potential aggregate bandwidth, but *lower* per-core max bandwidth. [Why is Skylake so much better than Broadwell-E for single-threaded memory throughput?](https://stackoverflow.com/q/39260020) Anyway, good answer, but you should mention that you're showing (I think) clang output (not gcc). I can tell because gcc doesn't unroll at all by default. – Peter Cordes Nov 23 '18 at 05:47