
Let's assume that we have a function that multiplies two arrays of 1000000 doubles each. In C/C++ the function looks like this:

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

The compiler produces the following assembly with -O2:

mul_c(double*, double*):
        xor     eax, eax
.L2:
        movsd   xmm0, QWORD PTR [rdi+rax]
        mulsd   xmm0, QWORD PTR [rsi+rax]
        movsd   QWORD PTR [rdi+rax], xmm0
        add     rax, 8
        cmp     rax, 8000000
        jne     .L2
        rep ret

From the above assembly it seems that the compiler uses SSE registers, but it only multiplies one double per iteration (mulsd is the scalar form). So I decided to write the same function in inline assembly instead, where I make full use of the xmm0 register and multiply two doubles in one go:

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix             \n\t"
        "xor    rax, rax                    \n\t"
        "0:                                 \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0 \n\t"
        "add    rax, 16                     \n\t"
        "cmp    rax, 8000000                \n\t"
        "jne    0b                          \n\t"
        ".att_syntax noprefix               \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

After measuring the execution time individually for both of these functions, it seems that both of them take 1 ms to complete:

> gcc -O2 main.cpp
> ./a.out < input

mul_c: 1 ms
mul_asm: 1 ms

[a lot of doubles...]

I expected the SIMD implementation to be at least twice as fast (showing 0 ms at this resolution), as it performs only half the number of multiplications and memory instructions.

So my question is: why isn't the SIMD implementation faster than the ordinary C/C++ implementation when it only does half the number of multiplications/memory instructions?

Here's the full program:

#include <stdio.h>
#include <stdlib.h>
#include <sys/time.h>

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix             \n\t"
        "xor    rax, rax                    \n\t"
        "0:                                 \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0 \n\t"
        "add    rax, 16                     \n\t"
        "cmp    rax, 8000000                \n\t"
        "jne    0b                          \n\t"
        ".att_syntax noprefix               \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

int main()
{
    struct timeval t1;
    struct timeval t2;
    unsigned long long time;

    double* a = (double*)malloc(sizeof(double) * 1000000);
    double* b = (double*)malloc(sizeof(double) * 1000000);
    double* c = (double*)malloc(sizeof(double) * 1000000);

    for (int i = 0; i != 1000000; ++i)
    {
        double v;
        scanf("%lf", &v);
        a[i] = v;
        b[i] = v;
        c[i] = v;
    }

    gettimeofday(&t1, NULL);
    mul_c(a, b);
    gettimeofday(&t2, NULL);
    time = 1000 * (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1000;
    printf("mul_c: %llu ms\n", time);

    gettimeofday(&t1, NULL);
    mul_asm(b, c);
    gettimeofday(&t2, NULL);
    time = 1000 * (t2.tv_sec - t1.tv_sec) + (t2.tv_usec - t1.tv_usec) / 1000;
    printf("mul_asm: %llu ms\n\n", time);

    for (int i = 0; i != 1000000; ++i)
    {
        printf("%lf\t\t\t%lf\n", a[i], b[i]);
    }

    return 0;
}

I also tried to make use of all the xmm registers (0-7) and remove instruction dependencies to get better parallel computation:

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix                 \n\t"
        "xor    rax, rax                        \n\t"
        "0:                                     \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax]     \n\t"
        "movupd xmm1, xmmword ptr [rdi+rax+16]  \n\t"
        "movupd xmm2, xmmword ptr [rdi+rax+32]  \n\t"
        "movupd xmm3, xmmword ptr [rdi+rax+48]  \n\t"
        "movupd xmm4, xmmword ptr [rdi+rax+64]  \n\t"
        "movupd xmm5, xmmword ptr [rdi+rax+80]  \n\t"
        "movupd xmm6, xmmword ptr [rdi+rax+96]  \n\t"
        "movupd xmm7, xmmword ptr [rdi+rax+112] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax]     \n\t"
        "mulpd  xmm1, xmmword ptr [rsi+rax+16]  \n\t"
        "mulpd  xmm2, xmmword ptr [rsi+rax+32]  \n\t"
        "mulpd  xmm3, xmmword ptr [rsi+rax+48]  \n\t"
        "mulpd  xmm4, xmmword ptr [rsi+rax+64]  \n\t"
        "mulpd  xmm5, xmmword ptr [rsi+rax+80]  \n\t"
        "mulpd  xmm6, xmmword ptr [rsi+rax+96]  \n\t"
        "mulpd  xmm7, xmmword ptr [rsi+rax+112] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0     \n\t"
        "movupd xmmword ptr [rdi+rax+16], xmm1  \n\t"
        "movupd xmmword ptr [rdi+rax+32], xmm2  \n\t"
        "movupd xmmword ptr [rdi+rax+48], xmm3  \n\t"
        "movupd xmmword ptr [rdi+rax+64], xmm4  \n\t"
        "movupd xmmword ptr [rdi+rax+80], xmm5  \n\t"
        "movupd xmmword ptr [rdi+rax+96], xmm6  \n\t"
        "movupd xmmword ptr [rdi+rax+112], xmm7 \n\t"
        "add    rax, 128                        \n\t"
        "cmp    rax, 8000000                    \n\t"
        "jne    0b                              \n\t"
        ".att_syntax noprefix                   \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

But it still runs at 1 ms, the same speed as the ordinary C/C++ implementation.


UPDATES

As suggested by answers/comments, I've implemented another way of measuring the execution time:

#include <stdio.h>
#include <stdlib.h>

void mul_c(double* a, double* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

void mul_asm(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix             \n\t"
        "xor    rax, rax                    \n\t"
        "0:                                 \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0 \n\t"
        "add    rax, 16                     \n\t"
        "cmp    rax, 8000000                \n\t"
        "jne    0b                          \n\t"
        ".att_syntax noprefix               \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

void mul_asm2(double* a, double* b)
{
    asm volatile
    (
        ".intel_syntax noprefix                 \n\t"
        "xor    rax, rax                        \n\t"
        "0:                                     \n\t"
        "movupd xmm0, xmmword ptr [rdi+rax]     \n\t"
        "movupd xmm1, xmmword ptr [rdi+rax+16]  \n\t"
        "movupd xmm2, xmmword ptr [rdi+rax+32]  \n\t"
        "movupd xmm3, xmmword ptr [rdi+rax+48]  \n\t"
        "movupd xmm4, xmmword ptr [rdi+rax+64]  \n\t"
        "movupd xmm5, xmmword ptr [rdi+rax+80]  \n\t"
        "movupd xmm6, xmmword ptr [rdi+rax+96]  \n\t"
        "movupd xmm7, xmmword ptr [rdi+rax+112] \n\t"
        "mulpd  xmm0, xmmword ptr [rsi+rax]     \n\t"
        "mulpd  xmm1, xmmword ptr [rsi+rax+16]  \n\t"
        "mulpd  xmm2, xmmword ptr [rsi+rax+32]  \n\t"
        "mulpd  xmm3, xmmword ptr [rsi+rax+48]  \n\t"
        "mulpd  xmm4, xmmword ptr [rsi+rax+64]  \n\t"
        "mulpd  xmm5, xmmword ptr [rsi+rax+80]  \n\t"
        "mulpd  xmm6, xmmword ptr [rsi+rax+96]  \n\t"
        "mulpd  xmm7, xmmword ptr [rsi+rax+112] \n\t"
        "movupd xmmword ptr [rdi+rax], xmm0     \n\t"
        "movupd xmmword ptr [rdi+rax+16], xmm1  \n\t"
        "movupd xmmword ptr [rdi+rax+32], xmm2  \n\t"
        "movupd xmmword ptr [rdi+rax+48], xmm3  \n\t"
        "movupd xmmword ptr [rdi+rax+64], xmm4  \n\t"
        "movupd xmmword ptr [rdi+rax+80], xmm5  \n\t"
        "movupd xmmword ptr [rdi+rax+96], xmm6  \n\t"
        "movupd xmmword ptr [rdi+rax+112], xmm7 \n\t"
        "add    rax, 128                        \n\t"
        "cmp    rax, 8000000                    \n\t"
        "jne    0b                              \n\t"
        ".att_syntax noprefix                   \n\t"

        : 
        : "D" (a), "S" (b)
        : "memory", "cc"
    );
}

unsigned long timestamp()
{
    unsigned long a;

    asm volatile
    (
        ".intel_syntax noprefix \n\t"
        "xor   rax, rax         \n\t"
        "xor   rdx, rdx         \n\t"
        "RDTSCP                 \n\t"
        "shl   rdx, 32          \n\t"
        "or    rax, rdx         \n\t"
        ".att_syntax noprefix   \n\t"

        : "=a" (a)
        : 
        : "memory", "cc"
    );

    return a;
}

int main()
{
    unsigned long t1;
    unsigned long t2;

    double* a;
    double* b;

    a = (double*)malloc(sizeof(double) * 1000000);
    b = (double*)malloc(sizeof(double) * 1000000);

    for (int i = 0; i != 1000000; ++i)
    {
        double v;
        scanf("%lf", &v);
        a[i] = v;
        b[i] = v;
    }

    t1 = timestamp();
    mul_c(a, b);
    //mul_asm(a, b);
    //mul_asm2(a, b);
    t2 = timestamp();
    printf("mul_c: %lu cycles\n\n", t2 - t1);

    for (int i = 0; i != 1000000; ++i)
    {
        printf("%lf\t\t\t%lf\n", a[i], b[i]);
    }

    return 0;
}

When I run the program with this measurement, I get this result:

mul_c:    ~2163971628 cycles
mul_asm:  ~2532045184 cycles
mul_asm2: ~5230488    cycles <-- what???

Two things are worth noting here. First of all, the cycle counts vary a LOT, and I assume that's because the operating system allows other processes to run in between. Is there any way to prevent that, or to only count the cycles while my program is executing? Also, mul_asm2 produces identical output compared to the other two, but it is so much faster, how?


I tried Z boson's program on my system together with my 2 implementations and got the following result:

> g++ -O2 -fopenmp main.cpp
> ./a.out
mul         time 1.33, 18.08 GB/s
mul_SSE     time 1.13, 21.24 GB/s
mul_SSE_NT  time 1.51, 15.88 GB/s
mul_SSE_OMP time 0.79, 30.28 GB/s
mul_SSE_v2  time 1.12, 21.49 GB/s
mul_v2      time 1.26, 18.99 GB/s
mul_asm     time 1.12, 21.50 GB/s
mul_asm2    time 1.09, 22.08 GB/s
  • 1
    Your timing calculations aren't precise enough for this kind of benchmark. Try running the code with the [Google Benchmark library](https://github.com/google/benchmark) and see what you find out. – Daniel Kamil Kozar Mar 23 '17 at 00:04
  • 2
    You need more loop iterations to measure it better; use a high-resolution timer or RDTSC/RDTSCP. The 1 ms you have is noise. – Anty Mar 23 '17 at 00:15
  • 6
    For example, you may be bottlenecked by memory. – Jester Mar 23 '17 at 00:15
  • 4
    Additionally use -O3 and you will have `mulpd xmm0, XMMWORD PTR [rcx+rax]` for C version. – Anty Mar 23 '17 at 00:22
  • This is apparently not C, but C++. Use the correct tags. – too honest for this site Mar 23 '17 at 00:37
  • 4
    You are absolutely bottlenecked by memory here. – Cory Nelson Mar 23 '17 at 00:44
  • 1
    There are 16 XMM registers in 64-bit mode. You're changing the values of registers in your inline assembly without telling the compiler, so the code it generates might end up being garbage. – Ross Ridge Mar 23 '17 at 00:45
  • @Jester Good point. How can I let these functions work without being bottlenecked by memory? – fighting_falcon93 Mar 23 '17 at 02:41
  • @RossRidge Do you mean that I should add these registers to the clobbers list? Also, is there any way to tell the compiler that I've used all `xmm` registers or must I specify them one-by-one? And can I make it run even faster if I make each iteration load/mul/store 16 registers at a time or will it be as fast as just using less registers? – fighting_falcon93 Mar 23 '17 at 02:44
  • @DanielKamilKozar Sounds interesting, I checked out the link, although the only problem is that I have very little experience using GitHub (actually non-existent, I know, ouch), so I don't know how I should use it and which folders I should download? – fighting_falcon93 Mar 23 '17 at 02:46
  • 1
    You need to add RAX and each individual XMM register you modify to the clobbers. Using all 16 XMM registers probably won't make a difference because you're probably memory bound. – Ross Ridge Mar 23 '17 at 03:55
  • @CoryNelson, I tested the OP's code (see my answer) and I see a significant improvement with vectorization. I don't know why. The total memory read and written should be 8*1000000*(2+1) = 24 MB (two reads, one write). So I agree it should be fairly memory bandwidth bound. – Z boson Mar 23 '17 at 13:33
  • What is your CPU generation? Which platform do you use? BTW, if your CPU supports `AVX`, mixing might incur an `SSE/AVX` transition penalty. Another point: use intrinsics. Inline assembly might not be faster than intrinsics, as discussed in [this paper](linkinghub.elsevier.com/retrieve/pii/S0141933116302502). – Amiri Mar 23 '17 at 18:25
  • @RossRidge You're right. When I tried Z boson's program with my `mul_asm` function it got stuck in the loop. As it turns out, the compiler was using `eax` as the loop counter, while I used `rax` for the address offset, which made the loop do (many) more iterations than it should. Now it's fixed ;) – fighting_falcon93 Mar 24 '17 at 02:01
  • @FackedDeveloper The system I'm working on right now has an Intel Xeon X3470, and since the command `lscpu` only lists `sse4_1` and `sse4_2` and not `avx` or `avx2` I'd assume it has no `AVX` support. The system I'll run it on later has an Intel Xeon E5-2660, and it has support for `AVX2` but not `AVX-512`. The main reason I use inline assembly over intrinsics is because I want to learn to use `SIMD` instructions in assembly. – fighting_falcon93 Mar 24 '17 at 02:20
  • @CoryNelson, the reason I saw a big improvement with vectors was due to [a bug in the timing function I was using](https://stackoverflow.com/questions/43256496/avx-scalar-operations-are-much-faster/43277581#43277581). – Z boson Apr 10 '17 at 12:06

3 Answers

11

There was a major bug in the timing function I used for previous benchmarks. This grossly underestimated the bandwidth without vectorization as well as other measurements. Additionally, there was another problem that was overestimating the bandwidth due to COW on the array that was read but not written to. Finally, the maximum bandwidth I used was incorrect. I have updated my answer with the corrections and I have left the old answer at the end of this answer.


Your operation is memory bandwidth bound. This means the CPU is spending most of its time waiting on slow memory reads and writes. An excellent explanation for this can be found here: Why vectorizing the loop does not have performance improvement.

However, I have to disagree slightly with one statement in that answer.

So regardless of how it's optimized, (vectorized, unrolled, etc...) it isn't gonna get much faster.

In fact, vectorization, unrolling, and multiple threads can significantly increase the bandwidth even in memory bandwidth bound operations. The reason is that it is difficult to obtain the maximum memory bandwidth. A good explanation for this can be found here: https://stackoverflow.com/a/25187492/2542702.

The rest of my answer will show how vectorization and multiple threads can get closer to the maximum memory bandwidth.

My test system: Ubuntu 16.10, Skylake (i7-6700HQ @ 2.60 GHz), 32 GB RAM, dual-channel DDR4-2400. The maximum bandwidth of my system is 38.4 GB/s.
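
As a quick sanity check on the "bandwidth bound" claim: one call of your function reads a and b and writes a back, i.e. about 24 MB of traffic for 1,000,000 doubles, so even at my 38.4 GB/s peak a call cannot take much less than ~0.6 ms, no matter how the multiplications are issued. A minimal sketch of that estimate (the numbers are my system's, adjust them for yours):

#include <stdio.h>

int main(void)
{
    // roofline-style estimate: traffic per call vs. peak DRAM bandwidth
    double n        = 1000000;        // doubles per array
    double bytes    = 3 * 8 * n;      // 2 reads + 1 write, 8 bytes per double
    double max_bw   = 38.4e9;         // bytes per second (dual-channel DDR4-2400)
    double min_time = bytes / max_bw; // lower bound if purely bandwidth-bound

    printf("traffic  : %.0f MB per call\n", bytes * 1e-6);
    printf("min time : %.2f ms at peak bandwidth\n", min_time * 1e3);
    return 0;
}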

From the code below I produce the following tables. I set the number of threads using OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4. The efficiency is bandwidth/max_bandwidth.

-O2 -march=native -fopenmp
Threads  Efficiency
1        59.2%
2        76.6%
4        74.3%
8        70.7%

-O2 -march=native -fopenmp -funroll-loops
Threads  Efficiency
1        55.8%
2        76.5%
4        72.1%
8        72.2%

-O3 -march=native -fopenmp
Threads  Efficiency
1        63.9%
2        74.6%
4        63.9%
8        63.2%

-O3 -march=native -fopenmp -mprefer-avx128
Threads  Efficiency
1        67.8%
2        76.0%
4        63.9%
8        63.2%

-O3 -march=native -fopenmp -mprefer-avx128 -funroll-loops
Threads  Efficiency
1        68.8%
2        73.9%
4        69.0%
8        66.8%

After several runs (there is some uncertainty in the measurements) I have formed the following conclusions:

  • single threaded scalar operations get more than 50% of the bandwidth.
  • two threaded scalar operations get the highest bandwidth.
  • single threaded vector operations are faster than single threaded scalar operations.
  • single threaded SSE operations are faster than single threaded AVX operations.
  • unrolling is not helpful.
  • unrolling single-threaded operations is slower than without unrolling.
  • more threads than cores (Hyper-Threading) gives a lower bandwidth.

The solution that gives the best bandwidth is scalar operations with two threads.

The code I used to benchmark:

#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

#define N 10000000
#define R 100

void mul(double *a, double *b) {
  #pragma omp parallel for
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

int main() {
  double maxbw = 2.4*2*8; // 2400 MT/s * 2 channels * 8 bytes = 38.4 GB/s
  double mem = 3*sizeof(double)*N*R*1E-9; // GB

  double *a = (double*)malloc(sizeof *a * N);
  double *b = (double*)malloc(sizeof *b * N);

  //due to copy-on-write b must be initialized to get the correct bandwidth
  //also, GCC will convert malloc + memset(0) to calloc so use memset(1)
  memset(b, 1, sizeof *b * N);

  double dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) mul(a,b);
  dtime += omp_get_wtime();
  printf("%.2f s, %.1f GB/s, %.1f%%\n", dtime, mem/dtime, 100*mem/dtime/maxbw);

  free(a), free(b);
}
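
For reference, this is roughly how I build and run it; the flags match the table headers above and the thread count comes from the environment (the file name is just whatever you saved the code as):

g++ -O2 -march=native -fopenmp main.cpp
export OMP_NUM_THREADS=2
./a.out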

The old solution with the timing bug

The modern alternative to inline assembly is intrinsics. There are still cases where one needs inline assembly, but this is not one of them.

One intrinsics equivalent of your inline assembly approach is simply:

void mul_SSE(double*  a, double*  b) {
  for (int i = 0; i<N/2; i++) 
      _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}
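
An AVX version of the same idea would look like the sketch below; I did not benchmark this variant, and it assumes 32-byte-aligned arrays (e.g. from _mm_malloc(..., 32)), N divisible by 4, and compilation with -mavx or -march=native:

#include <immintrin.h>

void mul_AVX(double* a, double* b) {
  // 4 doubles per 256-bit register; N as defined in the test code below
  for (int i = 0; i<N/4; i++)
      _mm256_store_pd(&a[4*i], _mm256_mul_pd(_mm256_load_pd(&a[4*i]),_mm256_load_pd(&b[4*i])));
}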

Let me define some test code:

#include <x86intrin.h>
#include <string.h>
#include <stdio.h>
#include <omp.h>

#define N 1000000
#define R 1000

typedef __attribute__(( aligned(32)))  double aligned_double;
void  (*fp)(aligned_double *a, aligned_double *b);

void mul(aligned_double* __restrict a, aligned_double* __restrict b) {
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

void mul_SSE(double*  a, double*  b) {
  for (int i = 0; i<N/2; i++) _mm_store_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

void mul_SSE_NT(double*  a, double*  b) {
  for (int i = 0; i<N/2; i++) _mm_stream_pd(&a[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
}

void mul_SSE_OMP(double*  a, double*  b) {
  #pragma omp parallel for
  for (int i = 0; i<N; i++) a[i] *= b[i];
}

void test(aligned_double *a, aligned_double *b, const char *name) {
  double dtime;
  const double mem = 3*sizeof(double)*N*R/1024/1024/1024;
  const double maxbw = 34.1;
  dtime = -omp_get_wtime();
  for(int i=0; i<R; i++) fp(a,b);
  dtime += omp_get_wtime();
  printf("%s \t time %.2f s, %.1f GB/s, efficency %.1f%%\n", name, dtime, mem/dtime, 100*mem/dtime/maxbw);
}

int main() {
  double *a = (double*)_mm_malloc(sizeof *a * N, 32);
  double *b = (double*)_mm_malloc(sizeof *b * N, 32);

  //b must be initialized to get the correct bandwidth!!!
  memset(a, 1, sizeof *a * N);
  memset(b, 1, sizeof *a * N);

  fp = mul,         test(a,b, "mul        ");
  fp = mul_SSE,     test(a,b, "mul_SSE    ");
  fp = mul_SSE_NT,  test(a,b, "mul_SSE_NT ");
  fp = mul_SSE_OMP, test(a,b, "mul_SSE_OMP");

  _mm_free(a), _mm_free(b);
}

Now the first test

g++ -O2 -fopenmp test.cpp
./a.out
mul              time 1.67 s, 13.1 GB/s, efficiency 38.5%
mul_SSE          time 1.00 s, 21.9 GB/s, efficiency 64.3%
mul_SSE_NT       time 1.05 s, 20.9 GB/s, efficiency 61.4%
mul_SSE_OMP      time 0.74 s, 29.7 GB/s, efficiency 87.0%

So with -O2, which does not vectorize loops, we see that the intrinsic SSE version is much faster than the plain C solution mul. Here efficiency = bandwidth_measured/max_bandwidth, where the max is 34.1 GB/s for my system.

Second test

g++ -O3 -fopenmp test.cpp
./a.out
mul              time 1.05 s, 20.9 GB/s, efficiency 61.2%
mul_SSE          time 0.99 s, 22.3 GB/s, efficiency 65.3%
mul_SSE_NT       time 1.01 s, 21.7 GB/s, efficiency 63.7%
mul_SSE_OMP      time 0.68 s, 32.5 GB/s, efficiency 95.2%

With -O3 GCC vectorizes the loop, and the intrinsic function offers essentially no advantage.

Third test

g++ -O3 -fopenmp -funroll-loops test.cpp
./a.out
mul              time 0.85 s, 25.9 GB/s, efficiency 76.1%
mul_SSE          time 0.84 s, 26.2 GB/s, efficiency 76.7%
mul_SSE_NT       time 1.06 s, 20.8 GB/s, efficiency 61.0%
mul_SSE_OMP      time 0.76 s, 29.0 GB/s, efficiency 85.0%

With -funroll-loops GCC unrolls the loops eight times and we see a significant improvement, except for the non-temporal store solution, and no real advantage for the OpenMP solution.

Before unrolling the loop, the assembly for mul with -O3 is:

    xor     eax, eax
.L2:
    movupd  xmm0, XMMWORD PTR [rsi+rax]
    mulpd   xmm0, XMMWORD PTR [rdi+rax]
    movaps  XMMWORD PTR [rdi+rax], xmm0
    add     rax, 16
    cmp     rax, 8000000
    jne     .L2
    rep ret

With -O3 -funroll-loops the assembly for mul is:

   xor     eax, eax
.L2:
    movupd  xmm0, XMMWORD PTR [rsi+rax]
    movupd  xmm1, XMMWORD PTR [rsi+16+rax]
    mulpd   xmm0, XMMWORD PTR [rdi+rax]
    movupd  xmm2, XMMWORD PTR [rsi+32+rax]
    mulpd   xmm1, XMMWORD PTR [rdi+16+rax]
    movupd  xmm3, XMMWORD PTR [rsi+48+rax]
    mulpd   xmm2, XMMWORD PTR [rdi+32+rax]
    movupd  xmm4, XMMWORD PTR [rsi+64+rax]
    mulpd   xmm3, XMMWORD PTR [rdi+48+rax]
    movupd  xmm5, XMMWORD PTR [rsi+80+rax]
    mulpd   xmm4, XMMWORD PTR [rdi+64+rax]
    movupd  xmm6, XMMWORD PTR [rsi+96+rax]
    mulpd   xmm5, XMMWORD PTR [rdi+80+rax]
    movupd  xmm7, XMMWORD PTR [rsi+112+rax]
    mulpd   xmm6, XMMWORD PTR [rdi+96+rax]
    movaps  XMMWORD PTR [rdi+rax], xmm0
    mulpd   xmm7, XMMWORD PTR [rdi+112+rax]
    movaps  XMMWORD PTR [rdi+16+rax], xmm1
    movaps  XMMWORD PTR [rdi+32+rax], xmm2
    movaps  XMMWORD PTR [rdi+48+rax], xmm3
    movaps  XMMWORD PTR [rdi+64+rax], xmm4
    movaps  XMMWORD PTR [rdi+80+rax], xmm5
    movaps  XMMWORD PTR [rdi+96+rax], xmm6
    movaps  XMMWORD PTR [rdi+112+rax], xmm7
    sub     rax, -128
    cmp     rax, 8000000
    jne     .L2
    rep ret

Fourth test

g++ -O3 -fopenmp -mavx test.cpp
./a.out
mul              time 0.87 s, 25.3 GB/s, efficiency 74.3%
mul_SSE          time 0.88 s, 24.9 GB/s, efficiency 73.0%
mul_SSE_NT       time 1.07 s, 20.6 GB/s, efficiency 60.5%
mul_SSE_OMP      time 0.76 s, 29.0 GB/s, efficiency 85.2%

Now the non-intrinsic function is the fastest (excluding the OpenMP version).

So there is no reason to use intrinsics or inline assembly in this case because we can get the best performance with appropriate compiler options (e.g. -O3, -funroll-loops, -mavx).

Test system: Ubuntu 16.10, Skylake (i7-6700HQ@2.60GHz), 32GB RAM. Maximum memory bandwidth (34.1 GB/s) https://ark.intel.com/products/88967/Intel-Core-i7-6700HQ-Processor-6M-Cache-up-to-3_50-GHz


Here is another solution worth considering. The cmp instruction is not necessary if we count from -N up to zero and access the arrays as N+i. GCC should have fixed this a long time ago. It eliminates one instruction (though due to macro-op fusion the cmp and jne often count as one micro-op).

void mul_SSE_v2(double*  a, double*  b) {
  for (ptrdiff_t i = -N; i<0; i+=2)
    _mm_store_pd(&a[N + i], _mm_mul_pd(_mm_load_pd(&a[N + i]),_mm_load_pd(&b[N + i])));
}

Assembly with -O3

mul_SSE_v2(double*, double*):
    mov     rax, -1000000
.L9:
    movapd  xmm0, XMMWORD PTR [rdi+8000000+rax*8]
    mulpd   xmm0, XMMWORD PTR [rsi+8000000+rax*8]
    movaps  XMMWORD PTR [rdi+8000000+rax*8], xmm0
    add     rax, 2
    jne     .L9
    rep ret

This optimization will possibly only be helpful if the arrays fit in e.g. the L1 cache, i.e. when not reading from main memory.


I finally found a way to get the plain C solution to not generate the cmp instruction.

void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b) {
  for (int i = -N; i<0; i++) a[i] *= b[i];
}

Then call the function from a separate object file, like this: mul_v2(&a[N], &b[N]). This is perhaps the best solution. However, if you call the function from the same object file (translation unit) as the one it is defined in, GCC generates the cmp instruction again.
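
A sketch of what I mean, with purely illustrative file and function names:

// caller.cpp -- a separate translation unit from the one that defines mul_v2
#define N 1000000
typedef __attribute__((aligned(32))) double aligned_double;
void mul_v2(aligned_double* __restrict a, aligned_double* __restrict b);

void run(aligned_double* a, aligned_double* b)
{
    // a and b each hold N doubles; mul_v2 indexes them with -N..-1,
    // so pass pointers one element past the end
    mul_v2(&a[N], &b[N]);
}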

Also,

void mul_v3(aligned_double* __restrict a, aligned_double* __restrict b) {
  for (int i = -N; i<0; i++) a[N+i] *= b[N+i];
}

still generates the cmp instruction and produces the same assembly as the mul function.


The function mul_SSE_NT is silly. It uses non-temporal stores, which are only useful when the memory being written is not read again soon; since the function reads and writes the same addresses, non-temporal stores are not only useless, they give inferior results.
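
For contrast, non-temporal stores would make sense if the product went to a third array that the loop never reads, along the lines of this sketch (same includes and N as the test code above; c must be 16-byte aligned, and the _mm_sfence() orders the streaming stores before the function returns):

void mul_to_c_NT(double* a, double* b, double* c) {
  for (int i = 0; i<N/2; i++)
      _mm_stream_pd(&c[2*i], _mm_mul_pd(_mm_load_pd(&a[2*i]),_mm_load_pd(&b[2*i])));
  _mm_sfence();  // make the non-temporal stores globally visible
}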


Previous versions of this answer were getting the wrong bandwidth. The reason was that the arrays were not initialized.

Z boson
  • I tried your program on my system together with my 2 implementations and I added the result to the opening question. I like this answer very much as it's very detailed and also provides code and measurements as comparison, although before I accept this as an answer, I'd just like a clarification on the question itself. How come that the ordinary C/C++ implementation runs at 1.33 (on my system), while the SIMD implementation runs at 1.09? Is this because it's memory bound, and if yes, how does one know when your program is memory bound? Are there any ways to optimize this? – fighting_falcon93 Mar 24 '17 at 01:46
  • Also, how can `mul_SSE_OMP` be so fast? I put the function's code into godbolt.org but it produces the same assembly code as my `mul_asm` function, with the only exception that it uses `movapd` and `movaps` instead of `movupd`. I tried to change my function to use these instructions as well, but there was no difference in speed? – fighting_falcon93 Mar 24 '17 at 01:52
  • @fighting_falcon93, as I explained in test 2 if you compile with `-O3` the ordinary C implementation is as fast as the SSE version because they both use SIMD. – Z boson Mar 25 '17 at 10:55
  • @fighting_falcon93, Maybe you mean to ask how the SIMD version is faster than the scalar version if the function is memory bandwidth bound? That's a good question. Many people here immediately said it's memory bandwidth bound so the SIMD version should not be faster than the scalar version. That's not what I see. Being bound by memory bandwidth is not necessarily saying what you see is the maximum memory bandwidth. I get closest to the maximum memory bandwidth using SIMD and multiple threads (using OpenMP). See my updated answer at the end. – Z boson Mar 25 '17 at 10:57
  • I'm sorry, maybe I was a bit unclear. What I meant was that when compiling with `-O2` the compiler only puts 1 `double` in the `xmm0` register each iteration. This causes it to make twice as many iterations, twice as many multiplications and twice as many memory operations, but even so, it's still not twice as slow compared to my implementation in inline assembly. I know that `-O3` optimizes this, but the question's main focus is on `-O2` and why the speed difference is so small even though my implementation does only half of the work. – fighting_falcon93 Mar 25 '17 at 16:13
  • 1
    @fighting_falcon93, because your operation is memory bandwidth bound, so it does not scale with the number of SIMD lanes or the number of threads. However, it still can benefit from multiple threads, unrolling, and SIMD. That's the part that most people don't appreciate. I updated my answer from the beginning with more details. – Z boson Mar 27 '17 at 07:44
  • 1
    @fighting_falcon93 I forgot to answer your question about OpenMP. If you compile with `-fopenmp` you will see `call GOMP_parallel` and other code, so the OpenMP assembly is not the same as without it: https://godbolt.org/g/yZkH23. – Z boson Mar 28 '17 at 11:05
  • @fighting_falcon93, bad news, I had `double mem = 2*sizeof(double)*N*R*1E-9;` instead of `double mem = 3*sizeof(double)*N*R*1E-9;` in my last test of bandwidth which means my numbers were about 30% lower than they should be. With the correct value I get sometimes up to 120% of the max bandwidth. That's obviously wrong. I don't know where the problem is. It's either in my formula or the max bandwidth is higher than 34.1 GB/s, or I am doing something wrong with benchmarking. Benchmarking is a pain! But getting more than 100% needs to be understood. – Z boson Mar 29 '17 at 12:28
  • @fighting_falcon93, okay, `memset` makes sense but `memcpy` has twice the bandwidth that I expect. I must have forgotten something. This leads me to think that I had it correct in my last test. – Z boson Mar 29 '17 at 12:48
  • @fighting_falcon93, okay, I think I found the problem. I have to initialize the arrays between each test. When I do that `memcpy` gives the correct bandwidth. I will have to update this tomorrow. – Z boson Mar 30 '17 at 14:06
  • 1
    @fighting_falcon93, I fixed my answer. The problem was that I was using uninitialized arrays. `memset(b, 1, sizeof *a * N)` fixed it! I rewrote the code. It's just one file now and much cleaner. I cleaned up the rest of my answer. I am happy with it now. – Z boson Mar 31 '17 at 12:36
  • There is another problem. VEX-encoded scalar operations are much faster than non-VEX-encoded scalar operations https://stackoverflow.com/questions/43256496/avx-scalar-operations-are-much-faster. This makes no sense. I will have to update my answer again (or make a new one) when I figure this out. Benchmarking is a PITA!!! – Z boson Apr 06 '17 at 13:31
  • Interesting. I'll take a look into it as well, thank you for the info :) – fighting_falcon93 Apr 06 '17 at 15:32
  • 1
    @fighting_falcon93, okay, I updated my answer again with the timing correction. Let me know what you think. I learned a lot from this question. – Z boson Apr 10 '17 at 11:10
  • Very interesting, thanks for the research done and the detailed answer. I'll experiment further on the things you mentioned to see how I can improve the running time on my machine. – fighting_falcon93 Apr 10 '17 at 14:16
9

Your asm code is really OK. What is not OK is the way you measure it. As I pointed out in the comments, you should:

a) use way more iterations - 1 million elements is nothing for a modern CPU

b) use a high-precision timer (HPT) for the measurement (see the sketch after this list)

c) use RDTSC or RDTSCP to count real CPU clock cycles
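
A minimal sketch of (a) and (b), assuming a POSIX system where clock_gettime with CLOCK_MONOTONIC is available (the helper name is just illustrative):

#include <stdio.h>
#include <time.h>

void mul_c(double* a, double* b);   // the function under test

double seconds_per_call(double* a, double* b, int reps)
{
    struct timespec t1, t2;
    clock_gettime(CLOCK_MONOTONIC, &t1);
    for (int r = 0; r < reps; ++r)
        mul_c(a, b);                // many iterations, not just one
    clock_gettime(CLOCK_MONOTONIC, &t2);
    return ((t2.tv_sec - t1.tv_sec) + (t2.tv_nsec - t1.tv_nsec) * 1e-9) / reps;
}

// e.g.: printf("%.3f ms per call\n", 1e3 * seconds_per_call(a, b, 1000));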

Additionally, why are you afraid of -O3 optimization? Don't forget to build the code for your platform, so use -march=native. If your CPU supports AVX or AVX2, the compiler will take the opportunity to produce even better code.

Next thing - give the compiler some hints about aliasing and alignment if you know your code.

Here is my version of your mul_c - yes, it is GCC-specific, but you showed you used GCC:

void mul_c(double* restrict a, double* restrict b)
{
    a = __builtin_assume_aligned (a, 16);
    b = __builtin_assume_aligned (b, 16);

    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}

It will produce:

mul_c(double*, double*):
        xor     eax, eax
.L2:
        movapd  xmm0, XMMWORD PTR [rdi+rax]
        mulpd   xmm0, XMMWORD PTR [rsi+rax]
        movaps  XMMWORD PTR [rdi+rax], xmm0
        add     rax, 16
        cmp     rax, 8000000
        jne     .L2
        rep ret

If you have AVX2 and make sure the data is 32-byte aligned, it will become:

mul_c(double*, double*):
        xor     eax, eax
.L2:
        vmovapd ymm0, YMMWORD PTR [rdi+rax]
        vmulpd  ymm0, ymm0, YMMWORD PTR [rsi+rax]
        vmovapd YMMWORD PTR [rdi+rax], ymm0
        add     rax, 32
        cmp     rax, 8000000
        jne     .L2
        vzeroupper
        ret

So there is no need for handcrafted asm if the compiler can do it for you ;)
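
If you want to guarantee that 32-byte alignment yourself, one possible sketch is C11 aligned_alloc plus the same assume-aligned hint with 32 instead of 16 (the function names are only for illustration):

#include <stdlib.h>

// allocate with 32-byte alignment; the size (8,000,000 bytes) is a multiple of 32
double* make_array(void)
{
    return (double*)aligned_alloc(32, sizeof(double) * 1000000);
}

// same loop as above, but promising 32-byte alignment to the compiler
void mul_c32(double* __restrict a, double* __restrict b)
{
    a = (double*)__builtin_assume_aligned(a, 32);
    b = (double*)__builtin_assume_aligned(b, 32);

    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];
    }
}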

Anty
  • I've tried to measure the running time with RDTSCP instead; I updated my question with the new code and results. As I wrote in the update, the number of cycles varies a lot, probably because the operating system runs other processes in between. Is there any way to only count the cycles during my program? Also, how come `mul_asm2` is so quick when counting cycles? The reason I don't use `-O3` is that the system I'll run the code on later doesn't allow me to specify flags, and it uses `-O2`; otherwise I would have used `-O3` ;) Also, thanks for the tip, I didn't know such hints were possible. – fighting_falcon93 Mar 23 '17 at 02:29
  • Also, the system I'll run it on later has support for AVX2, but not the system I'm working on right now, so that's why I'm only using 128-bit (XMM) registers right now. I'll change it to 256-bit registers (YMM) later. Would have been cool to use AVX-512 with 512-bit registers (ZMM) but neither of the two systems supports it :'( – fighting_falcon93 Mar 23 '17 at 02:52
  • @fighting_falcon93 the point of fixing the C source instead of writing asm is that you can just compile for both systems without changing the source (on yours it will compile without AVX2, on the target it will compile with AVX2, if the proper compile-time switches are used). So why are you still fixing the asm, if the C is enough to produce the optimal vectorized code? – Ped7g Mar 23 '17 at 04:25
  • 1
    @Ped7g Mainly because I want to learn. I think it's fun to write assembly and beat the compiler, and very often I notice that the compiler does silly things that aren't fully optimized. I do a lot of programming where performance is very important, where each millisecond less is better and you want your code to run as fast as it possibly can, for instance in games and when competing against others over who has the faster code on sites like CodeChef etc. So I'm trying to find new ways to push the performance of my implementations to the limit :) – fighting_falcon93 Mar 23 '17 at 07:52
4

I want to add another point of view to the problem. SIMD instructions give a big performance boost when there are no memory bandwidth restrictions. But in the current example there are too many memory load and store operations and too few CPU calculations, so the CPU can keep up with the incoming data even without SIMD. If you use data of another type (32-bit float, for example) or a more complex algorithm, memory throughput won't restrict CPU performance and using SIMD will give more of an advantage.
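
To illustrate the point about the element type: a 32-bit float version of the same loop does the same number of multiplications over half the bytes, i.e. twice the arithmetic intensity (just a sketch, not benchmarked):

void mul_c_float(float* a, float* b)
{
    for (int i = 0; i != 1000000; ++i)
    {
        a[i] = a[i] * b[i];  // 4-byte elements: half the memory traffic per multiply
    }
}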

ErmIg
  • That was my initial thought: memory bandwidth bound. But in my tests I still see a significant improvement with vectorization for N=1000000 (2 double arrays, so 16 MB). – Z boson Mar 23 '17 at 10:01
  • Well, taking the loop unrolling experiment in the OP (the last experiment) into account, I think we can conclude that the CPU simply was not able to perform all memory fetches in parallel that are physically possible. So, the OP *has* hit the memory barrier, just not in terms of throughput, but in terms of latency. – cmaster - reinstate monica Mar 23 '17 at 11:35
  • @ErmIg Good point. Is there any way of knowing for sure that the implementation is memory bound? Or any other kind of bound, for instance branch-misprediction bound or input/output bound? – fighting_falcon93 Mar 24 '17 at 02:14