
I am trying to understand the benefit of using SIMD vectorization, and I wrote a simple demonstrator to see what the speed gain of an algorithm leveraging vectorization (SIMD) would be over one that does not. Here are the two algorithms:

Alg_A - No Vector support:

#include <stdio.h>

#define SIZE 1000000000

int main() {
    printf("Algorithm with NO vector support\n");

    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;

    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
}

Alg_B - With Vector support:

#include <stdio.h>

#define SIZE 1000000000

typedef int v4_i __attribute__ ((vector_size(16)));
union Vec4 {
    v4_i v;
    int i[4];
};

int main() {
    printf("Algorithm with vector support\n\n");

    union Vec4 a, b;
    a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
    b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
        a.v = a.v + b.v;
    }

    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);
}

The compilation was done as follows:

Alg_A :

gcc -ggdb -mno-sse -mno-sse2 -mno-sse3 -mno-sse4 -mno-sse4.1 -mno-sse4.2 -c non_vector_support.c
gcc non_vector_support.o -o non_vector_support

Alg_B :

gcc -ggdb -c vector_support.c
gcc vector_support.o -o vector_support

Looking at the generated code for both algorithms, I can see that the compiler did not do any tricks like auto-vectorization (e.g. I could not see any xmm registers being used in Alg_A):

Alg_A :

    for (; i < SIZE; i++) {
  74:   eb 30                   jmp    a6 <main+0xa6>
        a[0] = a[0] + b[0];
  76:   8b 55 d8                mov    -0x28(%rbp),%edx
  79:   8b 45 e8                mov    -0x18(%rbp),%eax
  7c:   01 d0                   add    %edx,%eax
  7e:   89 45 d8                mov    %eax,-0x28(%rbp)
        a[1] = a[1] + b[1];
  81:   8b 55 dc                mov    -0x24(%rbp),%edx
  84:   8b 45 ec                mov    -0x14(%rbp),%eax
  87:   01 d0                   add    %edx,%eax
  89:   89 45 dc                mov    %eax,-0x24(%rbp)
        a[2] = a[2] + b[2];
  8c:   8b 55 e0                mov    -0x20(%rbp),%edx
  8f:   8b 45 f0                mov    -0x10(%rbp),%eax
  92:   01 d0                   add    %edx,%eax
  94:   89 45 e0                mov    %eax,-0x20(%rbp)
        a[3] = a[3] + b[3];
  97:   8b 55 e4                mov    -0x1c(%rbp),%edx
  9a:   8b 45 f4                mov    -0xc(%rbp),%eax
  9d:   01 d0                   add    %edx,%eax
  9f:   89 45 e4                mov    %eax,-0x1c(%rbp)
    int a[] = {1, 2, 3, 4};
    int b[] = {5, 6, 7, 8};
    int i = 0;

    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
  a2:   83 45 d4 01             addl   $0x1,-0x2c(%rbp)
  a6:   81 7d d4 ff c9 9a 3b    cmpl   $0x3b9ac9ff,-0x2c(%rbp)
  ad:   7e c7                   jle    76 <main+0x76>
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);

Alg_B :

    for (; i < SIZE; i++) {
  74:   eb 16                   jmp    8c <main+0x8c>
        a.v = a.v + b.v;
  76:   66 0f 6f 4d d0          movdqa -0x30(%rbp),%xmm1
  7b:   66 0f 6f 45 e0          movdqa -0x20(%rbp),%xmm0
  80:   66 0f fe c1             paddd  %xmm1,%xmm0
  84:   0f 29 45 d0             movaps %xmm0,-0x30(%rbp)
    union Vec4 a, b;
    a.i[0] = 1, a.i[1] = 2, a.i[2] = 3, a.i[3] = 4;
    b.i[0] = 5, b.i[1] = 5, b.i[2] = 7, b.i[3] = 8;
    int i = 0;
    printf("Running loop %d times\n", SIZE);
    for (; i < SIZE; i++) {
  88:   83 45 cc 01             addl   $0x1,-0x34(%rbp)
  8c:   81 7d cc ff c9 9a 3b    cmpl   $0x3b9ac9ff,-0x34(%rbp)
  93:   7e e1                   jle    76 <main+0x76>
        a.v = a.v + b.v;
    }

    printf("A: [%d %d %d %d]\n", a.i[0], a.i[1], a.i[2], a.i[3]);

When I run the programs, I was hoping to see an improvement in speed by a factor of 4; however, the gain appears to be on average =~ 1s for this size of data, and if I increase SIZE to around 8000000000 the gain is =~ 5s. This is the timing for the values in the above code:

Alg_A :

Algorithm with NO vector support
Running loop 1000000000 times
A: [705032705 1705032706 -1589934589 -589934588]

real    0m3.279s
user    0m3.282s
sys     0m0.000s

Alg_B :

Algorithm with vector support

Running loop 1000000000 times
A: [705032705 705032706 -1589934589 -589934588]

real    0m2.609s
user    0m2.607s
sys     0m0.004s

To account for the overhead associated with the loop itself, I ran an empty loop for the given SIZE and obtained =~ 2.2s on average. Subtracting that overhead (3.279s - 2.2s =~ 1.1s versus 2.609s - 2.2s =~ 0.4s) gives me an average speed-up of around 2.5 times.

Have I missed something in the code or compilation lines? Or is this supposed to be correct, in which case would someone know why there isn't a 4x gain in speed if I am exploiting 4 data points with 1 instruction at each iteration?

Thanks in advance.

  • I would suggest compiling your code with `-O2`, I believe this should give an almost infinite speedup... Doing speed test with `-O0` is always questionable. – cmaster - reinstate monica Nov 22 '14 at 09:46
  • @cmaster: but at higher optimization levels you are effectively testing the compiler optimization, instead of the algorithm. Suppose the simultaneous additions are vectorized in optimized code; then there would be *no* difference between the 2 programs. – Jongware Nov 22 '14 at 10:47
  • @cmaster The question i asked is not about optimizing code (in general). The question is quite specific: i want to observe the impact of SIMD. One that exploits parallelism in data (i.e. Alg_B) and one that does not (Alg_A). Hopefully, this guides you better towards helping me with understanding things :) – Gros Lalo Nov 22 '14 at 11:26
  • As @cmaster says, disabling compiler optimisation makes your test meaningless. – Paul R Nov 22 '14 at 16:21
  • @PaulR I am quite new to this thing but, what would be the relationship between -O2 and SIMD. Why would you want to compile Alg_A code with -O2 in order to observe the impact of SIMD? Could you tell a bit more? – Gros Lalo Nov 22 '14 at 21:42
  • Sure - without compiler optimisations there will be many redundant operations in both cases, because you get no peephole optimisation of the generated code, or other high level optimisations. There will be redundant loads and stores and poor register allocation. You might as well be comparing assembler code written by two noobs who have been on a drinking spree (except that it's functionally correct). – Paul R Nov 22 '14 at 21:47
  • @PaulR Wouldn't the systematic error in measurement induced by those redundant operations cancel each other when comparing the timing of both runs? In fact, just to try it out i added -O2 and in Alg_A i got 0.002s and for same size of data i got 0.406s for Alg_B which does not help me with comparing the +eve impact of SIMD. Looks like leveraging SIMD is bad. Is that so? – Gros Lalo Nov 22 '14 at 22:02
  • The point about compiler optimizations is, that you (hopefully) never deploy an application that's compiled without optimizations. It would be a pointless waste of energy and time. Also, unoptimized code is *insane*. Just look at what your compiler generated: it does not even try to use the registers efficiently. When you make a time measurement, you do it for a purpose, and that purpose is (hopefully) deployment. Measuring unoptimized code is an academic exercise at best, and downright misleading at worst. – cmaster - reinstate monica Nov 22 '14 at 22:10
  • @cmaster I understand very well and this is purely an academic exercise. As i mentioned it is a demonstrator code for a very specific purpose and this is why any other kind of "advantage" optimization flags were omitted from the compilation instructions. – Gros Lalo Nov 22 '14 at 22:22
  • @GrosLalo: no, the generated code will be very different and the redundancies will not "cancel out". As cmaster says, unoptimised code is insane - use it only for debugging, not anything where you care about performance or taking meaningful timing measurements. And no, leveraging SIMD is not bad, but it is non-trivial to do it right, so that you get close to theoretical best case throughput improvement. Try reading some of the SIMD posts here on SO to see what is possible in practice. – Paul R Nov 22 '14 at 23:54
  • @PaulR Are you saying that the experiment setup above is not good enough for one to understand how SIMD works? If it is not good then would it be possible for you to suggest an answer with a proper code and compilation instruction as this is the kind of help i am after. I am not looking at production code or, other optimization flags this time i am interested in an academic exercise as mentioned above. Thanks in advance. – Gros Lalo Nov 23 '14 at 08:34
  • I would say not, since the whole point of SIMD is its performance relative to non-vectorised code. You need to be able to take meaningful timing measurements for both cases for comparison and you can't do this without compiler optimisations enabled, as already explained above. – Paul R Nov 23 '14 at 08:47
  • @PaulR This is the part i don't understand: i tried with -O2 for both compilation and got 0.002s for Alg_A (**non**-vectorized one) and 0.406s for Alg_B (vectorized one). So, enabling optimization as you say does not help with understanding how SIMD is better. Furthermore, the problem is setup so that parallelization in Data makes it in favour of SIMD. In fact, i would argue that it is possibly the simplest formulation of the problem to highlight the benefit of SIMD: e.g. there is no dependencies of variables among instructions. Without -O2 i do get the a good speed up (2.5x) but not 4x. – Gros Lalo Nov 23 '14 at 08:57
  • Your scalar loop is probably being optimised away in the -O2 case - you need to check the generated code to see what's going on. Compilers are quite clever and you may need to take steps to prevent redundant timing loops etc from being optimised away. Once you have resolved that you should get meaningful results and you can then start to dig deeper into the more subtle aspects of vectorisation. – Paul R Nov 23 '14 at 09:20
  • @GrosLalo Listen to Paul R. It is absolutely pointless to benchmark without optimizations because the code is often artificially made slow to make it easy to compile and debuggable. If you want meaningful results, you need to turn on optimizations. That means you may need to play tricks to prevent the compiler from doing things like removing an entire useless loop. Micro-benchmarking is not easy. I'm surprised that this question has not been downvoted into oblivion for not turning on optimizations. (since that's what usually happens) – Mysticial Nov 24 '14 at 01:47
  • @Mystical Thank you for your input. However, it does not help. I hope that you read the question I posed fully and the subsequent comments and saw that this is an academic exercise for me to understand the impact of SIMD. If you still feel that my question is polluting SO then by all means go and down-vote it. – Gros Lalo Nov 24 '14 at 05:27
  • Whomever downvoted: it is important to give an explanation why this was done in connection to the question. Kindly, refrain any general explanation like "it is pointless, etc. etc.". E.g. tell why the *for* loop in both codes are not equivalent or, what extra instructions are introduced at compilation time that makes the 2 runs non-comparable. At the end of the day people are asking questions for a purpose and taking the effort to write up as much detail as possible to get an answer, if it beats you to take the time to explain, then there is no point for you to stick around and down-vote. – Gros Lalo Nov 24 '14 at 09:30
  • While people here suggesting the -o2 option have some valid points, there is simply no cure for that kind of a strong RAW dependency in OP's example. RAW dependencies render EVERY performance enhancing feature on ANY architecture useless. Besides, -o options might affect the resulting machine codes quite a lot, but it has relatively little effect on runtime of the resulting codes on CISC architectures like x86. It's a different story on RISC machines like ARM though. – Jake 'Alquimista' LEE Nov 24 '14 at 09:55
  • @GrosLalo I can also give you an answer why the non-SIMD version took so little time when compiled with the -O2 option: The compiler most probably did calculate the final results at build-time since you fed it with all the required numbers as constants. All the CPU does at runtime is then simply returning the pre-calculated numbers. When writing test codes, you should hide the test values from the functions: they shall get the numbers at runtime via arguments for example. Last but not least, people are right about the -o2 option. It wasn't exactly the culprit this time, but it will be. – Jake 'Alquimista' LEE Nov 24 '14 at 12:26
  • @Jake'Alquimista'LEE Thanks for the explanation. I am not disagreeing with you or anyone else. Hopefully this is clear:) My responses are simply questions which is helping understand better. I modified the code to give the loop size as an argument and with -O2 it still takes 0.002s. I also tested with -O1 and got 0.447s. I will study the generated -O1 code a bit more. – Gros Lalo Nov 24 '14 at 13:13
  • @GrosLalo The compiler seems to "cheat" again - this time with single-pass multiplications at runtime: a += n*b; All in all, it's really pleasant to see young people putting efforts in learning new stuff. Take a look at QnAs with [neon] tag. I think knowledge in NEON will probably be more useful than SSE for your career. – Jake 'Alquimista' LEE Nov 24 '14 at 13:26
  • @GrosLalo: I've put together some example code in an answer below - this should hopefully illustrate the benefits of SIMD versus scalar code a little better for you. – Paul R Nov 24 '14 at 22:59

3 Answers


I put together some sample code below to illustrate how you might see the benefits of SIMD versus scalar code. The example code is a little contrived, but the main point to note is that there need to be sufficient arithmetic operations in the loop to mitigate load/store latency and loop overheads - a single add operation, as in your initial experiment, is not sufficient.

This example achieves around 4x throughput improvement for 32 bit int data. There are two versions of the SIMD loop: one simple loop with no unrolling, and an alternate loop with 2x unrolling. As might be expected the unrolled loop is a little faster.

#include <assert.h>
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/time.h>   // gettimeofday
#include <smmintrin.h>  // SSE 4.1

static void foo_scalar(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n)
{
    for (size_t i = 0; i < n; ++i)
    {
        a[i] = (b[i] + c[i] + 1) / 2;
    }
}

static void foo_simd(uint32_t *a, const uint32_t *b, const uint32_t *c, size_t n)
{
    size_t i;

#ifndef UNROLL
    for (i = 0; i <= n - 4; i += 4)
    {
        __m128i vb = _mm_loadu_si128((__m128i *)&b[i]);
        __m128i vc = _mm_loadu_si128((__m128i *)&c[i]);
        __m128i v = _mm_add_epi32(vb, vc);
        v = _mm_add_epi32(v, _mm_set1_epi32(1));
        v = _mm_srli_epi32(v, 1);
        _mm_storeu_si128((__m128i *)&a[i], v);
    }
#else
    for (i = 0; i <= n - 8; i += 8)
    {
        __m128i vb0 = _mm_loadu_si128((__m128i *)&b[i]);
        __m128i vb1 = _mm_loadu_si128((__m128i *)&b[i + 4]);
        __m128i vc0 = _mm_loadu_si128((__m128i *)&c[i]);
        __m128i vc1 = _mm_loadu_si128((__m128i *)&c[i + 4]);
        __m128i v0 = _mm_add_epi32(vb0, vc0);
        __m128i v1 = _mm_add_epi32(vb1, vc1);
        v0 = _mm_add_epi32(v0, _mm_set1_epi32(1));
        v1 = _mm_add_epi32(v1, _mm_set1_epi32(1));
        v0 = _mm_srli_epi32(v0, 1);
        v1 = _mm_srli_epi32(v1, 1);
        _mm_storeu_si128((__m128i *)&a[i], v0);
        _mm_storeu_si128((__m128i *)&a[i + 4], v1);
    }
#endif
    foo_scalar(&a[i], &b[i], &c[i], n - i);
}

int main(int argc, char *argv[])
{
    const size_t kLoops = 100000;
    size_t n = 2 * 1024;
    struct timeval t0, t1;
    double t_scalar_ms, t_simd_ms;

    if (argc > 1)
    {
        n = atoi(argv[1]);
    }

    printf("kLoops = %zu, n = %zu\n", kLoops, n);

    uint32_t * a_scalar = malloc(n * sizeof(uint32_t));
    uint32_t * a_simd = malloc(n * sizeof(uint32_t));
    uint32_t * b = malloc(n * sizeof(uint32_t));
    uint32_t * c = malloc(n * sizeof(uint32_t));

    for (size_t i = 0; i < n; ++i)
    {
        a_scalar[i] = a_simd[i] = 0;
        b[i] = rand();
        c[i] = rand();
    }

    gettimeofday(&t0, NULL);
    for (size_t k = 0; k < kLoops; ++k)
    {
        foo_scalar(a_scalar, b, c, n);
    }
    gettimeofday(&t1, NULL);
    t_scalar_ms = ((double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6) * 1.0e3;

    gettimeofday(&t0, NULL);
    for (size_t k = 0; k < kLoops; ++k)
    {
        foo_simd(a_simd, b, c, n);
    }
    gettimeofday(&t1, NULL);
    t_simd_ms = ((double)(t1.tv_sec - t0.tv_sec) + (double)(t1.tv_usec - t0.tv_usec) * 1.0e-6) * 1.0e3;

    int64_t sum_scalar = 0, sum_simd = 0;
    for (size_t i = 0; i < n; ++i)
    {
        sum_scalar += a_scalar[i];
        sum_simd += a_simd[i];
    }
    assert(sum_scalar == sum_simd);

    printf("t_scalar = %8g ms = %8g ns / point\n", t_scalar_ms, t_scalar_ms / (kLoops * n) * 1e6);
    printf("t_simd   = %8g ms = %8g ns / point\n", t_simd_ms, t_simd_ms / (kLoops * n) * 1e6);
    printf("Speed-up = %2.1fx\n",  t_scalar_ms / t_simd_ms);

    return 0;
}

Compile and run (no SIMD loop unrolling):

$ gcc-4.8 -fno-tree-vectorize -std=gnu99 -Wall gros_lalo.c -O3 -msse4.1 && ./a.out
kLoops = 100000, n = 2048
t_scalar =  122.668 ms = 0.598965 ns / point
t_simd   =   33.785 ms = 0.164966 ns / point
Speed-up = 3.6x

Compile and run (2x SIMD loop unrolling):

$ gcc-4.8 -fno-tree-vectorize -std=gnu99 -Wall gros_lalo.c -O3 -msse4.1 -DUNROLL && ./a.out
kLoops = 100000, n = 2048
t_scalar =  121.897 ms =   0.5952 ns / point
t_simd   =    29.07 ms = 0.141943 ns / point
Speed-up = 4.2x

It is interesting to look at the generated code:

Scalar:

    xorl    %ecx, %ecx
    .align 4
L10:
    movl    0(%rbp,%rcx,4), %esi
    addl    (%rbx,%rcx,4), %esi
    addl    $1, %esi
    shrl    %esi
    movl    %esi, (%r15,%rcx,4)
    addq    $1, %rcx
    cmpq    %r12, %rcx
    jne L10

SIMD (no unrolling):

    xorl    %ecx, %ecx
    xorl    %eax, %eax
    .align 4
L18:
    movdqu  0(%rbp,%rcx), %xmm2
    addq    $4, %rax
    movdqu  (%rbx,%rcx), %xmm1
    paddd   %xmm2, %xmm1
    paddd   %xmm3, %xmm1
    psrld   $1, %xmm1
    movdqu  %xmm1, (%r14,%rcx)
    addq    $16, %rcx
    cmpq    %r9, %rax
    jbe L18

SIMD (2x unrolling):

    xorl    %edx, %edx
    xorl    %ecx, %ecx
    .align 4
L18:
    movdqu  0(%rbp,%rdx), %xmm5
    addq    $8, %rcx
    movdqu  (%r11,%rdx), %xmm4
    movdqu  (%rbx,%rdx), %xmm2
    movdqu  (%r10,%rdx), %xmm1
    paddd   %xmm5, %xmm2
    paddd   %xmm4, %xmm1
    paddd   %xmm3, %xmm2
    paddd   %xmm3, %xmm1
    psrld   $1, %xmm2
    psrld   $1, %xmm1
    movdqu  %xmm2, 0(%r13,%rdx)
    movdqu  %xmm1, (%rax,%rdx)
    addq    $32, %rdx
    cmpq    %r15, %rcx
    jbe L18

Note that there are a similar number of instructions in the first two loops, but the SIMD loop is of course processing four elements per iteration, whereas the scalar loop is only processing one element per iteration. For the third, unrolled loop we have more instructions but we are processing eight elements per iteration - note that the proportion of loop housekeeping instructions has been reduced relative to the SIMD loop without loop unrolling.

Timing data was collected on a 2.6 GHz Haswell Core i7 CPU using gcc 4.8 on Mac OS X 10.10. Performance results should be similar on any reasonably current x86 CPU, however.

Paul R

The biggest problem here is that you benchmarked with optimization disabled. GCC's default is -O0 debug mode, which keeps all variables in memory between C statements! That's generally useless and massively distorts your results by introducing a store/reload into the dependency chain from the output of one iteration to the input of the next.


Using vector operations exploits SIMD parallelism in your program. But it does not speed up the sequential parts of your program, like the time it takes to load your program or to print to the screen. This limits the maximum speedup your program can attain. This is Amdahl's law.
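To put illustrative numbers on that (my example figures, not measurements from the question): if 80% of the runtime were in the vectorizable additions and that part got a perfect 4x speedup, Amdahl's law would cap the overall gain at 1 / (0.2 + 0.8/4) = 2.5x; the remaining serial 20% quickly comes to dominate.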

In addition, your x86 processor takes advantage of parallelism even in non-SIMD code. Intel's Haswell processor has four scalar-integer ALUs, so it can do 4 adds per clock if 4 add instructions have their inputs ready that cycle.

Two of Haswell's execution ports have SIMD-integer execution units that can run paddd. But your loop only has one dependency chain for paddd, vs. four independent ones for add.

Instruction-throughput bottlenecks are also a factor: the front-end can only supply up to 4 uops per clock. All the store/reload mov instructions mean the scalar version may be bumping into that bottleneck. With 2x mov-load + add + mov-store, the front-end can only supply 1 block of 4 instructions (including 1 add) per clock cycle. But the store-forwarding bottleneck lengthens the dependency chain from 1 cycle for add on its own to about 5 or 6 cycles with add + store/reload, so those dependency chains can still overlap.
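As a rough sanity check (my back-of-envelope estimate, using the ~5-6 cycle store-forwarding figure above): a loop whose critical path is a single store/reload chain costs roughly 5-6 cycles per iteration, so 10^9 iterations come to about 5-6 billion cycles, i.e. a couple of seconds at the 2-3 GHz of the CPUs mentioned in this thread. That is the same order as both measured runtimes, and as the ~2.2 s empty -O0 loop from the question, whose loop counter also lives in memory.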


So you are comparing the execution time not of a sequential execution against a parallel one, but of two parallel executions: one with scalar ILP and one with SIMD.

Anti-optimized debug-mode code is a huge bottleneck for your SIMD version, too. Really it's an even bigger bottleneck there, because there's less other work to fill the gap created by the latency. SIMD store/reload is also about a cycle higher latency than scalar integer.
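If you want to see the SIMD effect without those debug-mode distortions, one option is to rebuild the experiment at -O2 and stop the compiler from collapsing the loop. The following is a minimal sketch of mine (not code from the question): the iteration count comes from the command line, and the empty inline-asm statement with a "+x" (SSE register) constraint makes GCC treat the accumulator as modified every iteration, so it can neither hoist the work out of the loop nor turn it into a single multiply, while still keeping the vector in a register:

#include <stdio.h>
#include <stdlib.h>

typedef int v4_i __attribute__ ((vector_size(16)));

int main(int argc, char *argv[]) {
    /* take the iteration count at runtime so it is not a compile-time constant */
    long size = (argc > 1) ? atol(argv[1]) : 1000000000L;

    v4_i a = {1, 2, 3, 4};
    v4_i b = {5, 6, 7, 8};

    for (long i = 0; i < size; i++) {
        a = a + b;
        /* optimization barrier: "a" stays in an xmm register, but the
           compiler must assume it changes on every iteration */
        __asm__ volatile("" : "+x"(a));
    }

    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
    return 0;
}

A scalar counterpart can be kept honest the same way with "+r" constraints on its four ints, and the two versions can then be compared at -O2.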

See https://stackoverflow.com/tags/x86/info and https://agner.org/optimize/ for more details, and also David Kanter's Haswell microarchitecture deep dive for some block diagrams of the CPU along with explanations.

Peter Cordes
Craig S. Anderson
  • Thanks Craig for the prompt reply. However, given that both codes are running on the same computer and let's assume that there are 2 ALUs. Wouldn't we then see 2 times faster case in those examples? I think that the important part of my program are is the code block that is being executed millions of times in *sequence* in Alg_A and then in *parallel* in Alg_B. All other codes around this are trivial in terms of execution time. Does Amdahl's Law apply in this code block? – Gros Lalo Nov 22 '14 at 09:37
  • You should time a program that loops millions of times but does nothing inside the loop. You will need to do this with optimizations off. This will tell you the program overhead time. Another thing going on is the speed of a vector add versus an ALU add. They aren't guaranteed to take the same amount of time. It depends on your processor. – Craig S. Anderson Nov 22 '14 at 10:08
  • can you add code to time only the loop and post what you get then? – midor Nov 22 '14 at 11:17
  • @CraigAnderson and cmaster: i think you both misunderstood. Perhaps i clarify here: the program **runs** (and **not** loads) in 2-3s. Also i am running the loop **1 billion** times (i.e. 1000 million). Just wanted to make things clear (am not intending to dumb things out). Thank you for your understanding. – Gros Lalo Nov 22 '14 at 11:18
  • @midor I am measuring time with time() function. It should be fine in this case as the I am running the loop 1 billion times compared to other instructions (like declaring arrays etc.) – Gros Lalo Nov 22 '14 at 11:28
  • An Intel Xeon E5520 running at 2.27 GHz takes 3 seconds to run an unoptimized empty **for** loop that runs for 1 billion iterations. (Source: me). That 3 seconds is all loop overhead which is not parallelizable. You need to measure how much of *your* program is in the non parallelizable part. – Craig S. Anderson Nov 22 '14 at 11:28
  • @CraigAnderson: I get about =~ 2.5 times gain. Would you be able to edit your answer and add the part about measuring the duration of running an empty loop and substracting that from the running time of the algorithms. I will then mark it as a solution. Thanks again for your help. – Gros Lalo Nov 22 '14 at 11:40
  • @CraigAnderson: I am considering to make your proposal an answer because I assume that it is not possible to actually get 4 times. Is that so? – Gros Lalo Nov 22 '14 at 11:57
  • @GrosLalo Yes, it is possible to get close to 4x speedup, but not with your program. It all depends on how much work you do within one iteration, how much memory you touch in an iteration, and how well you can vectorize what you do. There are not too many cases where you can get 4x improvement in practice any more, though: the gap between CPU performance and memory performance has become too wide with the result that most code is memory bound and the CPU is twiddling its thumbs... – cmaster - reinstate monica Nov 22 '14 at 14:28
  • Intel's Haswell processor has 4 integer ALUs, they dispatch out of ports 0,1,5 and 6. see http://www.realworldtech.com/haswell-cpu/4/ – amdn Nov 22 '14 at 19:39
  • @cmaster Thank you for the info. Could you tell why Alg_B code is a bad candidate for a code that would reach 4x speed-up? I looked at the assembly code generated for that algorithm and it is effectively just 4 instructions (see lines 76 -> 84). Thank you in advance. – Gros Lalo Nov 22 '14 at 21:49
  • @GrosLalo There's simply not enough work in an iteration, the loop control dominates your time. You could say your example is loop control bound. To get good SIMD speedup, you need a loop that's computation bound. And to get 4x speedup, the computation must take more time than any other operation by a factor of 4. Your measurement clearly shows that this is not the case with the algorithm that you've chosen. (By "algorithm" I do *not* mean your Alg_B code, I mean the abstract algorithm that is implemented by both codes.) – cmaster - reinstate monica Nov 22 '14 at 22:00
  • @cmaster Thanks again for the prompt reply. Interesting point. Do you imply that based on the generated assembly code (i.e. algorithm) above, there is idling happening? Perhaps, this is the 'instruction latency' reference that jake-alquimista-lee' is pointing to below. Again thanks in advance for your keen interest in helping me understand this issue. – Gros Lalo Nov 22 '14 at 22:33
  • @GrosLalo Latency is one of the two factors in calculating how fast an assembler code can be executed. If you need the result of one instruction for the input of the next, the two instructions can't be handled in parallel. Most arithmetic instructions have a latency of one cycle, memory loads need to wait for the cache which gives them a long latency. The other factor is the available resources. If you have one ALU, you can't process two arithmetic instructions in parallel. You can find a lot of the CPU dependent details here: http://agner.org/optimize/instruction_tables.pdf – cmaster - reinstate monica Nov 23 '14 at 08:53
  • @cmaster Thanks you for the tip about the pdf (and website). I counted the latencies associated to Alg_B and there would be between (0-3) for the 4 instructions in the block. It does not seem to be too much cycle wastage. Also, 'admn' pointed out that there are 4 ALUs in haswell (i am using i5-4300U) and also gave a link. I went through it and it was overwhelming :) Perhaps,that those 4 ALUs does give advantage partial advantage to Alg_A so that timing wise, Alg_B is going to be in best case =~ 2.5 times Alg_A for my computer. It would be interesting to see how these two codes behave elsewhere – Gros Lalo Nov 23 '14 at 09:22
  • @cmaster Correct me if I'm wrong. I don't know much about x86, but the OP's code wouldn't suffer from any loop overheads on a high-end ARM SoC with the Cortex-A15: the cmp instruction will be issued in parallel with the first SIMD instruction and backward jump will simply vanish due to the branch folding feature. I can't imagine that a modern x86 CPU doesn't have these features. – Jake 'Alquimista' LEE Nov 23 '14 at 14:41
  • @Jake'Alquimista'LEE I'm sorry, I'm really not into the details of x86. I simply saw the comment by Craig Anderson saying that an unoptimized, empty loop takes roughly the same time as the loop given by the OP. Consequently, I jumped to the conclusion that the loop of the OP is loop control bound. That's all, and I may be very wrong. – cmaster - reinstate monica Nov 23 '14 at 20:36
  • @cmaster: yup, the XMM store-forwarding + `paddd` latency is mostly hidden by the empty-loop latency with a loop counter in memory(!!!). This is all super pointless; Sandybridge-family store-forwarding latency is sensitive to trying to reload too soon: that makes it worse. So there are huge distortions and this isn't telling us much about what a real sum-of-array loop would be like: [Adding a redundant assignment speeds up code when compiled without optimization](//stackoverflow.com/q/49189685) I made an edit to this answer to account for the fact that most of the latency is store/reload – Peter Cordes Aug 02 '19 at 01:32
  • @Craig: let me know if you'd rather I move most of my edit to a separate answer. It ended up being bigger than I planned when I started correcting the add throughput info. – Peter Cordes Aug 02 '19 at 01:34
  • @PeterCordes - Sure, that it fine with me. Whatever you think is best. – Craig S. Anderson Aug 21 '19 at 16:05
  • Since ok with what your answer currently says, I'd rather leave my edit in place instead of spending more effort. :P – Peter Cordes Aug 21 '19 at 16:45

That must be the instruction latency (RAW dependency). While scalar ALU instructions have little to no latency, i.e. their results can be the operands of the next instruction without any delay, SIMD instructions tend to have long latencies until their results are available, even for simple ones like add.

Extend the arrays to 16 or even 32 entries, spanning 4 or 8 SIMD vectors, and you will see huge differences thanks to instruction scheduling.

Now (one accumulator):  add v -> (wait for latency) -> add v -> (wait) -> . . .

4-vector rotation:  add v1, add v2, add v3, add v4, add v1, add v2, . . .
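
As a concrete sketch of that 4-vector rotation (my illustration, written in the same GCC vector-extension style as the OP's Alg_B, not code from this answer): use four independent accumulators so consecutive vector adds do not wait on each other's results, and combine them after the loop. The totals come out the same as the single-accumulator version as long as SIZE is a multiple of 4; how much of the benefit shows up still depends on compiling with optimization enabled, as discussed in the comments above.

#include <stdio.h>

#define SIZE 1000000000   /* multiple of 4 for this sketch */

typedef int v4_i __attribute__ ((vector_size(16)));

int main() {
    v4_i a0 = {1, 2, 3, 4};   /* carries the original starting values */
    v4_i a1 = {0, 0, 0, 0};
    v4_i a2 = {0, 0, 0, 0};
    v4_i a3 = {0, 0, 0, 0};
    v4_i b  = {5, 6, 7, 8};

    for (long i = 0; i < SIZE; i += 4) {
        a0 = a0 + b;   /* four independent loop-carried chains     */
        a1 = a1 + b;   /* instead of one, so the adds can overlap  */
        a2 = a2 + b;
        a3 = a3 + b;
    }

    v4_i a = a0 + a1 + a2 + a3;   /* same total as SIZE additions of b */
    printf("A: [%d %d %d %d]\n", a[0], a[1], a[2], a[3]);
}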

Google for "instruction scheduling" and "RAW dependency" for more detailed info.

Jake 'Alquimista' LEE
  • @Jake-alquimista-lee Thank you for the info. I will read-up about "instruction scheduling" and "raw dependency". Just to be clear, you think that there is a lot of latencies happening between the 4 generated instructions for the vector addition in Alg_B (Ref: 'lines' 76 -> 84 object dump i pasted earlier)? Thanks in advance. – Gros Lalo Nov 22 '14 at 21:54
  • Instruction latencies vary from processor to processor. Which processor are you using? – Craig S. Anderson Nov 22 '14 at 23:24
  • @CraigAnderson Haswell - Intel(R) Core(TM) i5-4300U CPU @ 1.90GHz – Gros Lalo Nov 23 '14 at 08:23
  • @CraigAnderson If you were asking me, I'm not much of an Intel guy, but ARM. ARM's NEON, the SSE equivalent has 3~4 cycles latencies for simple operations like add and over six cycles for multiplications. I'm quite confident that OP's SIMD routine suffers from the pipeline hazards since it uses the result of the previous iteration as an operand, and RAW dependencies cripple all the fancy performance enhancing features like superscalar and OOOE. – Jake 'Alquimista' LEE Nov 23 '14 at 13:42
  • `paddd` has the same 1-cycle latency as scalar `add` on Haswell, with 2-per-clock SIMD integer add throughput vs. 4-per-clock for scalar `add`. **The problem is compiling with `-O0` so store-forwarding latency dominates the loop-carried dependency chain.** (And XMM store/reload latency is 1 cycle higher than scalar.) For the scalar case, the 4 independent store/reload chains mostly overlap with each other so yes there's more ILP there so it's not 4x slower. And yes, doing more than 1 vector at a time would allow the SIMD code to have some ILP, too. – Peter Cordes Aug 02 '19 at 01:01
  • @PeterCordes : Maybe would have been better to make your edit a new answer? Was a pretty significant edit. – Michael Petch Aug 02 '19 at 01:25
  • @MichaelPetch: on Craig's answer? Yeah maybe. I started editing to fix the mistake where it claimed Haswell had 2/clock `add` throughput, and kind of kept going further than I'd initially planned. – Peter Cordes Aug 02 '19 at 01:27
  • @PeterCordes : Yes, was referring to the other question sorry. I pinged you here because I knew you'd get it. Wasn't sure you get notified in an answer you have edited. – Michael Petch Aug 02 '19 at 01:31
  • @MichaelPetch: Yes, you can \@notify people that have edited a post when commenting under it. – Peter Cordes Aug 02 '19 at 01:34