40

I am investigating the effect of vectorization on the performance of the program. In this regard, I have written the following code:

#include <stdio.h>
#include <sys/time.h>
#include <stdlib.h>

#define LEN 10000000

int main(){

    struct timeval stTime, endTime;

    double* a = (double*)malloc(LEN*sizeof(*a));
    double* b = (double*)malloc(LEN*sizeof(*b));
    double* c = (double*)malloc(LEN*sizeof(*c));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    gettimeofday(&stTime, NULL);

    for(k = 0; k < LEN; k++)
        c[k] = a[k] * b[k];

    gettimeofday(&endTime, NULL);

    FILE* fh = fopen("dump", "w");
    for(k = 0; k < LEN; k++)
        fprintf(fh, "c[%d] = %f\t", k, c[k]);
    fclose(fh);

    double timeE = (double)(endTime.tv_usec + endTime.tv_sec*1000000 - stTime.tv_usec - stTime.tv_sec*1000000);

    printf("Time elapsed: %f\n", timeE);

    return 0;
}

In this code, I am simply initializing and multiplying two vectors. The results are saved in vector c. What I am mainly interested in is the effect of vectorizing the following loop:

for(k = 0; k < LEN; k++)
    c[k] = a[k] * b[k];

I compile the code using the following two commands:

1) icc -O2 TestSMID.c -o TestSMID -no-vec -no-simd
2) icc -O2 TestSMID.c -o TestSMID -vec-report2

I expect to see a performance improvement, since the second command successfully vectorizes the loop. However, my measurements show no performance improvement when the loop is vectorized.

I may have missed something here since I am not super familiar with the topic. So, please let me know if there is something wrong with my code.

Thanks in advance for your help.

PS: I am using Mac OSX, so there is no need to align the data, as all the allocated memory is 16-byte aligned.
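(For illustration only: if explicit alignment ever did need to be forced, one portable way on OS X/POSIX would be posix_memalign. This is just a sketch and is not used in the code above.)

#include <stdlib.h>

/* Allocate n doubles aligned to 16 bytes; returns NULL on failure. */
static double* alloc_aligned_doubles(size_t n){
    void* p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return (double*)p;
}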

Edit: I would first like to thank you all for your comments and answers. I thought about the answer proposed by @Mysticial, and there are some further points that should be mentioned here. Firstly, as @Vinska mentioned, c[k]=a[k]*b[k] does not take only one cycle. In addition to incrementing the loop index and comparing k against LEN, there are other things that have to be done to perform the operation. Looking at the assembly code generated by the compiler, it can be seen that a simple multiplication needs much more than one cycle. The vectorized version looks like:

L_B1.9:                         # Preds L_B1.8
        movq      %r13, %rax                                    #25.5
        andq      $15, %rax                                     #25.5
        testl     %eax, %eax                                    #25.5
        je        L_B1.12       # Prob 50%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.10:                        # Preds L_B1.9
        testb     $7, %al                                       #25.5
        jne       L_B1.32       # Prob 10%                      #25.5
                                # LOE rbx r12 r13 r14 r15
L_B1.11:                        # Preds L_B1.10
        movsd     (%r14), %xmm0                                 #26.16
        movl      $1, %eax                                      #25.5
        mulsd     (%r15), %xmm0                                 #26.23
        movsd     %xmm0, (%r13)                                 #26.9
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.12:                        # Preds L_B1.11 L_B1.9
        movl      %eax, %edx                                    #25.5
        movl      %eax, %eax                                    #26.23
        negl      %edx                                          #25.5
        andl      $1, %edx                                      #25.5
        negl      %edx                                          #25.5
        addl      $10000000, %edx                               #25.5
        lea       (%r15,%rax,8), %rcx                           #26.23
        testq     $15, %rcx                                     #25.5
        je        L_B1.16       # Prob 60%                      #25.5
                                # LOE rdx rbx r12 r13 r14 r15 eax
L_B1.13:                        # Preds L_B1.12
        movl      %eax, %eax                                    #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.14:                        # Preds L_B1.14 L_B1.13
        movups    (%r15,%rax,8), %xmm0                          #26.23
        movsd     (%r14,%rax,8), %xmm1                          #26.16
        movhpd    8(%r14,%rax,8), %xmm1                         #26.16
        mulpd     %xmm0, %xmm1                                  #26.23
        movntpd   %xmm1, (%r13,%rax,8)                          #26.9
        addq      $2, %rax                                      #25.5
        cmpq      %rdx, %rax                                    #25.5
        jb        L_B1.14       # Prob 99%                      #25.5
        jmp       L_B1.20       # Prob 100%                     #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.16:                        # Preds L_B1.12
        movl      %eax, %eax                                    #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.17:                        # Preds L_B1.17 L_B1.16
        movsd     (%r14,%rax,8), %xmm0                          #26.16
        movhpd    8(%r14,%rax,8), %xmm0                         #26.16
        mulpd     (%r15,%rax,8), %xmm0                          #26.23
        movntpd   %xmm0, (%r13,%rax,8)                          #26.9
        addq      $2, %rax                                      #25.5
        cmpq      %rdx, %rax                                    #25.5
        jb        L_B1.17       # Prob 99%                      #25.5
                                # LOE rax rdx rbx r12 r13 r14 r15
L_B1.18:                        # Preds L_B1.17
        mfence                                                  #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.19:                        # Preds L_B1.18
        mfence                                                  #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.20:                        # Preds L_B1.14 L_B1.19 L_B1.32
        cmpq      $10000000, %rdx                               #25.5
        jae       L_B1.24       # Prob 0%                       #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.22:                        # Preds L_B1.20 L_B1.22
        movsd     (%r14,%rdx,8), %xmm0                          #26.16
        mulsd     (%r15,%rdx,8), %xmm0                          #26.23
        movsd     %xmm0, (%r13,%rdx,8)                          #26.9
        incq      %rdx                                          #25.5
        cmpq      $10000000, %rdx                               #25.5
        jb        L_B1.22       # Prob 99%                      #25.5
                                # LOE rdx rbx r12 r13 r14 r15
L_B1.24:                        # Preds L_B1.22 L_B1.20

And the non-vectorized version is:

L_B1.9:                         # Preds L_B1.8
        xorl      %eax, %eax                                    #25.5
                                # LOE rbx r12 r13 r14 r15 eax
L_B1.10:                        # Preds L_B1.10 L_B1.9
        lea       (%rax,%rax), %edx                             #26.9
        incl      %eax                                          #25.5
        cmpl      $5000000, %eax                                #25.5
        movsd     (%r15,%rdx,8), %xmm0                          #26.16
        movsd     8(%r15,%rdx,8), %xmm1                         #26.16
        mulsd     (%r13,%rdx,8), %xmm0                          #26.23
        mulsd     8(%r13,%rdx,8), %xmm1                         #26.23
        movsd     %xmm0, (%rbx,%rdx,8)                          #26.9
        movsd     %xmm1, 8(%rbx,%rdx,8)                         #26.9
        jb        L_B1.10       # Prob 99%                      #25.5
                                # LOE rbx r12 r13 r14 r15 eax

Besides this, the processor does not load only 24 bytes. On each access to memory, a full cache line (64 bytes) is loaded. More importantly, since the memory required for a, b, and c is contiguous, the prefetcher certainly helps a lot by loading the next blocks in advance. Having said that, I think the memory bandwidth calculated by @Mysticial is too pessimistic.

Moreover, using SIMD to improve the performance of a program for a very simple addition is mentioned in the Intel Vectorization Guide. Therefore, it seems we should be able to gain some performance improvement for this very simple loop.
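(For reference, here is roughly what a hand-vectorized version of the loop would look like using SSE2 intrinsics - a minimal sketch only, assuming LEN is a multiple of 2 and making no alignment assumptions; it is not the exact code icc generates.)

#include <emmintrin.h>  /* SSE2 intrinsics */

/* Element-wise multiply, two doubles per iteration.
   Unaligned loads/stores are used so no particular alignment is required. */
static void mul_sse2(const double* a, const double* b, double* c, int len){
    int k;
    for (k = 0; k < len; k += 2) {
        __m128d va = _mm_loadu_pd(a + k);
        __m128d vb = _mm_loadu_pd(b + k);
        _mm_storeu_pd(c + k, _mm_mul_pd(va, vb));
    }
}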

Edit2: Thanks again for your comments. Also, thanks to @Mysticial's sample code, I finally saw the effect of SIMD on performance. The problem, as Mysticial mentioned, was memory bandwidth. By choosing a small size for a, b, and c that fits into the L1 cache, it can be seen that SIMD helps to improve performance significantly. Here are the results I got:

icc -O2 -o TestSMIDNoVec -no-vec TestSMID2.c: 17.34 sec

icc -O2 -o TestSMIDVecNoUnroll -vec-report2 TestSMID2.c: 9.33 sec

And unrolling the loop improves the performance even further:

icc -O2 -o TestSMIDVecUnroll -vec-report2 TestSMID2.c -unroll=8: 8.6 sec

Also, I should mention that it takes only one cycle for my processor to complete an iteration when compiled with -O2.

PS: My computer is a MacBook Pro with a Core i5 @ 2.5 GHz (dual core).

Peter Cordes
Pouya
  • 1
    I've just updated my answer to prove that my processor is able to do 1 iteration per cycle, and to explain how that is possible. – Mysticial Aug 10 '13 at 20:16
  • 2
    I really hate to bring this up, but the build commands place both versions of the executable in the same file. It would have been much clearer if the two versions had different names. – wallyk Aug 11 '13 at 03:07
  • You say there's "no need to align", but the asm code generated checks for all alignment possibilities. There's one loop for sources unaligned, and one using `mulpd` with a memory operand. However, even the aligned version uses the weird `movsd` + `movhpd` sequence to load 128b. I think that's for `c` and `a` aligned, `b` unaligned (after scalar intro). I think I remember reading that on some older architectures, a 2 insn sequence was sometimes faster than `movupd`. The only-dest-aligned version of the loop uses `movupd` for one source, and the 2 insn method for the other, /boggle. – Peter Cordes Jul 05 '15 at 02:51
  • What size of `LEN` did you choose? – Manolete Nov 23 '18 at 18:44

4 Answers

81

This original answer was valid back in 2013. As of 2017 hardware, things have changed enough that both the question and the answer are out-of-date.

See the end of this answer for the 2017 update.


Original Answer (2013):

Because you're bottlenecked by memory bandwidth.

While vectorization and other micro-optimizations can improve the speed of computation, they can't increase the speed of your memory.

In your example:

for(k = 0; k < LEN; k++)
    c[k] = a[k] * b[k];

You are making a single pass over all the memory doing very little work. This is maxing out your memory bandwidth.

So regardless of how it's optimized (vectorized, unrolled, etc.), it isn't gonna get much faster.


A typical desktop machine of 2013 has on the order of 10 GB/s of memory bandwidth*.
Your loop touches 24 bytes/iteration.

Without vectorization, a modern x64 processor can probably do about 1 iteration a cycle*.

Suppose you're running at 4 GHz:

  • (4 * 10^9) * 24 bytes/iteration = 96 GB/s

That's almost 10x your memory bandwidth - without vectorization.


*Not surprisingly, a few people doubted the numbers I gave above since I gave no citation. Well, those were off the top of my head from experience. So here are some benchmarks to prove it.

The loop iteration can run as fast as 1 cycle/iteration:

We can get rid of the memory bottleneck if we reduce LEN so that it fits in cache.
(I tested this in C++ since it was easier. But it makes no difference.)

#include <iostream>
#include <cstdlib>      // for malloc() and rand()
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 256;

    double *a = (double*)malloc(LEN*sizeof(*a));
    double *b = (double*)malloc(LEN*sizeof(*a));
    double *c = (double*)malloc(LEN*sizeof(*a));

    int k;
    for(k = 0; k < LEN; k++){
        a[k] = rand();
        b[k] = rand();
    }

    clock_t time0 = clock();

    for (int i = 0; i < 100000000; i++){
        for(k = 0; k < LEN; k++)
            c[k] = a[k] * b[k];
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
  • Processor: Intel Core i7 2600K @ 4.2 GHz
  • Compiler: Visual Studio 2012
  • Time: 6.55 seconds

In this test, I ran 25,600,000,000 iterations in only 6.55 seconds.

  • 6.55 * 4.2 GHz = 27,510,000,000 cycles
  • 27,510,000,000 / 25,600,000,000 = 1.074 cycles/iteration

Now if you're wondering how it's possible to do:

  • 2 loads
  • 1 store
  • 1 multiply
  • increment counter
  • compare + branch

all in one cycle...

It's because modern processors and compilers are awesome.

While each of these operations has latency (especially the multiply), the processor is able to execute multiple iterations at the same time. My test machine is a Sandy Bridge processor, which is capable of sustaining 2x128b loads, 1x128b store, and 1x256b vector FP multiply every single cycle. And potentially another one or two vector or integer ops, if the loads are memory source operands for micro-fused uops. (2 loads + 1 store throughput is only possible when using 256b AVX loads/stores; otherwise there are only two total memory ops per cycle, at most one of them a store.)

Looking at the assembly (which I'll omit for brevity), it seems that the compiler unrolled the loop, thereby reducing the looping overhead. But it didn't quite manage to vectorize it.


Memory bandwidth is on the order of 10 GB/s:

The easiest way to test this is via a memset():

#include <iostream>
#include <cstdlib>      // for calloc()
#include <cstring>      // for memset()
#include <time.h>
using std::cout;
using std::endl;

int main(){
    const int LEN = 1 << 30;    //  1GB

    char *a = (char*)calloc(LEN,1);

    clock_t time0 = clock();

    for (int i = 0; i < 100; i++){
        memset(a,0xff,LEN);
    }

    clock_t time1 = clock();
    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
}
  • Processor: Intel Core i7 2600K @ 4.2 GHz
  • Compiler: Visual Studio 2012
  • Time: 5.811 seconds

So it takes my machine 5.811 seconds to write 100 GB of memory. That works out to about 17.2 GB/s.

And my processor is on the higher end. The Nehalem and Core 2 generation processors have less memory bandwidth.


Update March 2017:

As of 2017, things have gotten more complicated.

Thanks to DDR4 and quad-channel memory, it is no longer possible for a single thread to saturate memory bandwidth. But the problem of bandwidth doesn't necessarily go away. Even though bandwidth has gone up, processor cores have also improved - and there are more of them.

To put it mathematically:

  • Each core has a bandwidth limit X.
  • Main memory has a bandwidth limit of Y.
  • On older systems, X > Y.
  • On current high-end systems, X < Y. But X * (# of cores) > Y.

Back in 2013: Sandy Bridge @ 4 GHz + dual-channel DDR3 @ 1333 MHz

  • No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~17 GB/s
  • Vectorized SSE* (16-byte load/stores): X = 64 GB/s and Y = ~17 GB/s

Now in 2017: Haswell-E @ 4 GHz + quad-channel DDR4 @ 2400 MHz

  • No vectorization (8-byte load/stores): X = 32 GB/s and Y = ~70 GB/s
  • Vectorized AVX* (32-byte load/stores): X = 64 GB/s and Y = ~70 GB/s

(For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.)

So nowadays, a single thread will not always be able to saturate memory bandwidth. And you will need to vectorize to achieve that limit of X. But you will still hit the main memory bandwidth limit of Y with 2 or more threads.

But one thing hasn't changed and probably won't change for a long time: You will not be able to run a bandwidth-hogging loop on all cores without saturating the total memory bandwidth.
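As an illustration (not one of the benchmarks above), here is a minimal sketch of the multi-threaded version of the loop using OpenMP; the exact compiler flag depends on the toolchain (e.g. -fopenmp for GCC):

#include <omp.h>

/* Element-wise multiply split across cores. With large arrays, adding
   threads stops helping once the total memory bandwidth (Y) is saturated. */
void mul_parallel(const double* a, const double* b, double* c, long len){
    long k;
    #pragma omp parallel for schedule(static)
    for (k = 0; k < len; k++)
        c[k] = a[k] * b[k];
}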

Mysticial
  • Thanks for your answer. You are right. I complicated the things and experienced the performance improvement. – Pouya Aug 10 '13 at 07:11
  • 13
    +1: this needs to be in an FAQ or to become a "go to" answer - a large proportion of beginner optimisation questions seem to fall into this category. – Paul R Aug 10 '13 at 07:12
  • What if we compile it with -O0? Does the CPU perform each iteration in one cycle? – Pouya Aug 10 '13 at 20:38
  • @Pouya It depends, but I doubt it. Unoptimized code tends to have tons of useless instructions that are there just because they are there and you didn't tell the compiler to remove them. – Mysticial Aug 10 '13 at 20:41
  • @Mysticial, is it possible to block the loop `c[k] = a[k] * b[k]` to make it cache-friendly (for large `LEN`), or does that only work when you need to re-use data like in matrix multiplication? – matmul Jun 10 '14 at 15:25
  • 3
    @matmul It only works when you re-use data. If everything is being touched only once, there isn't much that can be done. – Mysticial Jun 10 '14 at 16:51
  • Great answer, but memory bandwidth should be checked per-scenario. In this dot-product case, I'd check peak read-only BW, which I believe may get much higher, since a write operation effectively uses double BW - you usually have to read the line first, then write it back a while later. This could take up extra buffers, extra slots on the intra-core buses, and eventually extra time on the DRAM channels. – Leeor Nov 09 '14 at 11:44
  • Note that SnB can only sustain 2x128b loads + 1x128b store per cycle if you use 256b AVX loads/stores. The store port doesn't have its own AGU. 256b memory ops are special, because they keep the data transfer part busy for two cycles, but only need the AGU in the first cycle, leaving ports 2/3 free to calculate a store address. Haswell added a separate store AGU along with widening the data paths, so it can sustain 2x256b loads + 1x256b store per cycle (to/from L1). Source - Agner Fog. I'm just parroting his microarch doc here. – Peter Cordes Jul 05 '15 at 02:19
  • Your statement `it isn't gonna get much faster` is not necessarily accurate. Particularly on the machine you tested on, I think that [using multiple threads you would get at least 50% more bandwidth](http://stackoverflow.com/questions/25179738/measuring-memory-bandwidth-from-the-dot-product-of-two-arrays). – Z boson Sep 25 '15 at 08:04
  • 5
    @Zboson Obviously it depends on the machine. You're unlikely to get full bandwidth on a single-thread on a machine with multiple NUMA nodes. On Haswell-E, the memory is fast enough where you may need to vectorize to max out bandwidth with only a single thread. That said, it doesn't take away from the point though. The code in this question is going to run into bandwidth problems sooner or later. – Mysticial Sep 25 '15 at 13:51
  • If I replace `memset` with `memcpy(a,b,LEN)` (`char *b = (char*)calloc(LEN,1);`) in your code and use the same `LEN` it takes about the same time. This does not make sense to me because the total memory read and written should be twice. I would expect it to take twice as long. I am testing on a Skylake (i7-6700HQ@2.60GHz) with DDR4. – Z boson Mar 29 '17 at 13:08
  • @Zboson Good point. [But looking around](https://www.aida64.com/sites/default/files/shot6_cachemem_ryzen_0.png), it seems that copy-bandwidth is about the same as read or write. Perhaps memory is bi-directional. – Mysticial Mar 29 '17 at 16:02
  • I wrote a benchmarking util 2.5 years ago. It takes twice as long for memcpy on my system. I don't see what the difference is yet looking at the code. I was much more on top of this 2 years ago. I can't get this simple test to work though. I mean memcpy takes about the same time as memset. It does not make sense. I will probably kick myself when I figure it out. – Z boson Mar 30 '17 at 12:48
  • If I initialize the array `b` to any value (including zero), then `memcpy` takes twice as long, as I expect. – Z boson Mar 30 '17 at 13:51
  • @Zboson What's even stranger is that `calloc()` doesn't always physically zero the memory. It leaves it unmapped so the OS zeros it on demand. That should theoretically make `memcpy()` even slower. – Mysticial Mar 30 '17 at 15:39
  • It turns out it has to be initialized to a non-zero value (e.g. `memset(b,1,LEN)`). Maybe since the OS knows the pages are zero it only reads one page. That would explain twice the bandwidth. I don't see the problem on my Haswell system (I have the OS on a SSD so I can test exactly the same config on each system). I only see it on my Skylake system. – Z boson Mar 31 '17 at 11:39
  • In case you care, here is what triggered me looking into this. Several edits later I got it right finally. If I had not compared to the maximum bandwidth I may not have realized something was wrong http://stackoverflow.com/a/42972674/2542702 – Z boson Mar 31 '17 at 13:07
  • @Zboson I've gone ahead and updated my answer to address some of the quirks of current hardware (as of 2017). – Mysticial Mar 31 '17 at 19:06
  • I made a question about the `memcpy` and `calloc` issue https://stackoverflow.com/questions/43168595/memcpy-takes-the-same-time-as-memset – Z boson Apr 02 '17 at 12:59
  • Okay, so something else is strange. If I compile with `-O2 -mavx` I get a much higher bandwidth than with `-O2`. In fact, it's much closer to your X = 32 GB/s. I don't understand how vex encoding could make such a difference. I was about to answer another SO question but I can't proceed until I understand this. Benchmarking is a pain and I am out of practice. – Z boson Apr 06 '17 at 12:54
  • https://stackoverflow.com/questions/43256496/avx-scalar-operations-are-scalar-operations-are-much-faster – Z boson Apr 06 '17 at 13:22
  • I updated my answer again after fixing the timing function http://stackoverflow.com/a/42972674/2542702. My answer is more in agreement with your latest update. I still see a small improvement with vectorization and threads (but not with unrolling). With scalar operations I get 60% of the peak. With SSE I get around 68% of the peak. With scalar operations and two threads I get about 75% of the peak. – Z boson Apr 10 '17 at 11:15
  • Your FLOPS program gets the maximum flops. But what about getting the maximum memory bandwidth? Currently I get maybe 75% of the maximum bandwidth. How can I get close to 100%? I guess I should ask this question on SO. – Z boson Apr 10 '17 at 11:17
  • @Zboson Bandwidth benchmarks are trickier. You'll need to properly use NT-stores and I believe prefetching as well. Personally, I've only done NT-stores with respect to bandwidth optimizations. But from what I've read, prefetching is needed to get around the size limit of the reorder buffer. Though I don't think that's necessary if you load up all the cores. – Mysticial Apr 10 '17 at 17:05
  • You write "For both Sandy Bridge and Haswell, architectural limits in the cache will limit bandwidth to about 16 bytes/cycle regardless of SIMD width.)". I guess this is a limit of the L3 cache? Do you have a link, paper, or maybe more to comment on this? I skimmed through Agner's docs and have not found anything yet. – Z boson Sep 04 '18 at 09:47
  • I found this https://software.intel.com/en-us/forums/intel-moderncode-for-parallel-architectures/topic/608964# which, if I read it correctly, says the bandwidth from L3 to L2 is at best 16 bytes/cycle. I guess that's what you are referring to. I'm surprised I don't know about this already. The reason seems a bit complicated. – Z boson Sep 04 '18 at 13:10
  • @Zboson My measurements were completely empirical. Just raw numbers from measurement - I never actually tried to look it up or understand to the metal since it seemed pretty obvious that it was cache-related in *some* way. – Mysticial Sep 04 '18 at 15:19
  • @Mysticial how did you measure that "Each core has a bandwidth limit X"? You measure the bandwidth of main memory, but it's not shown in your answer how you measure the bandwidth limit of a core. Normally I would quote the bandwidth limit of a core in L1, but I guess you mean L3. Did you run over memory that fits in L3 to measure the core bandwidth limit? My colleague is doing this now and I think he is seeing much less than 16 bytes/cycle in L3, but I need to verify that with him. – Z boson Sep 05 '18 at 08:02
  • @Zboson The "core bandwidth" is directly measurable if it's less than the total system memory bandwidth. Otherwise, it's sort of a hand-wavy estimate based on adjusting the size of the dataset to fit into upper-level caches. I don't remember exactly what I did to get the Sandy Bridge numbers. – Mysticial Sep 05 '18 at 16:18
  • @Mysticial, thank you, I understand now. I'm studying sparse matrix times vector operations so understanding memory bandwidth is important. – Z boson Sep 06 '18 at 09:05
3

As Mysticial already described, main-memory bandwidth limitations are the bottleneck for large buffers here. The way around this is to redesign your processing to work in chunks that fit in the cache. (Instead of multiplying a whole 200MiB of doubles, multiply just 128kiB, then do something with that. So the code that uses the output of the multiply will find it still in L2 cache. L2 is typically 256kiB, and is private to each CPU core, on recent Intel designs.)

This technique is called cache blocking, or loop tiling. It might be tricky for some algorithms, but the payoff is the difference between L2 cache bandwidth vs. main memory bandwidth.

If you do this, make sure the compiler isn't still generating streaming stores (movnt...). Those writes bypass the cache to avoid polluting it with data that won't fit. The next read of that data will need to touch main memory.
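A minimal sketch of what cache blocking could look like for this loop (the tile size and the summation "consumer" are illustrative assumptions - the point is only that something re-uses each chunk while it is still hot in L2):

#include <stddef.h>

#define TILE 8192   /* 8192 doubles = 64 KiB per array chunk; ~192 KiB for a, b, c together */

/* Multiply in cache-sized chunks and consume each chunk right away
   (here it is just summed, standing in for whatever real work uses c next). */
double multiply_tiled(const double* a, const double* b, double* c, size_t len){
    double acc = 0.0;
    size_t base, k;
    for (base = 0; base < len; base += TILE) {
        size_t n = (len - base < TILE) ? (len - base) : TILE;
        for (k = 0; k < n; k++)
            c[base + k] = a[base + k] * b[base + k];
        for (k = 0; k < n; k++)          /* "downstream" pass over the still-cached chunk */
            acc += c[base + k];
    }
    return acc;
}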

Peter Cordes
2

EDIT: Modified the answer a lot. Also, please disregard most of what I wrote before about Mysticial's answer not being entirely correct. Though I still do not agree that it is bottlenecked by memory: despite doing a very wide variety of tests, I couldn't see any signs of the original code being bound by memory speed, while it kept showing clear signs of being CPU-bound.


There can be many reasons. And since the reason[s] can be very hardware-dependent, I decided I shouldn't speculate based on guesses. I am just going to outline the things I encountered during later testing, where I used a much more accurate and reliable CPU-time measuring method and looped the loop 1000 times. I believe this information could be of help. But please take it with a grain of salt, as it is hardware-dependent.

  • When using instructions from the SSE family, the vectorized code I got was over 10% faster than the non-vectorized code.
  • Vectorized code using SSE-family and vectorized code using AVX ran more or less with the same performance.
  • When using AVX instructions, non-vectorized code ran the fastest - 25% or more faster than every other thing I tried.
  • Results scaled linearly with CPU clock in all cases.
  • Results were hardly affected by memory clock.
  • Results were considerably affected by memory latency - much more than memory clock, but not nearly as much as CPU clock affected the results.

WRT Mysticial's example of running nearly 1 iteration per clock - I didn't expect the CPU scheduler to be that efficient and was assuming 1 iteration every 1.5-2 clock ticks. But to my surprise, that is not the case; I sure was wrong, sorry about that. My own CPU ran it even more efficiently - 1.048 cycles/iteration. So I can attest that this part of Mysticial's answer is definitely right.

librin.so.1
  • `Along with the multiply instruction, the code of the loop has to execute several other instructions as well, including the conditional` Ah, you did not show us the _real_ code. Adding conditionals inside a loop will effectively screw the branch prediction. BTW the few percent gain you report is futile. You are still bound by bus bandwidth. IMHO the manual unrolling only causes fewer branch prediction misses, since there are fewer iterations. The L1 locality is basically the same. – wildplasser Aug 10 '13 at 12:57
  • @wildplasser define "real code". Also some other things: the total size of data is 10,000,000 * 8 * 3 = 228 megabytes. On my normal clocks, my theoretical memory bandwidth is 29.8 GB/s. That portion of the code runs for about 1.1 second if I set my CPU to the lowest available clock speed. In that time it can send the whole data 131 times over. So I don't see where a memory bottleneck would occur. Also, a "memory bottleneck" theory wouldn't go with the fact that if I double my CPU clock, that portion of the code starts running twice as fast, while doubling the memory clock does hardly anything. – librin.so.1 Aug 10 '13 at 13:23
  • @wildplasser Also, few percent? The difference between the fastest non-vectorized and the fastest vectorized one is a bit more than 6.5%. That might not look like much, but it may be very significant on a larger scale. With such a difference, it would mean e.g. spending 11 hours and 20 minutes of CPU time instead of spending 12 hours. A whopping 40 minutes. Little things do add up, so it's far from "futile". – librin.so.1 Aug 10 '13 at 13:29
  • Copying to automatic storage avoids/reduces L2 cache effects. It shaves off 30% here. I'll add it as an answer, since I need the formatting. – wildplasser Aug 10 '13 at 13:54
  • WRT `real code` : I first thought that you were the OP. Sorry! – wildplasser Aug 10 '13 at 14:26
  • BTW, I updated the first part of my post. Among other things, I included instruction counts. – librin.so.1 Aug 10 '13 at 14:33
0

Just in case a[], b[], and c[] are fighting for the L2 cache:

#include <string.h> /* for memcpy */

...

    gettimeofday(&stTime, NULL);

    for(k = 0; k < LEN; k += 4) {
        double a4[4], b4[4], c4[4];
        memcpy(a4, a+k, sizeof a4);
        memcpy(b4, b+k, sizeof b4);
        c4[0] = a4[0] * b4[0];
        c4[1] = a4[1] * b4[1];
        c4[2] = a4[2] * b4[2];
        c4[3] = a4[3] * b4[3];
        memcpy(c+k, c4, sizeof c4);
    }

    gettimeofday(&endTime, NULL);

Reduces the running time from 98429.000000 to 67213.000000; unrolling the loop 8-fold reduces it to 57157.000000 here.
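(For reference, a sketch of what the 8-fold unrolled variant could look like - a reconstruction, not necessarily the exact code that was timed; it assumes LEN is a multiple of 8.)

    for(k = 0; k < LEN; k += 8) {
        double a8[8], b8[8], c8[8];
        memcpy(a8, a+k, sizeof a8);
        memcpy(b8, b+k, sizeof b8);
        c8[0] = a8[0] * b8[0];
        c8[1] = a8[1] * b8[1];
        c8[2] = a8[2] * b8[2];
        c8[3] = a8[3] * b8[3];
        c8[4] = a8[4] * b8[4];
        c8[5] = a8[5] * b8[5];
        c8[6] = a8[6] * b8[6];
        c8[7] = a8[7] * b8[7];
        memcpy(c+k, c8, sizeof c8);
    }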

wildplasser
  • For me it gives a much smaller gain - only a 2% increase over the OP's vanilla version (identical results with both 4- and 8-fold unrolling). – librin.so.1 Aug 10 '13 at 14:54
  • 1
    My gain disappears when I turn up optimisation. GCC appears to unroll the loop automatically, and it seems to massage the cache in some way, too. – wildplasser Aug 10 '13 at 15:32