No speedup for vector sums with threading

Question

I have a C++ program which basically performs some matrix calculations. For these I use LAPACK/BLAS and usually link to the MKL or ACML depending on the platform. A lot of these matrix calculations operate on different independent matrices and hence I use std::thread's to let these operations run in parallel. However, I noticed that I have no speed-up when using more threads. I traced the problem down to the daxpy Blas routine. It seems that if two threads are using this routine in parallel each thread takes twice the time, even though the two threads operate on different arrays.

The next thing I tried was writing a new simple method to perform vector additions to replace the daxpy routine. With one thread this new method is as fast as the BLAS routine, but, when compiling with gcc, it suffers from the same problems as the BLAS routine: doubling the number of threads running parallel also doubles the amount of time each threads needs, so no speed-up is gained. However, using the Intel C++ Compiler this problems vanishes: with increasing number of threads the time a single thread needs is constant.

However, I need to compile as well on systems where no Intel compiler is available. So my questions are: why is there no speed-up with the gcc and is there any possibility of improving the gcc performance?

I wrote a small program to demonstrate the effect:

// $(CC) -std=c++11 -O2 threadmatrixsum.cpp -o threadmatrixsum -pthread

#include <iostream>
#include <thread>
#include <vector>

#include "boost/date_time/posix_time/posix_time.hpp"
#include "boost/timer.hpp"

void simplesum(double* a, double* b, std::size_t dim);

int main() {

    for (std::size_t num_threads {1}; num_threads <= 4; num_threads++) {
        const std::size_t N { 936 };

        std::vector <std::size_t> times(num_threads, 0);    

        auto threadfunction = [&](std::size_t tid)
        {
            const std::size_t dim { N * N };
            double* pA = new double[dim];
            double* pB = new double[dim];

            for (std::size_t i {0}; i < N; ++i){
                pA[i] = i;
                pB[i] = 2*i;
            }   

            boost::posix_time::ptime now1 = 
                boost::posix_time::microsec_clock::universal_time();    

            for (std::size_t n{0}; n < 1000; ++n){
                simplesum(pA, pB, dim);
            }

            boost::posix_time::ptime now2 = 
                boost::posix_time::microsec_clock::universal_time(); 
            boost::posix_time::time_duration dur = now2 - now1; 
            times[tid] += dur.total_milliseconds(); 
            delete[] pA;
            delete[] pB;
        };

        std::vector <std::thread> mythreads;

        // start threads
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads.emplace_back(threadfunction, n);
        }

        // wait for threads to finish
        for (std::size_t n {0} ; n < num_threads; ++n)
        {
            mythreads[n].join();
            std::cout << " Thread " << n+1 << " of " << num_threads 
                      << "  took " << times[n]<< "msec" << std::endl;
        }
    }
}

void simplesum(double* a, double* b, std::size_t dim){

    for(std::size_t i{0}; i < dim; ++i)
    {*(++a) += *(++b);}
}

The outout with gcc:

Thread 1 of 1  took 532msec
Thread 1 of 2  took 1104msec
Thread 2 of 2  took 1103msec
Thread 1 of 3  took 1680msec
Thread 2 of 3  took 1821msec
Thread 3 of 3  took 1808msec
Thread 1 of 4  took 2542msec
Thread 2 of 4  took 2536msec
Thread 3 of 4  took 2509msec
Thread 4 of 4  took 2515msec

The outout with icc:

Thread 1 of 1  took 663msec
Thread 1 of 2  took 674msec
Thread 2 of 2  took 674msec
Thread 1 of 3  took 681msec
Thread 2 of 3  took 681msec
Thread 3 of 3  took 681msec
Thread 1 of 4  took 688msec
Thread 2 of 4  took 689msec
Thread 3 of 4  took 687msec
Thread 4 of 4  took 688msec

So, with the icc the time needed for one thread perform the computations is constant (as I would have expected; my CPU has 4 physical cores) and with the gcc the time for one thread increases. Replacing the simplesum routine by BLAS::daxpy yields the same results for icc and gcc (no surprise, as most time is spent in the library), which are almost the same as the above stated gcc results.

FYI, I see similar results as GCC (using the code provided) under VS2013. — Moo-Juice, May 08 '14 at 19:46
It may seem too obvious, but... do you actually have multiple CPUs? Also, how big are the typical matrices? You might be running into cache contention issues... — twalberg, May 08 '14 at 19:52
I'm curious: having you tried comparing the assembly generated by each compiler? In particular: does icc eliminate everything but the calls for timing/console IO? A sufficiently intelligent optimizer can see that your allocations/operations have no observable effects. — Nathan Ernst, May 08 '14 at 20:05

danielschemmel · Accepted Answer · 2014-05-09T18:30:38.303

The answer is fairly simple: Your threads are fighting for memory bandwidth!

Consider that you perform one floating point addition per 2 stores (one initialization, one after the addition) and 2 reads (in the addition). Most modern systems providing multiple cpus actually have to share the memory controller among several cores.

The following was run on a system with 2 physical CPU sockets and 12 cores (24 with HT). Your original code exhibits exactly your problem:

Thread 1 of 1  took 657msec
Thread 1 of 2  took 1447msec
Thread 2 of 2  took 1463msec
[...]
Thread 1 of 8  took 5516msec
Thread 2 of 8  took 5587msec
Thread 3 of 8  took 5205msec
Thread 4 of 8  took 5311msec
Thread 5 of 8  took 2731msec
Thread 6 of 8  took 5545msec
Thread 7 of 8  took 5551msec
Thread 8 of 8  took 4903msec

However, by simply increasing the arithmetic density, we can see a significant increase in scalability. To demonstrate, I changed your addition routine to also perform an exponentiation: *(++a) += std::exp(*(++b));. The result shows almost perfect scaling:

Thread 1 of 1  took 7671msec
Thread 1 of 2  took 7759msec
Thread 2 of 2  took 7759msec
[...]
Thread 1 of 8  took 9997msec
Thread 2 of 8  took 8135msec
Thread 3 of 8  took 10625msec
Thread 4 of 8  took 8169msec
Thread 5 of 8  took 10054msec
Thread 6 of 8  took 8242msec
Thread 7 of 8  took 9876msec
Thread 8 of 8  took 8819msec

But what about ICC?

First, ICC inlines simplesum. Proving that inlining happens is simple: Using icc, I have disable multi-file interprocedural optimization and moved simplesum into its own translation unit. The difference is astonishing. The performance went from

Thread 1 of 1  took 687msec
Thread 1 of 2  took 688msec
Thread 2 of 2  took 689msec
[...]
Thread 1 of 8  took 690msec
Thread 2 of 8  took 697msec
Thread 3 of 8  took 700msec
Thread 4 of 8  took 874msec
Thread 5 of 8  took 878msec
Thread 6 of 8  took 874msec
Thread 7 of 8  took 742msec
Thread 8 of 8  took 868msec

To

Thread 1 of 1  took 1278msec
Thread 1 of 2  took 2457msec
Thread 2 of 2  took 2445msec
[...]
Thread 1 of 8  took 8868msec
Thread 2 of 8  took 8434msec
Thread 3 of 8  took 7964msec
Thread 4 of 8  took 7951msec
Thread 5 of 8  took 8872msec
Thread 6 of 8  took 8286msec
Thread 7 of 8  took 5714msec
Thread 8 of 8  took 8241msec

This already explains why the library performs badly: ICC cannot inline it and therefore no matter what else causes ICC to perform better than g++, it will not happen.

It also gives a hint as to what ICC might be doing right here... What if instead of executing simplesum 1000 times, it interchanges the loops so that it

Loads two doubles
Adds them 1000 times (or even performs a = 1000 * b)
Stores two doubles

This would increase arithmetic density without adding any exponentials to the function... How to prove this? Well, to begin let us simply implement this optimization and see what happens! To analyse, we will look at the g++ performance. Recall our benchmark results:

Thread 1 of 1  took 640msec
Thread 1 of 2  took 1308msec
Thread 2 of 2  took 1304msec
[...]
Thread 1 of 8  took 5294msec
Thread 2 of 8  took 5370msec
Thread 3 of 8  took 5451msec
Thread 4 of 8  took 5527msec
Thread 5 of 8  took 5174msec
Thread 6 of 8  took 5464msec
Thread 7 of 8  took 4640msec
Thread 8 of 8  took 4055msec

And now, let us exchange

for (std::size_t n{0}; n < 1000; ++n){
    simplesum(pA, pB, dim);
}

with the version in which the inner loop was made the outer loop:

double* a = pA; double* b = pB;
for(std::size_t i{0}; i < dim; ++i, ++a, ++b)
{
    double x = *a, y = *b;
    for (std::size_t n{0}; n < 1000; ++n)
    {
        x += y;
    }
    *a = x;
}

The results show that we are on the right track:

Thread 1 of 1  took 693msec
Thread 1 of 2  took 703msec
Thread 2 of 2  took 700msec
[...]
Thread 1 of 8  took 920msec
Thread 2 of 8  took 804msec
Thread 3 of 8  took 750msec
Thread 4 of 8  took 943msec
Thread 5 of 8  took 909msec
Thread 6 of 8  took 744msec
Thread 7 of 8  took 759msec
Thread 8 of 8  took 904msec

This proves that the loop interchange optimization is indeed the main source of the excellent performance ICC exhibits here.

Note that none of the tested compilers (MSVC, ICC, g++ and clang) will replace the loop with a multiplication, which improves performance by 200x in the single threaded and 15x in the 8-threaded cases. This is due to the fact that the numerical instability of the repeated additions may cause wildly differing results when replaced with a single multiplication. When testing with integer data types instead of floating point data types, this optimization happens.

How can we force g++ to perform this optimization?

Interestingly enough, the true killer for g++ is not an inability to perform loop interchange. When called with -floop-interchange, g++ can perform this optimization as well. But only when the odds are significantly stacked into its favor.

Instead of std::size_t all bounds were expressed as ints. Not long, not unsigned int, but int. I still find it hard to believe, but it seems this is a hard requirement.
Instead of incrementing pointers, index them: a[i] += b[i];
G++ needs to be told -floop-interchange. A simple -O3 is not enough.

When all three criteria are met, the g++ performance is similar to what ICC delivers:

Thread 1 of 1  took 714msec
Thread 1 of 2  took 724msec
Thread 2 of 2  took 721msec
[...]
Thread 1 of 8  took 782msec
Thread 2 of 8  took 1221msec
Thread 3 of 8  took 1225msec
Thread 4 of 8  took 781msec
Thread 5 of 8  took 788msec
Thread 6 of 8  took 1262msec
Thread 7 of 8  took 1226msec
Thread 8 of 8  took 820msec

Note: The version of g++ used in this experiment is 4.9.0 on a x64 Arch linux.

Except in his original code, they all perform *much* faster than the "scaling" code :) — Moo-Juice, May 08 '14 at 20:09
Thanks for the reply, I thought it might have to do with memory bandwidth. What I tried is using openmp to parallelize the for loop in the simplesum method. Using just one thread in the main program and increasing the number of threads for this loop I observed a good speed-up. So, my interpretation was: if the problem has to do with parallel memory access one should see this with the parallelized for loop as well, because it accesses the same memory. However, that was not the case. — Fabian, May 08 '14 at 21:04
@Fabian I have had a look at the binary icc produces and added an explanation for why icc performs better here. — danielschemmel, May 08 '14 at 21:14
@gha.st Thanks for the excellent answer. Well, this explains why icc behaves better for this example. I understand that one can exchange the two loops in this example. Unfortunately that is not an option for my real code, because here the only purpose of the outer loop was to increase the runtime. In my actual program there is no such a simple outer loop :-( — Fabian, May 08 '14 at 21:32
Wish I could upvote this more, fantastic answer. If you look at the massively popular question & answer on Branch Prediction, ICC also uses this trick to optimise away the problem. — Moo-Juice, May 08 '14 at 21:40
Source: http://stackoverflow.com/questions/11227809/why-is-processing-a-sorted-array-faster-than-an-unsorted-array?lq=1 — Moo-Juice, May 08 '14 at 21:46
@Fabian Ha! I found out how to trick g++ into performing loop interchange, maybe this can help you (although it seems far-fetched). — danielschemmel, May 08 '14 at 22:25
Well, I would be interested just out of curiosity. For my actual problem I cannot rely on the loop interchange so I came up with a workaround speeding things up a lot (see my answer). Thanks a lot for your efforts, your input really helped me a lot :-) — Fabian, May 09 '14 at 17:47

score 0 · Answer 2 · answered May 09 '14 at 17:41

Ok, I came to the conclusion that the main problem is that the processor acts on different parts of the memory in parallel and hence I assume that one has to deal with lots of cache misses which slows the process further down. Putting the actual sum function in a critical section

summutex.lock();
simplesum(pA, pB, dim);
summutex.unlock();

solves the problem of the cache missses, but of course does not yield optimal speed-up. Anyway, because now the other threads are blocked the simplesum method might as well use all available threads for the sum

void simplesum(double* a, double* b, std::size_t dim, std::size_t numberofthreads){

    omp_set_num_threads(numberofthreads);
    #pragma omp parallel 

    {
    #pragma omp for
    for(std::size_t i = 0; i < dim; ++i)
    {
      a[i]+=b[i];

    }
    }
}

In this case all the threads work on the same chunk on memory: it should be in the processor cache and if the processor needs to load some other parts of the memory into its cache the other threads benefit from this all well (depending whether this is L1 or L2 cache, but I reckon the details do not really matter for the sake of this discussion).

I don't claim that this solution is perfect or anywhere near optimal, but it seems to work much better than the original code. And it does not rely on some loop switching tricks which I cannot do in my actual code.

No speedup for vector sums with threading

2 Answers2

But what about ICC?

How can we force g++ to perform this optimization?