8

I have what seems to be a very simple parallel for loop, which just writes zeros to an integer array. But it turns out that the more threads I use, the slower the loop gets. I thought this was due to some cache thrashing, so I played around with schedules, chunk sizes, __restrict__, nesting the parallel for inside a parallel block, and flushes. Then I noticed that reading the array to do a reduction is also slower.

This should obviously be very simple and should speed up nearly linearly. What am I missing here?

Full code:

#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>

void tic(), toc();

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads()<< std::endl;

    std::cout << "serial reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    toc();

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        double sum = 0;
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    toc();

    std::cout << "serial memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }
    toc();

    std::cout << "parallel memset" << std::endl;
    tic();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    toc();

    return 0;
}

static clock_t ClockCounter;

void tic()
{
    ClockCounter = std::clock();
}

void toc()
{
    ClockCounter = std::clock() - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}

Running this yields:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 1790000
parallel reduction
  elapsed clock ticks: 19690000
serial memset
  elapsed clock ticks: 3860000
parallel memset
  elapsed clock ticks: 20800000

If I compile with -O2, g++ is able to optimize the serial reduction away and I get zero time, hence -O1. Also, calling omp_set_num_threads(1); makes the times more similar, although there are still some differences:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 1
serial reduction
  elapsed clock ticks: 1770000
parallel reduction
  elapsed clock ticks: 7370000
serial memset
  elapsed clock ticks: 2290000
parallel memset
  elapsed clock ticks: 3550000

I feel like I'm missing something very elementary here. My CPU is an Intel(R) Xeon(R) CPU E5-2640 0 @ 2.50GHz with hyperthreading, but the same behavior is observed on a colleague's i5 with 4 cores and no hyperthreading. We're both running Linux.

EDIT

It seems that part of the error was on the side of the timing; running with:

static double ClockCounter;

void tic()
{
    ClockCounter = omp_get_wtime();//std::clock();
}

void toc()
{
    ClockCounter = omp_get_wtime()/*std::clock()*/ - ClockCounter;
    std::cout << "\telapsed clock ticks: " << ClockCounter << std::endl;
}

yields more "reasonable" times:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 1.80974
parallel reduction
  elapsed clock ticks: 2.07367
serial memset
  elapsed clock ticks: 2.37713
parallel memset
  elapsed clock ticks: 2.23609

But still, there is no speedup; it is just not slower anymore.

EDIT2:

As suggested by user8046, the code is heavily memory bound, and as suggested by Z boson, the serial code is easily optimized away, so it is not certain what is being measured here. I therefore made a small change: I moved sum outside of the loop, so that it is not zeroed out at every iteration over c, replaced the reduction operation with sum += F(vec[i]), and replaced the memset operation with vec[i] = F(i) (the modified loops are sketched at the end of this edit). Running as:

g++ omp_test.cpp -o omp_test --ansi -pedantic -fopenmp -O1 -D"F(x)=sqrt(double(x))"
./omp_test
max threads: 12
serial reduction
  elapsed clock ticks: 23.9106
parallel reduction
  elapsed clock ticks: 3.35519
serial memset
  elapsed clock ticks: 43.7344
parallel memset
  elapsed clock ticks: 6.50351

Calculating the square root adds more work per element and there is finally some reasonable speedup (about 7x, which makes sense, as the hyperthreaded logical cores share execution resources).
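
For reference, the modified loops look roughly like this (a sketch rather than the exact code; F is the macro supplied via -D on the command line, and <cmath> is included for sqrt):

double sum = 0; // declared once, outside the loop over c
tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for default(none) reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += F(ptr[i]);
}
toc();
std::cout << "sum: " << sum << std::endl; // keep the result observable

tic();
for(int c = 0; c < COUNT; ++ c) {
    #pragma omp parallel for default(none)
    for(int i = 0; i < sz_i; ++ i)
        ptr[i] = F(i); // implicit double -> int conversion
}
toc();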

the swine
  • 10,713
  • 7
  • 58
  • 100
  • How is a reduction embarrassingly parallel? There is a (by necessity) lock on the sum variable. – IdeaHat Sep 02 '14 at 13:35
  • @MadScienceDreams You are right, when I wrote the question title, I was only experimenting with (embarrassingly parallel) writes to an array, which the title is referring to. The reduction experiment occurred to me later. But still, the reduction in double can be implemented as a per-thread for loop with a private accumulator (which processes a couple thousand elements) and then a tree-like or serial reduction of the per-thread partial sums (there are only 12 of them); a sketch of this idea appears after these comments. This is nearly embarrassingly parallel (not sure if the compiler will implement it that way, though - I know I could). – the swine Sep 02 '14 at 13:50
  • @HighPerformanceMark True, well spotted. I only used `clock()` instead of some more precise function to make it more simple ... fail. – the swine Sep 02 '14 at 14:27
  • @HighPerformanceMark I tried that before (the question says "I tried ... nesting the parallel for inside a parallel block") and I tried it again now, but the timing is the same - the times are roughly equal :(. – the swine Sep 02 '14 at 14:51
  • 1
    I get different results from yours on `g++` ver `4.8.2-19ubuntu1` and Xeon E5540. "parallel reduction" is 2x as fast as "serial reduction" – Brian Cain Sep 02 '14 at 15:04
  • @BrianCain and what about the parallel memset? How fast is that? – the swine Sep 02 '14 at 15:08
  • @HighPerformanceMark no worries, many thanks for the effort of trying to help me. – the swine Sep 02 '14 at 15:09
  • 1
    @theswine, it's Mildly faster. `max threads: 16 serial reduction elapsed clock ticks: 3.81508 parallel reduction elapsed clock ticks: 1.91143 serial memset elapsed clock ticks: 4.37205 parallel memset elapsed clock ticks: 2.95767` – Brian Cain Sep 02 '14 at 15:14
  • Define `sum` outside of the outer loop then use printf/cout to show the sum so that GCC does not optimize it away. Then use -O3 and -ffast-math. GCC won't vectorize reductions unless you use `-ffast-math`. This allows the floating point arithmetic to be associative. But using OpenMP on a reduction already breaks IEEE anyway. – Z boson Sep 02 '14 at 17:50
  • In your EDIT2 you changed your problem. user8046 is wrong to claim "For the memset: you can saturate the memory with one thread, you will never see a speedup there." even on a single socket system. A single thread will not saturate the bandwidth. – Z boson Sep 03 '14 at 08:59
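
A sketch of the manual reduction described in the comment above (a hypothetical illustration of the idea, reusing ptr and sz_i from the question; an actual OpenMP runtime may or may not implement reduction(+:sum) this way):

double sum = 0;
#pragma omp parallel
{
    double partial = 0; // private per-thread accumulator, no locking in the hot loop
    #pragma omp for nowait
    for(int i = 0; i < sz_i; ++ i)
        partial += ptr[i];
    #pragma omp atomic
    sum += partial; // one atomic add per thread instead of one lock per element
}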

3 Answers

8

You spotted the timing error. There is still no speedup because both of your test cases are heavily memory bound. On typical consumer hardware all of the cores share one memory bus, so using more threads does not give you more bandwidth, and since bandwidth is the bottleneck, there is no speedup. This will probably change if you reduce your problem size so that it fits into the cache, or for sure if you increase the number of calculations per data element, for example if you calculate the reduction of exp(vec[i]) or 1/vec[i].

For the memset: you can saturate the memory with one thread, so you will never see a speedup there (only if you have access to a second memory bus with more threads, as on some multi-socket architectures). One remark regarding the reduction: it is most probably not implemented with a lock, which would be horribly inefficient, but with an addition tree, which gives a not-so-bad logarithmic speedup.

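For illustration, a reduction with more arithmetic per element might look like this (a sketch rather than a benchmark, reusing ptr and sz_i from the question):

double sum = 0;
#pragma omp parallel for reduction(+:sum)
for(int i = 0; i < sz_i; ++ i)
    sum += 1.0 / ptr[i]; // one division per loaded element, so compute rather than bandwidth dominates
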
fweik
  • 441
  • 3
  • 5
  • An interesting thought ... so instead of `sum += vec[i]` I wrote `sum += sin(exp(sqrt(double(vec[i]))))` and similarly instead of `vec[i] = 0` I put `sin(exp(sqrt(double(i))))`, the type cast making sure that the "normal" versions of the math functions are used, and not the integer ones. This changes the times as follows: `serial reduction: 17.9 sec, parallel reduction: 99.4 sec, serial memset: 161.4 sec, parallel memset: 38.6 sec`. So I suppose you're right, at least as far as the memset is concerned. Will have to think about why the reduction got so much slower all of a sudden. Any thoughts? – the swine Sep 02 '14 at 15:31
  • 2
    @theswine, I should've mentioned that my E5540 is NUMA and that's why I probably saw a speedup. – Brian Cain Sep 02 '14 at 15:43
  • I still believe that you have to take good care with optimization. It is still pretty easy to spot that the value of sum only depends on the last loop run. If I reset the sum variable only at the beginning, I get more reasonable results: max threads: 8; serial reduction: 12.4259; parallel reduction: 1.48168; serial memset: 1.89196; parallel memset: 1.56518. That's on Sandy Bridge. I wrote the total sum to stdout so it can't get optimized away, and used -O3. – fweik Sep 02 '14 at 15:50
  • Your statement "For the memset: you can saturate the memory with one thread, you will never see a speedup there." is wrong even on a single socket system. A single thread will not saturate the bandwidth. This is a common misconception. – Z boson Sep 03 '14 at 09:00
4

Besides your error in using the clock function on Linux, the rest of your question can be answered by reading these questions/answers:

in-an-openmp-parallel-code-would-there-be-any-benefit-for-memset-to-be-run-in-p/11579987

measuring-memory-bandwidth-from-the-dot-product-of-two-arrays

memset-in-parallel-with-threads-bound-to-each-physical-core

So you should see a significant benefit from multiple threads with memset and with a reduction, even on a single-socket system. I have written my own tool to measure bandwidth for this. You can find some of the results from my i5-4250U (Haswell) with 2 cores below (GCC 4.8, Linux 3.13, EGLIBC 2.19), running over 1 GB. vsum is your reduction. Notice that there is a significant improvement even on this two-core system.

one thread

C standard library
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.80       6.68       0.00        inf %
memcpy:              1.00       1.35       7.93       0.00        inf %

Agner Fog's asmlib
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.71       7.53       0.00        inf %
memcpy:              1.00       0.93      11.51       0.00        inf %

my_memset   
                     0.50       0.71       7.53       0.00        inf %


FMA3 reduction tests
                       GB    time(s)       GB/s     GFLOPS   efficiency
vsum:                0.50       0.53      10.08       2.52        inf %
vmul:                0.50       0.68       7.93       1.98        inf %
vtriad:              0.50       0.70       7.71       3.85        inf %
dot                  1.00       1.08       9.93       2.48        inf %

two threads

C standard library
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.64       8.33       0.00        inf %
memcpy:              1.00       1.10       9.76       0.00        inf %

Agner Fog's asmlib
                       GB    time(s)       GB/s     GFLOPS   efficiency
memset:              0.50       0.36      14.98       0.00        inf %
memcpy:              1.00       0.66      16.30       0.00        inf %

my_memset
                     0.50       0.36      15.03       0.00        inf %


FMA3 tests
standard sum tests with OpenMP: 2 threads
                       GB    time(s)       GB/s     GFLOPS   efficiency
vsum:                0.50       0.41      13.03       3.26        inf %
vmul:                0.50       0.39      13.67       3.42        inf %
vtriad:              0.50       0.44      12.20       6.10        inf %
dot                  1.00       0.97      11.11       2.78        inf %

Here is my custom memset function (I have several other tests like this).

void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i=0; i<n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}
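
A minimal, hypothetical call site for the sketch above (it assumes s is 16-byte aligned and n is a multiple of 4; a real replacement such as the asmlib handles the general case):

std::vector<int> vec(250000 * 200, 1);
my_memset(vec.data(), 0, vec.size()); // a large vector's data() is 16-byte aligned with glibc on x86-64
_mm_sfence();                         // order the non-temporal stores before the data is read again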

Edit:

You should compile with -O3 and -ffast-math. Define the sum outside of the outer loop and then print it out so GCC does not optimize it away. GCC won't auto-vectorize a reduction because floating point arithmetic is not associative and vectorizing the loop could break IEEE floating point rules. Using -ffast-math allows floating point arithmetic to be treated as associative, which lets GCC vectorize the reduction. It should be pointed out that doing a reduction in OpenMP already assumes the floating point arithmetic is associative, so it already breaks IEEE floating point rules.

double sum = 0;
tic();
for(int c = 0; c < COUNT; ++ c) { 
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < sz_i; ++ i)
        sum += ptr[i];
}
toc();
printf("sum %f\n", sum);

Edit:

I tested your code and made some modifications. I get faster times for both the reduction and the memset using multiple threads:

max threads: 4
serial reduction
dtime 1.86, sum 705032704
parallel reduction
dtime 1.39 s, sum 705032704
serial memset
dtime 2.95 s
parallel memset
dtime 2.44 s
serial my_memset
dtime 2.66 s
parallel my_memset
dtime 1.35 s

Here is the code I used (compiled with g++ foo.cpp -fopenmp -O3 -ffast-math):

#include <omp.h>
#include <vector>
#include <iostream>
#include <ctime>
#include <stdio.h>

#include <emmintrin.h> // SSE2 intrinsics: __m128i, _mm_set1_epi32, _mm_stream_si128

void my_memset(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    for(i=0; i<n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

void my_memset_omp(int *s, int c, size_t n) {
    int i;
    __m128i v = _mm_set1_epi32(c);
    #pragma omp parallel for
    for(i=0; i<n/4; i++) {
        _mm_stream_si128((__m128i*)&s[4*i], v);
    }
}

int main(int argc, const char *argv[])
{
    const int COUNT = 100;
    const size_t sz = 250000 * 200;
    std::vector<int> vec(sz, 1);

    std::cout << "max threads: " << omp_get_max_threads()<< std::endl;

    std::cout << "serial reduction" << std::endl;
    double dtime;
    int sum; // note: the true total (100 * 50,000,000) exceeds INT_MAX, hence the wrapped value 705032704 above

    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            sum += vec[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f, sum %d\n", dtime, sum);

    int *const ptr = vec.data();
    const int sz_i = int(sz); // some OpenMP implementations only allow parallel for with int

    std::cout << "parallel reduction" << std::endl;


    dtime = -omp_get_wtime();
    sum = 0;
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none) reduction(+:sum)
        for(int i = 0; i < sz_i; ++ i)
            sum += ptr[i];
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s, sum %d\n", dtime, sum);

    std::cout << "serial memset" << std::endl;

    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        for(size_t i = 0; i < sz; ++ i)
            vec[i] = 0;
    }   
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) {
        #pragma omp parallel for default(none)
        for(int i = 0; i < sz_i; ++ i)
            ptr[i] = 0;
    }
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "serial my_memset" << std::endl;

    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset(ptr, 0, sz_i);

    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    std::cout << "parallel my_memset" << std::endl;
    dtime = -omp_get_wtime();
    for(int c = 0; c < COUNT; ++ c) my_memset_omp(ptr, 0, sz_i);
    dtime += omp_get_wtime();
    printf("dtime %.2f s\n", dtime);

    return 0;
}
Z boson
  • 32,619
  • 11
  • 123
  • 226
  • 1
    This does not really answer my question. I know it **should** be faster, as you say, but I don't know why it **isn't**. – the swine Sep 02 '14 at 16:35
  • @theswine, I understand, you will have to give me a little time, I don't know if I can debug your code today. But if you dig through those question/answers I linked you will probably find the answer. – Z boson Sep 02 '14 at 16:40
  • @theswine, I add some text to the end of my answer that may explain why you're not seeing an improvement. I'll try and look more into this later. – Z boson Sep 02 '14 at 18:01
  • @theswine,okay, I ran your code and made a few changes. See my updated answer. I get faster times with the parallel versions. Test on your system and see. – Z boson Sep 02 '14 at 19:17
  • thanks for trying my code, I see that you get similar results as I do. The SSE version is nice, although it requires data alignment and as such is not really universally portable with std::vector as storage. So why does SSE make it faster? I understand speedup of SSE, compared to no SSE, but I don't understand why it gets more speedup when parallelized. It is essentially the same code, no? Unless of course the 128 B word matches the cache line. What would happen if you would parallelize the non-SSE code with `schedule(static, 4)`? – the swine Sep 03 '14 at 08:14
  • @theswine, why do you say we get similar results? After you fixed your timer, your parallel reduction was still slower. Mine is much faster. My parallel memset (not `my_memset`) is also significantly faster. I tested this on my single-socket, 2-core Haswell system. Did you test the code I posted? Can you update your question with the results? – Z boson Sep 03 '14 at 08:38
  • 1
    @theswine, as to SSE, this has nothing to do with using SSE. GLIBC (or EGLIBC) likely uses SSE anyway. This is about temporal vs. non-temporal stores. GLIBC is doing temporal stores using `_mm_store_ps` and `my_memset` is doing non-temporal stores using `_mm_stream_ps`. Temporal stores have to read in order to write; non-temporal stores only write. Use non-temporal stores when you have a large amount of data which does not fit in the cache. – Z boson Sep 03 '14 at 08:41
  • well, I would not say that 2.4 seconds is much improvement over 2.9 seconds, it is only about 1.2x faster. – the swine Sep 03 '14 at 08:41
  • I see. Is there a portable "C"/C++ way to perform non-temporal store on a single `int`, which would still be faster? – the swine Sep 03 '14 at 08:42
  • @theswine, `my_memset` is not meant to be a replacement for `memset`. It's to demonstrate the potential using non-temporal stores. If you want a replacement for `memset` use Agner Fog's asmlib or write your own. To fix `my_memset` you have to check for alignment and adjust if not aligned, then check the size to decide on temporal or non-temporal stores, and finally cleanup for sizes which are not a multiple of the SIMD width. If you look at the source code of the asmlib you will see it does all this. – Z boson Sep 03 '14 at 08:44
  • I'm not trying to implement memset, this is a part of more complicated image processing / computer vision code implemented by a colleague of mine. This simple test was what the complicated code boiled down to and at the time it seemed strange that there was no speedup. – the swine Sep 03 '14 at 08:46
  • @theswine, well there is a speedup now. Your serial reduction was 1.8 and parallel was 2.1. On my system with my code (using -O3 and -ffast-math) I get 1.86 s for the serial reduction and 1.39 s with the parallel reduction. That's not a huge improvement but it's significant. Most people think you can't get any improvement with multiple threads when it's memory bound but that's not correct. But if you want to see a big improvement you have to run on a multi-socket system. – Z boson Sep 03 '14 at 08:54
  • Thanks for all the effort, I think I now understand what was wrong. Using `-ffast-math` is out of a question, as there are some computations going on, and we would probably need to carefully separate the code to different files and compile them differently. My pain now is that I can't accept both your answer and user8046's. I'm going to accept his answer in order to motivate him as a fresh user, but I appreciate your answer as well. Thank you. – the swine Sep 03 '14 at 08:59
  • @theswine, no problem. In this case `-ffast-math` makes little difference. It's `-O` vs. `-O3` that makes the difference. – Z boson Sep 03 '14 at 09:02
1

You are using std::clock, which reports CPU time used, not real (wall-clock) time. With multiple threads, each thread's CPU time is added up, so the total will always be higher than the single-threaded time (due to overhead).

http://en.cppreference.com/w/cpp/chrono/c/clock
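
A minimal sketch showing the difference: time the same parallel loop with both clocks and compare (compile with -fopenmp; on a multi-core machine the CPU time will be several times the wall time).

#include <omp.h>
#include <ctime>
#include <cstdio>

int main()
{
    const int N = 100000000;
    double sum = 0;
    std::clock_t c0 = std::clock(); // CPU time, summed over all threads on Linux
    double w0 = omp_get_wtime();    // wall-clock time
    #pragma omp parallel for reduction(+:sum)
    for(int i = 0; i < N; ++ i)
        sum += i * 0.5;
    std::printf("cpu %.2f s, wall %.2f s, sum %.0f\n",
        double(std::clock() - c0) / CLOCKS_PER_SEC,
        omp_get_wtime() - w0, sum);
    return 0;
}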

dohashi
  • 1,771
  • 8
  • 12