Why is OpenMP slow in this case?

Question

I am trying to understand why OpenMP works the way it does in the following example.

#include <omp.h>
#include <iostream>
#include <vector>
#include <stdlib.h>

void AddVectors (std::vector< double >& v1,
                 std::vector< double >& v2) {

    size_t i;

#pragma omp parallel for private(i)
    for (i = 0; i < v1.size(); i++) v1[i] += v2[i];

}


int main (int argc, char** argv) {

    size_t N1 = atoi(argv[1]);

    std::vector< double > v1(N1,1);
    std::vector< double > v2(N1,2);

    for (size_t i = 0; i < N1; i++) AddVectors(v1,v2);

    return 0;

}

I first compiled the code above without enabling OpenMP (by omitting -fopenmp on the compiling flags). The execution time for N1 = 10000 was 0.1s. Enabling OpenMP makes the execution time go beyond 1 minute. I stopped it before it was done (got tired of waiting...).

I am compiling the code as below:

g++ -std=c++0x -O3 -funroll-loops -march=core2 -fomit-frame-pointer -Wall -fno-strict-aliasing -o main.o -c main.cpp

g++ main.o -o main

Not all these flags are necessary here but I'm using them on the project I'm trying to parallelize and I use those flags there. That's why I decided to leave them here. Also, I add -fopenmp to enable OpenMP on the compilation.

Does anybody know what's going wrong? Thank you!

Just a guess, but: have you tried storing the loop maximum (`v1.size()`) in a temporary variable before the loop, and using that variable in the for-loop clause? Maybe the compiler can't see that the return value from size() doesn't change for some reason. — pmdj, Mar 22 '11 at 19:59
Heh, that was my guess too, but I've tried it and it's the same. — Damon, Mar 22 '11 at 20:09
Thank you for the reply. Unfortunately that made no difference here.. — Diego, Mar 22 '11 at 20:19
your code now looks weird. N1 as the length of the vector and the number of runs of the loop? Is this intentional? — Ronny Brendel, Mar 22 '11 at 20:21
Which sizes of vectors have you tried? You could also try to change the reference into pointer. I don't know if it makes a difference, but I'd try. — Ronny Brendel, Mar 22 '11 at 20:21
Have you tried experimenting with the various OpenMP environment flags? — pmdj, Mar 22 '11 at 20:21
Can't reproduce this, works fine here (with N1=100000, 17s without openmp, 4.7s with Core2 Quad). What version of GCC do you have? What type of CPU — Mat, Mar 22 '11 at 20:26
Ronny: I had a bug before -- i was setting N1 and N2 to atoi(argv[1]) so I ditched N2. Looks weird, but does not change my results. — Diego, Mar 22 '11 at 20:27
pmjordan: all I play with is OMP_NUM_THREADS... When set to 1, I get the same time as without -fopenmp (duh!). When set to 4 (I have a four core cpu), it takes forever to finish... — Diego, Mar 22 '11 at 20:29
Mat: g++ (Ubuntu 4.4.3-4ubuntu5) 4.4.3; Intel(R) Core(TM) i5 CPU M 450 — Diego, Mar 22 '11 at 20:29
BTW, This thread seems like it could help: http://stackoverflow.com/questions/870952/openmp-terrible-performance-a-simple-issue-of-overhead-or-is-there-a-program — sinelaw, Mar 22 '11 at 20:33
Well, WorksForMe(tm) with G++ 4.5.1 (Gentoo). Irrelevant: there's a `` header for C++, I think it's more "appropriate" than ``. — Mat, Mar 22 '11 at 20:38
Mat: Thanks, I knew that ;) -- that was just junk I forgot to remove before posting this. — Diego, Mar 22 '11 at 20:40

score 2 · Answer 1 · answered Mar 30 '13 at 14:50

I have tried the same example on Visual Studio 2008. I did two modification to your code example, and it runs roughly 3 times faster with OpenMP, than without OpenMP.

Without being able to confirm it on GCC, the problem might be in main loop the function AddVectors is called, and each time it has to perform a "fork" operation, and this will take some measurable time. So if you have N1 = 10000, it has to spawn 10000 "fork" operations.

I have attached your own code snippet modified only to make it work under Visual Studio, and I added a print statement in the end to avoid the compiler removing all the code.

#include <omp.h>
#include <iostream>
#include <vector>
#include <stdlib.h>

void AddVectors (std::vector< double >& v1,
                 std::vector< double >& v2) {

    #pragma omp parallel for
    for (int i = 0; i < static_cast<int>(v1.size()); i++) v1[i] += v2[i];

}


int main (int argc, char** argv) {

    size_t N1 = atoi(argv[1]);

    std::vector< double > v1(N1,1);
    std::vector< double > v2(N1,2);

    for (size_t i = 0; i < N1; i++) AddVectors(v1,v2);


    printf("%g\n",v1[0]);
    return 0;

}

DOOM · Answer 2 · 2013-04-02T12:47:18.057

The problem is with array type you are using.

A vector is a container. It is a structure which stores several information like size, begin, end etc; and has several in-build function, where operator [] is one of them used to access the data. As a result the cache lines which tend to load say for index "i" of the vector V, loads the element V[i] and some information which is not being used in the code.

On the contrary, if you use classical arrays (dynamic/static), the operator [] results in loading only the data elements. As a result a cache line (usually 64 bytes long) will load 8 elements of this double array (size of double = 8 bytes).

See difference between _mm_malloc and malloc for enhancing data alignment.

@Mr Fooz I am not sure about that. Lets compare the performance results for both the cases:

4 Threads on i7 processor

Array Time Taken.: 0.122007 | Repeat: 4 | MFlops: 327.85

Vector Time Taken: 0.101006 | Repeat: 2 | MFlops: 188.669

I force the runtime to be more than 0.1sec, so the code repeats itself. The main loop:

const int N = 10000000;
timing(&wcs);
for(; runtime < 0.1; repeat*=2)
{
    for(int r = 0; r < repeat; r++)
    {
        #pragma omp parallel for
        for(int i = 0; i < N; i++)
        {
            A[i] += B[i];
        }
        if(A[0]==0) dummy(A[0]);
    }
    timing(&wce);
    runtime = wce-wcs;
}

MFLops: ((N*repeat)/runtime)/1000000

-1: vector is a 0-overhead container under most circumstances, relative to a pure C array implementation. The extra metadata (length and reserved size of the array) are stored with the vector instance, not within each element. — Mr Fooz, Apr 02 '13 at 03:00
-1 => +1: you appear to be correct. Interesting. Do you know why it loads any extra metadata? operator[] doesn't do bounds checking. — Mr Fooz, Apr 02 '13 at 17:58
@MrFooz: [Check this out](http://stackoverflow.com/questions/6632971/what-is-the-difference-between-stdarray-and-stdvector-when-do-you-use-one-o) — DOOM, Apr 03 '13 at 04:23

score 1 · Answer 3 · answered Mar 31 '11 at 07:35

1

May be g++ optimized out whole AddVectors calls? Try to return last v1 element and store it in volatile variable.

answered Mar 31 '11 at 07:35

Sergey Miryanov

1,820
16
29

Why is OpenMP slow in this case?

3 Answers3