Why is a simple for loop faster without OpenMP than with it?

Question

Here is my test code for OpenMP

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#include <time.h>


int main(int argc, char const *argv[]){

    double x[10000];
    clock_t start, end;
    double cpu_time_used;
    start = clock();

    #pragma omp parallel
    #pragma omp for
    for (int i = 0; i < 10000; ++i){
        x[i]    = 1;
    }

    end = clock();
    cpu_time_used = ((double) (end - start)) / CLOCKS_PER_SEC;
    printf("%lf\n", cpu_time_used);
    return 0;
}

I compiled the code with the following two commands:

gcc test.c -o main

The output of running main is 0.000039

Then I compiled with OpenMP

gcc test.c -o main -fopenmp

and the output is 0.008020

Could anyone help me understand why this happens? Thanks in advance.

I think you need braces after the `#pragma omp parallel` statement, surrounding the `#pragma omp for` loop — Tah, Dec 17 '16 at 07:23
Do you mean like this? `#pragma omp parallel{......}`, I tried and it is the same and doesn't work. — Fly_back, Dec 17 '16 at 07:30
It's been awhile since I've used openMP, but the biggest cost of your code is the overhead management of threads. You will see greater yield at much larger executions (try something like 1 million). — Tah, Dec 17 '16 at 07:32
OpenMP comes with some run-time overhead and for such a small trivial problem as your test it's probably still lacing up its track shoes while the serial program is having its post-race cigarette. Try much larger problem sizes, with heavier loads inside the loops and, after reference to the topic in other questions and answers hereabouts, don't use `clock` to time parallel codes. — High Performance Mark, Dec 17 '16 at 07:32
Thanks both. So @HighPerformanceMark, which function should I use to time the OpenMP code? — Fly_back, Dec 17 '16 at 07:40
As you are totalling up the time spent in all threads, you would expect CPU time to increase with number of threads (as HPM hinted). omp_get_wtime() ought to pick a suitable timer for your platform. — tim18, Dec 17 '16 at 11:57

score 1 · Answer 1 · answered Dec 17 '16 at 16:31

As High Performance Mark so eloquently described in his comment, there is a cost (overhead) associated with creating threads and distributing work among them. For such a tiny piece of work (39 µs), the overhead outweighs any possible gains.

That said, your measurement is also misleading. `clock` measures CPU time, which is summed over all threads, whereas what you most likely wanted is wall-clock time. For more details, see this question.

Another misconception you might have: as soon as x is large enough, this simple loop becomes memory-bound, so you will likely not see the speedup you expect. For example, on a typical desktop system with four cores you might see a speedup of 1.5x instead of 4x.

A large overhead is associated with creating threads, which normally happens only at the first parallel region. A more realistic measure of the overhead (assuming that your code has more than one parallel region) is to have an empty parallel region before you start your timing, so that the threads have been created and you're just measuring the normal cost of waking them up. — Jim Cownie, Dec 19 '16 at 09:48
