22

A few times I have parallelized portions of a program with OpenMP, only to notice that in the end, despite the good scalability, most of the foreseen speed-up was lost to the poor performance of the single-threaded case (compared to the serial version).

The usual explanation on the web for this behavior is that the code generated by compilers may be worse in the multi-threaded case. However, I have not been able to find a reference that explains why the assembly may be worse.

So, what I would like to ask the compiler people out there is:

Can compiler optimizations be inhibited by multi-threading? If so, how could performance be affected?

If it helps narrow down the question, I am mainly interested in high-performance computing.

Disclaimer: As stated in the comments, some of the answers below may become obsolete in the future, since they briefly discuss the way compilers handled optimizations at the time the question was posed.

Massimiliano
  • If you just included your code sample and profiling results, it would be much more answerable. – GManNickG May 29 '13 at 07:23
  • @GManNickG Honestly, I don't have a sample ready that is small enough to be posted here. Anyhow, I would be more interested in seeing the simplest possible examples of compiler optimizations that are prevented by a multi-threaded environment (with a brief explanation of why they are prevented), rather than a solution to a particular problem. The hope is to be able to extrapolate the knowledge gained from these examples to a real project. – Massimiliano May 29 '13 at 08:01
  • Not at all sure that this is on-topic for SO, interesting though it might be. It's more like doing research. – Martin James May 29 '13 at 09:33
  • I don't know much about OpenMP; does it use user threads or kernel threads? If it's user threads and you're using a lot of system calls, then expect to have poor performance. – stormCloud May 29 '13 at 09:41
  • Your question is too broad and besides makes a bad future reference, since compilers evolve every single day. A compiler that is not able to do something within OpenMP regions today might do it tomorrow - see the reference about GCC in my comment to raxman's post. – Hristo Iliev May 29 '13 at 11:24
  • This question is far too broad. As a general rule, if an entire book could be written to answer your question, it doesn't belong on SO. And that's certainly the case here. – Ben Voigt May 29 '13 at 16:14
  • @HristoIliev I tried to rephrase the question to take into account your comments. Please tell me if you think the question now is less prone to be a future "bad reference" or if its scope was narrowed enough. – Massimiliano May 29 '13 at 17:01
  • @Massimiliano: I think the new question's scope is much better. – Ben Voigt May 29 '13 at 18:44
  • It is still too broad. Optimisations are highly compiler/backend-specific and OpenMP regions can be implemented in a multitude of ways, not only by code outlining, e.g. what PGI does. I think the question is interesting but not fit for SO. It could make a great blog post though. – Hristo Iliev May 29 '13 at 23:21
  • By the way, by "bad reference" I mean "not quite useful as per SO guidelines" and not "wrong". Simply put, things are going to change and the answers are to become irrelevant to future compiler versions. – Hristo Iliev May 29 '13 at 23:25
  • Opportunistic writes (cases where optimizations might cause a write to memory that otherwise would not have occurred) have to be disabled when compiling multi-threaded code, otherwise it's impossible to write sane multi-threaded code. – David Schwartz Apr 29 '14 at 20:26

4 Answers

7

I think this answer describes the reason sufficiently, but I'll expand a bit here.

Before, however, here's gcc 4.8's documentation on -fopenmp:

-fopenmp
Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.

Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.

The reason, however, why OpenMP with one thread has overhead compared to no OpenMP at all is that the compiler needs to transform the code, outlining it into separate functions, so that it is ready for the case of OpenMP with n > 1 threads. So let's think of a simple example:

int *b = ...
int *c = ...
int i;
int a = 0;

#pragma omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
    a += b[i] + c[i];

This code should be converted to something like this:

struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...
for (t = 1; t < nthreads; ++t) {
    /* create_thread with __omp_func1 function */
}
/* for master thread, don't create a thread */
struct __omp_func1_data md = {
    .start = /*...*/,
    .end = /*...*/,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;
for (t = 1; t < nthreads; ++t)
{
    /* join with thread */
    /* add thread_data->a to a */
}

Now if we run this with nthreads==1, the code effectively gets reduced to:

struct __omp_func1_data
{
    int start;
    int end;
    int *b;
    int *c;
    int a;
};

void *__omp_func1(void *data)
{
    struct __omp_func1_data *d = data;
    int i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];

    return NULL;
}

...
struct __omp_func1_data md = {
    .start = 0,
    .end = 100,
    .b = b,
    .c = c
};

__omp_func1(&md);
a += md.a;

So what are the differences between the no-OpenMP version and the single-threaded OpenMP version?

One difference is that there is extra glue code. The variables that need to be passed to the function created by OpenMP need to be put together to form one argument, so there is some overhead in preparing for the function call (and later retrieving the data).

More importantly, however, the code is no longer in one piece. Cross-function optimization is not yet very advanced, and most optimizations are done within each function. Smaller functions mean less opportunity to optimize.
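To make this concrete, here is a minimal sketch of why the outlined form can be harder to optimize (the names sum_plain, sum_outlined and worker_data are invented for illustration, and this is not what gcc actually emits): in the plain loop the accumulator is a local and the arrays are only read, while in the outlined function every access goes through a struct pointer, so the compiler first has to prove there is no aliasing before it can keep the accumulator in a register or vectorize.

/* Hypothetical illustration, not actual compiler output. */
#include <stddef.h>

/* In one piece: 'a' is a local and b/c are only read, so the compiler can
   keep the accumulator in a register and is free to unroll or vectorize. */
int sum_plain(const int *b, const int *c, size_t n)
{
    int a = 0;
    size_t i;

    for (i = 0; i < n; ++i)
        a += b[i] + c[i];
    return a;
}

struct worker_data { size_t start, end; int *b, *c; int a; };

/* Outlined: everything is reached through 'd'. Unless the compiler proves
   that d->a cannot alias d->b[i] or d->c[i], the straightforward code keeps
   storing the accumulator to memory on every iteration, and vectorization
   becomes much harder to justify. */
void sum_outlined(struct worker_data *d)
{
    size_t i;

    d->a = 0;
    for (i = d->start; i < d->end; ++i)
        d->a += d->b[i] + d->c[i];
}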


To finish this answer, I'd like to show you exactly how -fopenmp affects gcc's options. (Note: I'm on an old computer now, so I have gcc 4.4.3)

Running gcc -Q -v some_file.c gives this (relevant) output:

GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486
 -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
 -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
 -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
 -finline-functions-called-once -fira-share-save-slots
 -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
 -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
 -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
 -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
 -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
 -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
 -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
 -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
 -mpush-args -msahf -mtls-direct-seg-refs

and running gcc -Q -v -fopenmp some_file.c gives this (relevant) output:

GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed:  -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic
 -march=i486 -fopenmp -fstack-protector
options enabled:  -falign-loops -fargument-alias -fauto-inc-dec
 -fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
 -feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
 -finline-functions-called-once -fira-share-save-slots
 -fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
 -fmath-errno -fmerge-debug-strings -fmove-loop-invariants
 -fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
 -fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
 -fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
 -ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
 -ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
 -ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
 -fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
 -maccumulate-outgoing-args -malign-stringops -mfancy-math-387
 -mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
 -mpush-args -msahf -mtls-direct-seg-refs

Taking a diff, we can see that the only difference is that with -fopenmp, -D_REENTRANT is defined (and of course -fopenmp is enabled). So, rest assured, gcc wouldn't produce worse code. It's just that it needs to add preparation code for the case when the number of threads is greater than 1, and that has some overhead.


Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the output of the same commands with -O3 added shows the same difference. So, even with -O3, no optimizations are disabled.

Shahbaz
  • Great explanation for GCC @Shahbaz. It would be interesting to see a similar discussion regarding MSVC. –  May 29 '13 at 18:34
  • @raxman, thanks. I've only recently learned OpenMP, so I don't have much experience with it. I have no idea about MSVC. However, if you can find an option that would output the enabled optimizations, you can test it. Otherwise, the general explanation is independent of gcc (it's more like my imagination of how OpenMP is probably implemented). – Shahbaz May 30 '13 at 12:55
  • I've only recently learned OpenMP as well (about 3 months now). I have used pthreads (on Windows as well), so your explanation is quite clear. I don't know how MS implements OpenMP in MSVC. I do know that its auto-vectorization conflicts with OpenMP, so that's one difference between GCC and MSVC. I use MSVC by day and GCC by night, so I'm stuck with both. –  May 30 '13 at 13:41
6

Apart from the explicit OMP pragmas, compilers just don't know that code can be executed by multiple threads. So they can make that code neither more nor less efficient.

This has grave consequences in C++. It is a particular problem for library authors: they cannot reasonably guess up front whether their code will be used in a program that uses threading. This is very visible when you read the source of a common C-runtime or standard C++ library implementation: such code tends to be peppered with little locks all over the place to ensure it still operates correctly when used from threads. You pay for this even if you don't actually use that code in a threaded way. A good example is std::shared_ptr<>: you pay for the atomic update of the reference count even if the smart pointer is only ever used in one thread. The standard doesn't provide a way to ask for non-atomic updates; a proposal to add that feature was rejected.
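To make the "you pay even when single-threaded" point concrete, here is a toy C sketch in the spirit of a reference count. The names toy_rc and toy_retain are invented, it uses the GCC __atomic_fetch_add builtin, and real libraries such as libstdc++ use their own, more elaborate detection of whether threads are in play; this only shows the kind of choice a library has to make at build time, here keyed off the _REENTRANT macro that threaded builds typically define:

#include <stdio.h>

typedef struct { int refs; } toy_rc;

static void toy_retain(toy_rc *p)
{
#ifdef _REENTRANT
    /* Thread-aware build (e.g. -pthread or -fopenmp define _REENTRANT):
       the increment must be an atomic read-modify-write, which costs more
       than a plain add even if only one thread ever touches p. */
    __atomic_fetch_add(&p->refs, 1, __ATOMIC_RELAXED);
#else
    /* Single-threaded build: an ordinary increment is enough. */
    p->refs++;
#endif
}

int main(void)
{
    toy_rc rc = { 1 };
    toy_retain(&rc);
    printf("refs = %d\n", rc.refs);   /* prints 2 either way */
    return 0;
}

A single-threaded caller pays for the atomic whenever the library was built thread-aware, which is exactly the cost described above for std::shared_ptr<>.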

And it is very much detrimental the other way as well: there isn't anything the compiler can do to ensure that your own code is thread-safe. It is entirely up to you to make it thread-safe. That is hard to do, and it goes wrong in subtle and very hard to diagnose ways all the time.

Big problems, not simple to solve. Maybe that's a good thing, otherwise everybody could be a programmer ;)

Hans Passant
3

That's a good question, even if it's rather broad, and I'm looking forward to hearing from the experts. I think @JimCownie had a good comment about this in the following discussion: Reasons for omp_set_num_threads(1) slower than no openmp

Auto-vectorization and auto-parallelization, I think, are often a problem. If you turn on auto-parallelization in MSVC 2012 (auto-vectorization is on by default), they seem not to mix well together. Using OpenMP seems to disable MSVC's auto-vectorization. The same may be true for GCC with OpenMP and auto-vectorization, but I'm not sure.

I don't really trust auto-vectorization in the compiler anyway. One reason is that I'm not sure it unrolls loops to eliminate loop-carried dependencies as well as hand-written scalar code can. For this reason I try to do these things myself: I do the vectorization myself (using Agner Fog's vector class) and I unroll the loops myself. By doing this by hand I feel more confident that I maximize all the parallelism: TLP (e.g. with OpenMP), ILP (e.g. by removing data dependencies with loop unrolling), and SIMD (with explicit SSE/AVX code).
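As an illustration of the "unroll to remove the loop-carried dependency" part, here is a minimal C sketch that combines TLP via OpenMP with ILP via four independent accumulators (the function name dot_unrolled is made up, and the explicit SSE/AVX layer is left out to keep it short):

/* Dot product with 4 independent partial sums: the single accumulator of
   the naive loop forms a loop-carried dependency; four partial sums give
   the CPU (and a later SIMD version) independent chains to overlap. */
float dot_unrolled(const float *x, const float *y, int n)
{
    float s0 = 0.0f, s1 = 0.0f, s2 = 0.0f, s3 = 0.0f;
    int i;

    #pragma omp parallel for reduction(+:s0,s1,s2,s3)
    for (i = 0; i < n - 3; i += 4) {      /* unrolled by 4 */
        s0 += x[i]     * y[i];
        s1 += x[i + 1] * y[i + 1];
        s2 += x[i + 2] * y[i + 2];
        s3 += x[i + 3] * y[i + 3];
    }
    for (i = n - n % 4; i < n; ++i)       /* scalar remainder */
        s0 += x[i] * y[i];
    return s0 + s1 + s2 + s3;
}

Whether this actually beats the compiler's own vectorizer depends on the compiler and the CPU, as the comments below note.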

Community
  • Older GCC versions couldn't prove the non-aliasing conditions for arrays in OpenMP regions and therefore couldn't properly vectorise, but recent versions fixed this - see [this question](http://stackoverflow.com/q/14861447/1374437). I really wouldn't try to outsmart the compiler when it comes to vectorisation, especially Intel's compiler. To achieve max performance one has to implement proper prefetching and sometimes employ non-temporal memory access. The compiler comes with a heuristic engine that uses CPU-dependent cost functions, and that is very hard to beat when programming by hand. – Hristo Iliev May 29 '13 at 11:19
  • Good information about GCC and auto-vectorization. However, I use ICC and GCC, and in many cases writing my own SIMD code is faster, e.g. the only way to get efficient GEMM is by explicitly implementing SSE/AVX (my own code gets 70% efficiency -> better than Eigen now). The compiler is not going to do it for you. Also see the end of my answer for the transpose of a matrix. http://stackoverflow.com/questions/16737298/what-is-the-fastest-way-to-transpose-a-matrix-in-c/16743203#16743203 The compiler will not figure that out either. –  May 29 '13 at 12:29
3

There is a lot of good information above, but the proper answer is that some optimizations MUST be turned off when compiling OpenMP code. Some compilers, such as gcc, don't do that.

The example program at the end of this answer is searching for the value 81 in four non-overlapping ranges of integers. It should always find that value. However, on all gcc versions up to at least 4.7.2, the program sometimes does not terminate with the correct answer. To see for yourself, do this:

  • Copy the program into a file parsearch.c
  • Compile it with gcc -fopenmp -O2 parsearch.c
  • Run it with OMP_NUM_THREADS=2 ./a.out
  • Run it a few more times (maybe 10); you will see two different answers coming out

Alternatively, you can compile without -O2 (or with -O0), and see that the outcome is always correct.

Given that the program is free of race conditions, this behaviour of the compiler under -O2 is incorrect.

The behaviour is due to the global variable globFound. Please convince yourself that, under the expected execution, only one of the 4 threads in the parallel for writes to that variable. The OpenMP semantics specify that if a global (shared) variable is written by only one thread, the value of the global variable after the parallel for is the value written by that single thread. There is no communication between the threads through the global variable, and such communication would not be allowed, as it gives rise to race conditions.

What the compiler optimization does under -O2 is estimate that writing to a global variable inside a loop is expensive, and therefore cache it in a register. This happens in the function findit, which, after optimization, looks roughly like:

int tempo = globFound ;
for ( ... ) {
    if ( ... ) {
        tempo = i;
    }
}
globFound = tempo;

But with this 'optimized' code, every thread reads and writes globFound, and a race condition is introduced by the compiler itself.

Compiler optimizations do need to be aware of parallel execution. Excellent material about this has been published by Hans-J. Boehm under the general topic of memory consistency.

#include <stdio.h>
#define BIGVAL  (100 * 1000 * 1000)

int globFound ;

void findit( int from, int to )
{
    int i ;

    for( i = from ; i < to ; i++ ) {
        if( (long)i * i == 81L ) {   /* long arithmetic avoids int overflow for large i */
            globFound = i ;
        }
    }
}

int main( int argc, char *argv[] )
{
    int p ;

    globFound = -1 ;

    #pragma omp parallel for
    for( p = 0 ; p < 4 ; p++ ) {
        findit( p * BIGVAL, (p+1) * BIGVAL ) ;
    }
    if( globFound == -1 ) {
        printf( ">>>>NO 81 TODAY<<<<\n\n" ) ;
    } else {
        printf( "Found! N = %d\n\n", globFound ) ;
    }
    return 0 ;
}
Marcel Beemster