I think this answer describes the reason sufficiently, but I'll expand a bit here.
Before, however, here's gcc 4.8's documentation on -fopenmp
:
-fopenmp
Enable handling of OpenMP directives #pragma omp in C/C++ and !$omp in Fortran. When -fopenmp is specified, the compiler generates parallel code according to the OpenMP Application Program Interface v3.0 http://www.openmp.org/. This option implies -pthread, and thus is only supported on targets that have support for -pthread.
Note that it doesn't specify disabling of any features. Indeed, there is no reason for gcc to disable any optimization.
The reason however why openmp with 1 thread has overhead with respect to no openmp is the fact that the compiler needs to convert the code, adding functions so it would be ready for cases with openmp with n>1 threads. So let's think of a simple example:
int *b = ...
int *c = ...
int a = 0;
#omp parallel for reduction(+:a)
for (i = 0; i < 100; ++i)
a += b[i] + c[i];
This code should be converted to something like this:
struct __omp_func1_data
{
int start;
int end;
int *b;
int *c;
int a;
};
void *__omp_func1(void *data)
{
struct __omp_func1_data *d = data;
int i;
d->a = 0;
for (i = d->start; i < d->end; ++i)
d->a += d->b[i] + d->c[i];
return NULL;
}
...
for (t = 1; t < nthreads; ++t)
/* create_thread with __omp_func1 function */
/* for master thread, don't create a thread */
struct master_data md = {
.start = /*...*/,
.end = /*...*/
.b = b,
.c = c
};
__omp_func1(&md);
a += md.a;
for (t = 1; t < nthreads; ++t)
{
/* join with thread */
/* add thread_data->a to a */
}
Now if we run this with nthreads==1
, the code effectively gets reduced to:
struct __omp_func1_data
{
int start;
int end;
int *b;
int *c;
int a;
};
void *__omp_func1(void *data)
{
struct __omp_func1_data *d = data;
int i;
d->a = 0;
for (i = d->start; i < d->end; ++i)
d->a += d->b[i] + d->c[i];
return NULL;
}
...
struct master_data md = {
.start = 0,
.end = 100
.b = b,
.c = c
};
__omp_func1(&md);
a += md.a;
So what are the differences between the no openmp version and the single threaded openmp version?
One difference is that there is extra glue code. The variables that need to be passed to the function created by openmp need to be put together to form one argument. So there is some overhead preparing for the function call (and later retrieving data)
More importantly however, is that now the code is not in one piece any more. Cross-function optimization is not so advanced yet and most optimizations are done within each function. Smaller functions means there is smaller possibility to optimize.
To finish this answer, I'd like to show you exactly how -fopenmp
affects gcc
's options. (Note: I'm on an old computer now, so I have gcc 4.4.3)
Running gcc -Q -v some_file.c
gives this (relevant) output:
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed: -v a.c -D_FORTIFY_SOURCE=2 -mtune=generic -march=i486
-fstack-protector
options enabled: -falign-loops -fargument-alias -fauto-inc-dec
-fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
-feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
-finline-functions-called-once -fira-share-save-slots
-fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
-fmath-errno -fmerge-debug-strings -fmove-loop-invariants
-fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
-fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
-fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
-ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
-ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
-ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
-fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
-maccumulate-outgoing-args -malign-stringops -mfancy-math-387
-mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
-mpush-args -msahf -mtls-direct-seg-refs
and running gcc -Q -v -fopenmp some_file.c
gives this (relevant) output:
GGC heuristics: --param ggc-min-expand=98 --param ggc-min-heapsize=128106
options passed: -v -D_REENTRANT a.c -D_FORTIFY_SOURCE=2 -mtune=generic
-march=i486 -fopenmp -fstack-protector
options enabled: -falign-loops -fargument-alias -fauto-inc-dec
-fbranch-count-reg -fcommon -fdwarf2-cfi-asm -fearly-inlining
-feliminate-unused-debug-types -ffunction-cse -fgcse-lm -fident
-finline-functions-called-once -fira-share-save-slots
-fira-share-spill-slots -fivopts -fkeep-static-consts -fleading-underscore
-fmath-errno -fmerge-debug-strings -fmove-loop-invariants
-fpcc-struct-return -fpeephole -fsched-interblock -fsched-spec
-fsched-stalled-insns-dep -fsigned-zeros -fsplit-ivs-in-unroller
-fstack-protector -ftrapping-math -ftree-cselim -ftree-loop-im
-ftree-loop-ivcanon -ftree-loop-optimize -ftree-parallelize-loops=
-ftree-reassoc -ftree-scev-cprop -ftree-switch-conversion
-ftree-vect-loop-version -funit-at-a-time -fvar-tracking -fvect-cost-model
-fzero-initialized-in-bss -m32 -m80387 -m96bit-long-double
-maccumulate-outgoing-args -malign-stringops -mfancy-math-387
-mfp-ret-in-387 -mfused-madd -mglibc -mieee-fp -mno-red-zone -mno-sse4
-mpush-args -msahf -mtls-direct-seg-refs
Taking a diff, we can see that the only difference is that with -fopenmp
, we have -D_REENTRANT
defined (and of course -fopenmp
enabled). So, rest assured, gcc wouldn't produce worse code. It's just that it needs to add preparation code for when number of threads is greater than 1 and that has some overhead.
Update: I really should have tested this with optimization enabled. Anyway, with gcc 4.7.3, the output of the same commands, added -O3
will give the same difference. So, even with -O3
, there are no optimization's disabled.