I'm playing around with OpenMP and I've stumbled upon something I don't understand. I'm using the following parallel code (which works correctly), and its execution time almost halves when I double the number of threads. However, the execution time with OpenMP and a single thread is 35 seconds, while commenting out the pragmas brings it down to 25 seconds! Is there something I can do to reduce this huge overhead? I'm using gcc 4.8.1 and compiling with "-O2 -Wall -fopenmp".
I've read similar questions (OpenMP with 1 thread slower than sequential version, OpenMP overhead), and the opinions there range from no overhead at all to a lot of overhead. I'm curious whether there is a better way to use OpenMP in my particular case: an outer for loop containing a parallel region with omp for loops inside.
    for (size_t k = 0; k < maxk; ++k) { // maxk is ~5000
        // init reduction variables
        const bool is_time_for_reduction = /* computed from k */;
        double mmin = INFINITY, mmax = -INFINITY;
        double sum = 0.0;

        #pragma omp parallel shared(m1, m2)
        {
            #pragma omp for
            for (size_t i = 0; i < h; ++i) { // w, h - consts, both between 1000 and 2000
                for (size_t j = 0; j < w; ++j) {
                    // computations with matrices m1 and m2, using only m1, m2 and constants w, h
                }
            }

            if (is_time_for_reduction) {
                #pragma omp for reduction(max:mmax) reduction(min:mmin) reduction(+:sum)
                for (size_t i = 0; i < h; ++i) {
                    for (size_t j = 0; j < w; ++j) {
                        // reductions into mmin, mmax, sum
                    }
                }
            }
        }

        if (is_time_for_reduction) {
            // use the "reduced" variables
        }
    }
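For reference, here is a minimal compilable sketch of the same structure. The matrix computation, the reduction body, and the `is_time_for_reduction` condition are placeholders I made up for the sketch, not my real code:

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Placeholder version of the loop nest: fills m2 from m1 in parallel, then
// periodically reduces min/max/sum over m2. Returns the last reduced sum.
double process(std::size_t w, std::size_t h, std::size_t maxk) {
    std::vector<double> m1(w * h, 1.0), m2(w * h, 0.0);
    double last_sum = 0.0;

    for (std::size_t k = 0; k < maxk; ++k) {
        const bool is_time_for_reduction = (k % 5 == 4); // placeholder condition
        double mmin = INFINITY, mmax = -INFINITY;
        double sum = 0.0;

        #pragma omp parallel shared(m1, m2)
        {
            #pragma omp for
            for (std::size_t i = 0; i < h; ++i)
                for (std::size_t j = 0; j < w; ++j)
                    m2[i * w + j] = m1[i * w + j] + 1.0; // placeholder computation

            if (is_time_for_reduction) {
                #pragma omp for reduction(max:mmax) reduction(min:mmin) reduction(+:sum)
                for (std::size_t i = 0; i < h; ++i)
                    for (std::size_t j = 0; j < w; ++j) {
                        const double v = m2[i * w + j];
                        if (v > mmax) mmax = v;
                        if (v < mmin) mmin = v;
                        sum += v;
                    }
            }
        }

        if (is_time_for_reduction)
            last_sum = sum; // "use" the reduced variables
    }
    return last_sum;
}
```

If compiled without -fopenmp, gcc ignores the pragmas and the function runs serially but still produces the same results, which is how I compare against the sequential version.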