I've run across something odd. I'm testing an MPI + OMP parallel code on a small local machine with just a humble 4-core i3. One of my loops, it turns out, is very slow with more than 1 OMP thread per process in this environment (more threads than cores).
#pragma omp parallel for
for ( int i = 0; i < HEIGHT; ++i )
{
    for ( int j = 0; j < WIDTH; ++j )
    {
        // Normalize each sample to [0, 1], then scale to an 8-bit value.
        double a = ( data[ sIdx * S_SZ + j + i * WIDTH ] - dMin ) / ( dMax - dMin );
        buff[ i ][ j ] = ( unsigned char ) ( 255.0 * a );
    }
}
If I run this code with the defaults (without setting OMP_NUM_THREADS or calling omp_set_num_threads), it takes about 1 s. However, if I explicitly set the number of threads with either method (export OMP_NUM_THREADS=1 or omp_set_num_threads(1)), it takes about 0.005 s (200X faster).
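For reference, here is roughly how I'm comparing the two cases. This is a self-contained sketch rather than my actual program: the buffer sizes, the data contents, and the sIdx / S_SZ values are placeholders I picked just so it compiles and runs.

#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
    // Placeholder sizes and data -- not my real buffers.
    const int WIDTH = 2000, HEIGHT = 2000;
    const int sIdx = 0, S_SZ = 0;
    const double dMin = 0.0, dMax = 2.0;
    std::vector< double > data( ( size_t ) WIDTH * HEIGHT, 1.0 );
    std::vector< std::vector< unsigned char > > buff(
        HEIGHT, std::vector< unsigned char >( WIDTH ) );

    // omp_set_num_threads( 1 );  // uncommenting this is what makes the loop ~200X faster for me

    double t0 = omp_get_wtime();
    #pragma omp parallel for
    for ( int i = 0; i < HEIGHT; ++i )
    {
        for ( int j = 0; j < WIDTH; ++j )
        {
            double a = ( data[ sIdx * S_SZ + j + i * WIDTH ] - dMin ) / ( dMax - dMin );
            buff[ i ][ j ] = ( unsigned char ) ( 255.0 * a );
        }
    }
    double t1 = omp_get_wtime();

    std::printf( "loop: %f s, omp_get_max_threads() = %d\n", t1 - t0, omp_get_max_threads() );
    return 0;
}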
But it seems that omp_get_num_threads() returns 1 regardless. In fact, if I just call omp_set_num_threads( omp_get_num_threads() ), it takes about 0.005 s, whereas with that line commented out it takes about 1 s.

Any idea what is going on here? Why should calling omp_set_num_threads( omp_get_num_threads() ) once at the beginning of a program ever result in a 200X difference in performance?
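For what it's worth, here is how I've been checking the thread count. If I understand the API right, omp_get_num_threads() called outside a parallel region reports the size of the current (serial) team, which is always 1, so maybe omp_get_max_threads() is the more telling number; this is just a sketch of that check:

#include <omp.h>
#include <cstdio>

int main()
{
    // Outside any parallel region the team is just the initial thread,
    // so omp_get_num_threads() reports 1 here regardless of settings.
    std::printf( "outside: num_threads = %d, max_threads = %d\n",
                 omp_get_num_threads(), omp_get_max_threads() );

    #pragma omp parallel
    {
        #pragma omp single
        std::printf( "inside:  num_threads = %d\n", omp_get_num_threads() );
    }
    return 0;
}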
Some context:
cpu: Intel(R) Core(TM) i3-9100F CPU @ 3.60GHz
g++ --version: g++ (GCC) 10.2.0
compiler flags: mpic++ -std=c++11 -O3 -fpic -fopenmp ...
running program: mpirun -np 4 ./a.out