I have a few questions regarding `#pragma omp for schedule(static)` where the chunk size is not specified.
One way to parallelize a loop in OpenMP is to do it manually like this:
#pragma omp parallel
{
    const int nthreads = omp_get_num_threads();
    const int ithread  = omp_get_thread_num();
    const int start  = ithread*N/nthreads;
    const int finish = (ithread+1)*N/nthreads;
    for(int i = start; i < finish; i++) {
        // loop body
    }
}
Is there a good reason not to manually parallelize a loop like this in OpenMP? If I compare the values with `#pragma omp for schedule(static)`, I see that the chunk sizes for a given thread don't always agree, so OpenMP (in GCC) is implementing the chunk sizes differently than as defined by `start` and `finish`. Why is this?
The `start` and `finish` values I defined have several convenient properties.
- Each thread gets at most one chunk.
- The range of iteration values increases directly with the thread number (i.e. for 100 iterations with two threads, the first thread will process iterations 1-50 and the second thread 51-100, and not the other way around).
- For two for loops over exactly the same range each thread will run over exactly the same iterations.
Edit: Originally I said exactly one chunk, but after thinking about it, it's possible for the size of a chunk to be zero if the number of threads is much larger than `N` (then `ithread*N/nthreads == (ithread+1)*N/nthreads` for some threads). The property I really want is at most one chunk.
Are all these properties guaranteed when using `#pragma omp for schedule(static)`?
According to the OpenMP specifications:
Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming.
and
Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule.
For `schedule(static)` the specification says:
chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.
Additionally, the specification says for `schedule(static)`:
When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread.
Finally, the specification says for `schedule(static)`:
A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, 3) both loop regions bind to the same parallel region.
So if I read this correctly, `schedule(static)` has the same convenient properties I listed for `start` and `finish`, even though my code relies on which thread executes a particular iteration. Do I interpret this correctly? This seems to be a special case for `schedule(static)` when the chunk size is not specified.
It's easier to just define `start` and `finish` like I did than to try and interpret the specification for this case.