
I have a few questions regarding #pragma omp for schedule(static) where the chunk size is not specified.

One way to parallelize a loop in OpenMP is to do it manually like this:

#include <omp.h>

#pragma omp parallel 
{
    const int nthreads = omp_get_num_threads();
    const int ithread = omp_get_thread_num();
    // Split the N iterations into nthreads contiguous chunks of
    // nearly equal size; integer division absorbs the remainder.
    const int start = ithread*N/nthreads;
    const int finish = (ithread+1)*N/nthreads;
    for(int i = start; i<finish; i++) {
        // per-iteration work on index i
    }
}

Is there a good reason not to manually parallelize a loop like this in OpenMP? If I compare the values with #pragma omp for schedule(static), I see that the chunk sizes for a given thread don't always agree, so OpenMP (in GCC) is implementing the chunk sizes differently than defined by start and finish. Why is this?
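
One quick way to see the disagreement is to record which thread runs each iteration under both schemes and print the two assignments side by side. Here is a minimal sketch (N = 17 is an arbitrary value chosen so the iterations don't divide evenly among threads):

#include <omp.h>
#include <stdio.h>

#define N 17

int main(void) {
    int owner_manual[N], owner_static[N];
    #pragma omp parallel
    {
        const int nthreads = omp_get_num_threads();
        const int ithread = omp_get_thread_num();
        const int start = ithread*N/nthreads;
        const int finish = (ithread+1)*N/nthreads;
        for(int i = start; i<finish; i++)
            owner_manual[i] = ithread;              // my manual assignment

        #pragma omp for schedule(static)
        for(int i = 0; i<N; i++)
            owner_static[i] = omp_get_thread_num(); // the runtime's assignment
    }
    for(int i = 0; i<N; i++)
        printf("i=%2d manual=%d static=%d\n", i, owner_manual[i], owner_static[i]);
    return 0;
}

With GCC the two columns differ for some values of i, which is exactly the mismatch described above.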

The start and finish values I defined have several convenient properties.

  1. Each thread gets at most one chunk.
  2. The range of values for iterations increases directly with thread number (i.e. for 100 iterations with two threads, the first thread will process iterations 1-50 and the second thread 51-100, and not the other way around).
  3. For two for loops over exactly the same range each thread will run over exactly the same iterations.

Edit: Originally I said exactly one chunk, but after thinking about it, it's possible for the size of a chunk to be zero if the number of threads is much larger than N (then ithread*N/nthreads = (ithread+1)*N/nthreads). The property I really want is at most one chunk.
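
For example, a quick sequential check of the two formulas (N = 3 and 8 threads are arbitrary values picked for illustration) shows the empty ranges:

#include <stdio.h>

int main(void) {
    const int N = 3, nthreads = 8;
    for(int ithread = 0; ithread < nthreads; ithread++) {
        const int start = ithread*N/nthreads;
        const int finish = (ithread+1)*N/nthreads;
        // start == finish means this thread's chunk is empty
        printf("thread %d: [%d, %d) size %d\n",
               ithread, start, finish, finish - start);
    }
    return 0;
}

Here only threads 2, 5 and 7 get an iteration; the other five get a zero-size chunk.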

Are all these properties guaranteed when using #pragma omp for schedule(static)?

According to the OpenMP specifications:

Programs that depend on which thread executes a particular iteration under any other circumstances are non-conforming.

and

Different loop regions with the same schedule and iteration count, even if they occur in the same parallel region, can distribute iterations among threads differently. The only exception is for the static schedule.

For schedule(static) the specification says:

chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.

Additionally, the specification says for schedule(static):

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread.

Finally, the specification says for schedule(static):

A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied: 1) both loop regions have the same number of loop iterations, 2) both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified, 3) both loop regions bind to the same parallel region.

So if I read this correctly, schedule(static) will have the same convenient properties I listed for start and finish, even though my code relies on which thread executes a particular iteration. Do I interpret this correctly? This seems to be a special case for schedule(static) when the chunk size is not specified.

It's easier to just define start and finish like I did than to try and interpret the specification for this case.

Z boson

1 Answer


Is there a good reason not to do manually parallelize a loop like this in OpenMP?

The first things that come to mind:

  1. You add a dependency on the OpenMP runtime library, and thus you either have to duplicate part of your code if you want to maintain a serial build, or you have to provide stubs for the runtime calls (see the sketch after this list).
  2. A worksharing construct requires less code and is more convenient to read than an explicitly parallelized loop. If you need to maintain a large code-base, that would matter.
  3. You give up every scheduling option other than schedule(static).
  4. Messing with low-level details can easily backfire.
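
For point 1, here is a minimal sketch of the kind of stubs one could provide so that the manual version still builds without OpenMP (the two stubs below are my own illustration, not part of any standard header):

#ifdef _OPENMP
#include <omp.h>
#else
/* Serial-build stubs: the only two runtime calls
   the manual version in the question needs. */
static int omp_get_num_threads(void) { return 1; }
static int omp_get_thread_num(void)  { return 0; }
#endif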

Are all these properties guaranteed when using #pragma omp for schedule(static)?

Let's see one by one:

1.) Each thread gets exactly one chunk

When no chunk_size is specified, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is distributed to each thread. Note that the size of the chunks is unspecified in this case.

At most one chunk is not exactly one chunk, so property one is not fulfilled as originally stated. Note also that the size of the chunks is left unspecified in this case.

2.) The range of values for iterations increase directly with thread number (i.e. for 100 iterations with two threads the first thread will process iterations 1-50 and the second thread 51-100 and not the other way around)

A compliant implementation of the static schedule must ensure that the same assignment of logical iteration numbers to threads will be used in two loop regions if the following conditions are satisfied:

  1. both loop regions have the same number of loop iterations
  2. both loop regions have the same value of chunk_size specified, or both loop regions have no chunk_size specified
  3. both loop regions bind to the same parallel region.

A data dependence between the same logical iterations in two such loops is guaranteed to be satisfied allowing safe use of the nowait clause (see Section A.10 on page 182 for examples).

Even though I have never seen anything different from what you describe, I dare say that even property two is not guaranteed, at least not for plain schedule(static). It seems to me that for an iteration space of a given cardinality, the only guarantee is that the same "logical iteration numbers" will be given to the same thread if conditions 1, 2 and 3 are respected.

It is indeed guaranteed if you specify the chunk size:

When schedule(static, chunk_size) is specified, iterations are divided into chunks of size chunk_size, and the chunks are assigned to the threads in the team in a round-robin fashion in the order of the thread number.
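
For instance, schedule(static, 1) pins chunk k to thread k % nthreads, so thread t executes iterations t, t + nthreads, t + 2*nthreads, and so on. A small sketch to make the mapping visible (the printf is only for illustration):

#include <omp.h>
#include <stdio.h>

int main(void) {
    // Chunks of size 1 are dealt out round-robin in thread order.
    #pragma omp parallel for schedule(static, 1)
    for(int i = 0; i < 8; i++)
        printf("iteration %d -> thread %d\n", i, omp_get_thread_num());
    return 0;
}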

3.) For two for loops over exactly the same range each thread will run over exactly the same iterations

This is indeed guaranteed, and it is even more general: for two loops with an iteration space of the same cardinality, the same "logical iteration numbers" will be given to each thread. Example A.10.2c of the OpenMP 3.1 standard should clarify this point.
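
In code, that guarantee is what makes the classic nowait pattern of that example safe. A sketch along those lines (the array names and the size N are my own choices):

#include <omp.h>

#define N 1000

void work(float *a, float *b, const float *c) {
    #pragma omp parallel
    {
        #pragma omp for schedule(static) nowait
        for(int i = 0; i < N; i++)
            a[i] = 2.0f * c[i];   // first loop writes a[i]

        // No barrier is needed here: same iteration count, same
        // (unspecified) chunk size, same parallel region, so the
        // thread that wrote a[i] above is the one reading it below.
        #pragma omp for schedule(static) nowait
        for(int i = 0; i < N; i++)
            b[i] = a[i] + 1.0f;
    }
}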

Massimiliano
  • This is interesting. I'm not sure one or zero chunks is a problem. Two or more chunks per thread would be a problem. I'm filling an array so if the chunk size is zero the value of the array for that thread will be zero which is fine. I'll have to think about that. Maybe I'll update my question. As far as the order I'll wait and see if other answers come in. – Z boson Sep 11 '13 at 20:54
  • Good observation on `schedule(static, chunk_size)`. I wonder if the round-robin fashion in order of thread number applies when the chunk size is not specified as well. I bet it does even if the size is not specified. The specification is not clear. – Z boson Sep 11 '13 at 20:58
  • I just realized that for small N and a large number of threads that `ithread*N/nthreads=(ithread+1)*N/nthreads`. In other words the chunk size would be zero (effectively no chunk). I changed my question to reflect that. – Z boson Sep 11 '13 at 21:03
  • You gave the best answer based on the documentation. I still think that the chunks are chosen in order of thread number even when the chunk size is not specified for static, but the documentation is not clear. – Z boson Sep 12 '13 at 17:08