I thought about which factors would influence the static scheduling overhead in OpenMP. In my opinion it is influenced by:

  • CPU performance
  • specific implementation of the OpenMP run-time library
  • the number of threads

But am I missing further factors? Maybe the size of the tasks, ...?

And furthermore: is the overhead linearly dependent on the number of iterations? If so, I would expect that with static scheduling and 4 cores the overhead increases linearly with 4*i iterations. Is that correct so far?

EDIT: I am only interested in the static (!) scheduling overhead itself. I am not talking about thread start-up overhead and time spent in synchronisation and critical section overhead.
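
One way to test the linearity hypothesis would be a micro-benchmark along these lines (a rough sketch; the empty loop body, N, and the schedule(dynamic,1) comparison point are just illustrative, and both runs pay the same thread start-up and implicit-barrier costs, so the difference between the two timings should be dominated by the per-iteration scheduling work):

#include <omp.h>
#include <stdio.h>

int main(void) {
    enum { N = 1000000 };
    volatile int v = 0;   // volatile read keeps the empty body from being optimized away
    double t0, t_static, t_dynamic;

    #pragma omp parallel
    {}                    // warm up the thread pool so start-up cost is paid up front

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(static)
    for(int i=0; i<N; i++) { int x = v; (void)x; }
    t_static = omp_get_wtime() - t0;

    t0 = omp_get_wtime();
    #pragma omp parallel for schedule(dynamic,1)
    for(int i=0; i<N; i++) { int x = v; (void)x; }
    t_dynamic = omp_get_wtime() - t0;

    printf("static : %g s\ndynamic: %g s\ndiff   : %g s\n",
           t_static, t_dynamic, t_dynamic - t_static);
    return 0;
}

Varying N would then show whether the overhead scales with the number of iterations.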

knacker123

1 Answer

You need to separate the overhead of OpenMP creating a team/pool of threads from the overhead of each thread operating on its own set of iterations of a for loop.

Static scheduling is easy to implement by hand (which is sometimes very useful). Let's look at the two most important static schedules, schedule(static) and schedule(static,1), and then compare them to schedule(dynamic,chunk).

#pragma omp parallel for schedule(static)
for(int i=0; i<N; i++) foo(i);

is equivalent to (but not necessarily equal to)

#pragma omp parallel
{
    // thread t gets one contiguous block: [t*N/nthreads, (t+1)*N/nthreads)
    int start = omp_get_thread_num()*N/omp_get_num_threads();
    int finish = (omp_get_thread_num()+1)*N/omp_get_num_threads();
    for(int i=start; i<finish; i++) foo(i);
}

and

#pragma omp parallel for schedule(static,1)
for(int i=0; i<N; i++) foo(i);

is equivalent to

#pragma omp parallel 
{
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    // iterations are dealt out round-robin, one per thread at a time
    for(int i=ithread; i<N; i+=nthreads) foo(i);
}

From this you can see that static scheduling is trivial to implement: each thread computes its bounds once, in constant time, independent of the number of iterations, so the scheduling overhead is negligible.
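
The general schedule(static,chunk) case follows the same pattern. Here is a sketch (the chunk size is illustrative): each thread takes every nthreads-th block of chunk iterations, round-robin, just like schedule(static,1) but with blocks instead of single iterations.

#pragma omp parallel
{
    int ithread = omp_get_thread_num();
    int nthreads = omp_get_num_threads();
    int chunk = 4;   // illustrative chunk size
    // thread t takes blocks t, t+nthreads, t+2*nthreads, ... of chunk iterations each
    for(int block=ithread*chunk; block<N; block+=nthreads*chunk) {
        int finish = block+chunk < N ? block+chunk : N;
        for(int i=block; i<finish; i++) foo(i);
    }
}

Again, each thread only does a constant amount of bookkeeping per block, with no synchronisation.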

On the other hand, if you want to implement schedule(dynamic) (which is the same as schedule(dynamic,1)) by hand, it's more complicated:

int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp atomic capture
    i = cnt++;     // atomically claim the next unclaimed iteration
    if(i>=N) break;
    foo(i);
}

This requires OpenMP >= 3.1. If you want to do this with OpenMP 2.0 (for MSVC) you need to use critical, like this:

int cnt = 0;
#pragma omp parallel
for(int i=0;;) {
    #pragma omp critical
    i = cnt++;     // the critical section serializes the counter update
    if(i>=N) break;
    foo(i);
} 

Here is an equivalent of schedule(dynamic,chunk) (I have not optimized this using atomic accesses; an atomic version is sketched further down):

int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        #pragma omp critical
        {   // claim the next block of chunk iterations
            start = cnt;
            finish = cnt+chunk < N ? cnt+chunk : N;
            cnt += chunk;
        }
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}

Clearly, dynamic scheduling has to update a shared counter with atomic or critical accesses, which causes more overhead than the simple index arithmetic of static scheduling. This also shows why using larger chunks for schedule(dynamic,chunk) can reduce the overhead: the counter is updated once per chunk instead of once per iteration.
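
For completeness, here is a sketch of the chunked version with the critical section replaced by atomic capture (like the schedule(dynamic) example above, this needs OpenMP >= 3.1, which allows the two-statement capture block used below):

int cnt = 0;
int chunk = 5;
#pragma omp parallel
{
    int start, finish;
    do {
        // atomically claim the next block of chunk iterations
        #pragma omp atomic capture
        { start = cnt; cnt += chunk; }
        finish = start+chunk < N ? start+chunk : N;
        for(int i=start; i<finish; i++) foo(i);
    } while(finish<N);
}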

Z boson
  • thanks for this detailed answer. Your explanations are, however, already known to me. I'm interested only in the scheduling overhead itself, that means no thread start-up overhead and no synchronisation (e.g. barrier) or critical section overhead. In my understanding the scheduling overhead is the time spent assigning a task to a thread (maybe including the time spent deciding which task to assign to which thread). So I am interested in which factors influence this SCHEDULING overhead. Maybe I did not formulate my question clearly enough – knacker123 Jun 02 '15 at 11:00
  • @knacker123, I described the SCHEDULING overhead. I did not describe the start-up overhead. Did you read my answer? For static scheduling the overhead is the cost to calculate start and finish, which is trivial. It takes constant time, `O(1)`, and does not depend on the number of threads or the number of iterations. However, with dynamic scheduling the overhead does depend on the number of iterations. Isn't that clear in my answer? – Z boson Jun 02 '15 at 11:37
  • Your dynamic scheduling example is perhaps a little misleading, since you could have updated the shared data using atomics. – jch Jun 02 '15 at 17:36
  • @jch, thanks for your comment. I'll see if I can fix it and update my answer when I have time. – Z boson Jun 03 '15 at 08:32
  • @jch, I'm not sure how to do this with `atomic`. How would you do it? – Z boson Jun 10 '15 at 08:58
  • `#pragma omp atomic capture seq_cst` `i = cnt++;`. – jch Jun 10 '15 at 12:15
  • @jch, I think that requires OpenMP >= 3.1. Actually, maybe even 4.0. This is the problem. I try to do everything using OpenMP 2.0. I'll update my answer when I test this. – Z boson Jun 10 '15 at 12:33
  • @jch, is `seq_cst` necessary (that's from OpenMP 4.0)? What does it do? What about just `#pragma omp atomic capture` `i = cnt++;`? – Z boson Jun 10 '15 at 12:39
  • @jch, I updated my answer, thank you for your comment. – Z boson Jun 10 '15 at 13:00
  • @Zboson you're right, it's not needed, I've edited your answer. Perhaps you could merge the atomic version into the body of the answer rather than putting it at the end? – jch Jun 10 '15 at 19:23
  • @jch, `atomic capture` requires OpenMP 3.1. Visual Studio only supports OpenMP 2.0, so I think the solution using critical is interesting as well. I updated my answer putting the atomic first. Thanks again for your comments. – Z boson Jun 11 '15 at 07:12
  • @Zboson how can we do this for GSS (guided self-scheduling) without using the OpenMP guided schedule keyword? – Tara Nov 28 '20 at 15:44