
I'm writing a program that should run in both serial and parallel versions. Once I got it to actually do what it is supposed to do, I started trying to parallelize it with OpenMP (which is compulsory).

The thing is, I can't find documentation or references on when to use which #pragma, so I'm doing my best by guessing and testing. But testing is not going well with nested loops.

How would you parallelize a series of nested loops like these:

for(int i = 0; i < 3; ++i){
    for(int j = 0; j < HEIGHT; ++j){
        for(int k = 0; k < WIDTH; ++k){
            switch(i){
                case 0:
                    matrix[j][k].a = matrix[j][k] * someValue1;
                    break;
                case 1:
                    matrix[j][k].b = matrix[j][k] * someValue2;
                    break;
                case 2:
                    matrix[j][k].c = matrix[j][k] * someValue3;
                    break;
            }
        }
    }
}
  • HEIGHT and WIDTH are usually the same size in the tests I have to run. Some test examples are 32x32 and 4096x4096.
  • matrix is an array of custom structs with attributes a, b and c (roughly sketched after this list)
  • someValue1, someValue2 and someValue3 are doubles
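For reference, the struct is roughly along these lines (simplified; the member used on the right-hand side of the multiplication is a placeholder):

struct Cell {                                  // illustrative name, the real struct differs
    double a, b, c;                            // the attributes written in the loops
    double value;                              // placeholder operand
    operator double() const { return value; }  // so matrix[j][k] * someValue1 compiles
};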

I know that OpenMP is not always good for nested loops but any help is welcome.

[UPDATE]:

So far I've tried unrolling the loops. It boosts performance, but am I adding unnecessary overhead here? Am I reusing threads? I tried getting the id of the threads used in each for loop but didn't get it right.

#pragma omp parallel
        {
#pragma omp for collapse(2)
            for (int j = 0; j < HEIGHT; ++j) {
                for (int k = 0; k < WIDTH; ++k) {
                    //my previous code here
                }
            }
#pragma omp for collapse(2)
            for (int j = 0; j < HEIGHT; ++j) {
                for (int k = 0; k < WIDTH; ++k) {
                    //my previous code here
                }
            }
#pragma omp for collapse(2)
            for (int j = 0; j < HEIGHT; ++j) {
                for (int k = 0; k < WIDTH; ++k) {
                    //my previous code here
                }
            }
        }
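To see which thread handles which iterations, what I was attempting was roughly this (a sketch: printing omp_get_thread_num() from <omp.h>, together with <cstdio>, inside one of the parallelized loops on a small matrix):

#pragma omp parallel
{
#pragma omp for collapse(2)
    for (int j = 0; j < HEIGHT; ++j) {
        for (int k = 0; k < WIDTH; ++k) {
            // omp_get_thread_num() returns the id of the thread running this
            // iteration, so the output shows how the (j, k) space is split
            // and whether the same team shows up in every loop
            printf("j=%d k=%d thread=%d\n", j, k, omp_get_thread_num());
            // my previous code here
        }
    }
}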

[UPDATE 2]

Apart from unrolling the loop, I have tried parallelizing the outer loop (worse performance boost than unrolling) and collapsing the two inner loops (more or less the same performance boost as unrolling). These are the times I am getting:

  • Serial: ~130 ms
  • Loop unrolling: ~49 ms
  • Collapsing two innermost loops: ~55 ms
  • Parallel outermost loop: ~83 ms

What do you think is the safest option? I mean, which should generally be the best for most systems, not only my computer?

danielsto

3 Answers


You probably want to parallelize this example with simd so the compiler can vectorize, and to collapse the loops, because you use j and k only in the expression matrix[j][k] and there are no dependencies on any other element of the matrix. If nothing modifies someValue1 and so on, they should be uniform. Time your loop to be sure these changes really do improve your speed.
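A sketch of what that could look like, with the three i cases fused into a single pass. The Cell struct, its conversion to double and the scale() function are stand-ins here, since the real struct isn't shown in the question, and parallel for simd needs OpenMP 4.0 or later:

#include <vector>

struct Cell {                                   // stand-in for the real struct
    double a = 0, b = 0, c = 0;
    double value = 0;                           // assumed source operand
    operator double() const { return value; }  // lets matrix[j][k] * someValue compile
};

using Matrix = std::vector<std::vector<Cell>>;

void scale(Matrix& matrix, double someValue1, double someValue2, double someValue3)
{
    const int height = static_cast<int>(matrix.size());
    const int width  = static_cast<int>(matrix[0].size());

    // collapse(2) merges j and k into one iteration space for the thread team,
    // and simd asks the compiler to vectorize the resulting loop body.
    #pragma omp parallel for simd collapse(2)
    for (int j = 0; j < height; ++j) {
        for (int k = 0; k < width; ++k) {
            const double v = matrix[j][k];      // read once, reused for a, b and c
            matrix[j][k].a = v * someValue1;
            matrix[j][k].b = v * someValue2;
            matrix[j][k].c = v * someValue3;
        }
    }
}

Fusing the cases also means each element is read once instead of three times, which should matter at 4096x4096.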

Davislor

The problem with OpenMP is that it's very high-level, meaning that you can't access low-level functionality such as spawning a thread and then reusing it. So let me make clear what you can and can't do:

Assuming you don't need any mutex to protect against race conditions, here are your options:

  1. You parallelize your outermost loop; that will use 3 threads, and it's the most peaceful solution you're going to get.

  2. You parallelize the first inner loop, and then you'll get a performance boost only if the overhead of spawning a new thread for every WIDTH element is much smaller than the effort required to perform the innermost loop.

  3. You parallelize the innermost loop, but this is the worst solution in the world, because you'll respawn the threads 3*HEIGHT times. Never do that!

  4. You don't use OpenMP, and use something lower-level, such as std::thread, where you can create your own thread pool and push all the operations you want to do into a queue (a minimal sketch follows this list).
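For illustration, a minimal sketch of option 4. The ThreadPool class, its submit() method and the process() function are made up for this example (they are not an existing API), and it needs -std=c++11 -pthread or later to compile:

#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

class ThreadPool {
public:
    explicit ThreadPool(unsigned n) {
        for (unsigned i = 0; i < n; ++i)
            workers.emplace_back([this] { workerLoop(); });
    }

    ~ThreadPool() {                        // drains the queue, then joins
        {
            std::lock_guard<std::mutex> lock(m);
            done = true;
        }
        cv.notify_all();
        for (auto& w : workers) w.join();
    }

    void submit(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lock(m);
            tasks.push(std::move(task));
        }
        cv.notify_one();
    }

private:
    void workerLoop() {
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m);
                cv.wait(lock, [this] { return done || !tasks.empty(); });
                if (done && tasks.empty()) return;
                task = std::move(tasks.front());
                tasks.pop();
            }
            task();                        // run outside the lock
        }
    }

    std::vector<std::thread> workers;      // threads are spawned once and reused
    std::queue<std::function<void()>> tasks;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;
};

// Usage: one task per value of i; leaving the scope acts as a barrier,
// because the destructor only joins after the queue has been drained.
void process() {
    ThreadPool pool(3);
    for (int i = 0; i < 3; ++i)
        pool.submit([i] {
            // the j/k loops for this value of i go here
        });
}

Pushing one task per value of i keeps it at 3 units of work; for finer load balancing you could instead push one task per row or per block of rows.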

Hope this helps to put things in perspective.

The Quantum Physicist
  • Would it be better if I post some example `HEIGHT` and `WIDTH`? When you say parallelize some loop you mean using only `#pragma omp parallel for` without any `collapse(n)` or another clause, right? Would you ever consider collapsing any of these loops? – danielsto Nov 19 '16 at 11:21
  • Good OpenMP libraries do use a thread pool and do re-use them. They don't start a new thread every time. There is still a lot of overhead with synchronizations of course. Collapse will be a good thing here. – Vladimir F Героям слава Nov 19 '16 at 11:34
  • @danielsto Using collapse is a good idea, and if Vladimir is right, then you'd be lucky if a thread pool is automatically used in OpenMP, but that's not the experience I had with it. An example will unfortunately not help because this is very dependent on your system. All you can do is plan, try and study case-by-case. – The Quantum Physicist Nov 19 '16 at 11:37
  • @VladimirF Actually thinking about it, if a library does use a thread pool, then collapse is useless. It won't improve performance at all. Right? – The Quantum Physicist Nov 19 '16 at 11:38
  • I get that this is very dependent on the system (and unfortunately it's going to be tested for grading on another system that I don't have access to). Is there any way of getting a general performance boost? I mean, what is the most reasonable approach given the data that I provide? Which loops would you collapse? – danielsto Nov 19 '16 at 11:40
  • From what I have tested so far (with the 4096x4096 size) I get pretty much the same performance boost parallelizing the outermost loop as collapsing the two innermost loops. – danielsto Nov 19 '16 at 11:43
  • @danielsto A rule of thumb in computing is: the more general your code is, the less performance you'll get. For example, BLAS is a famous API for matrix multiplications. If you create a single implementation to run it everywhere, then it's not the best. OpenBLAS creates an implementation depending on your processor, and that's the best you can get. You see the point? So you have to go more general, and sacrifice performance, but follow the general guidelines that are not system dependent. Try to minimize the number of thread-spawns and you'll be fine. That's the best you can do, I guess. – The Quantum Physicist Nov 19 '16 at 11:44
  • Collapse is not useless, scheduling the work in a thread pool costs a lot of overhead. Static scheduling in collapsed loops is simple and effective. – Vladimir F Героям слава Nov 19 '16 at 11:46
  • @VladimirF I see what you mean, so it's just about preparing everything beforehand. But this assumes that all the operations will take about the same time, which is not always the case, but may be the case here. – The Quantum Physicist Nov 19 '16 at 11:47
  • I understand. That's why I find it a bit weird that I will be graded (partially) on the amount of performance I can get on a system whose details I don't even know. Thank you anyway, @TheQuantumPhysicist! – danielsto Nov 19 '16 at 11:47
  • @TheQuantumPhysicist I think they will take the same amount of time, since each iteration does actually the same: get three doubles from a matrix of custom structs, multiply them by another one (constant) and store them back again. – danielsto Nov 19 '16 at 11:50
  • @danielsto Then go ahead and experiment with collapse, and choose the best. Actually, just FYI, some libraries tweak these parameters with tests before getting you to use their library, such as ATLAS. It runs tests on your system and measures what best solutions are available for parallelization, and then decides how to make the BLAS calls do best performance on your particular system. – The Quantum Physicist Nov 19 '16 at 11:52

Here's another option, one which recognises that distributing the iterations of the outermost loop when there are only 3 of them might lead to very poor load balancing:

// i = 0
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}

// i = 1
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}

// i = 2
#pragma omp parallel for
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        ...
    }
}

Warning -- check the syntax yourself, this is no more than a sketch of manual loop unrolling.

Try combining this and collapsing the j and k loops.
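For example (same caveat about checking the syntax yourself):

// i = 0 case, with the two remaining loops collapsed into one iteration space;
// repeat for i = 1 (writing .b) and i = 2 (writing .c)
#pragma omp parallel for collapse(2)
for(int j = 0; j < HEIGHT; ++j){
    for(int k = 0; k < WIDTH; ++k){
        matrix[j][k].a = matrix[j][k] * someValue1;
    }
}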

Oh, and don't complain about code duplication, you've told us you're being scored partly on performance improvements.

High Performance Mark
  • What's the difference between this and putting this in a loop that loops over `i`? I don't get it. – The Quantum Physicist Nov 19 '16 at 11:57
  • Not sure if I understood. Do you mean that leaving the outermost loop like it was might lead to poor load balancing? So unrolling the loop would lead to better load balancing, right? – danielsto Nov 19 '16 at 18:09