How can I improve the perfomance of my OpenMP code?

Question

I am currently trying to improve parallel performance on my Code and I am still new to OpenMP. I have to iterate over a large container, in each iteration reading from multiple entries and writing a result to a single entry. Below is a very minmal Code example of what I am trying to do.

data is a pointer to an array, where a lot of datapoints are stored. Before the parallel region I create an Array newData, so can use data as read-only and newData as write-only, afterwards I throw the old data away and use newDatafor further calculations. To my understanding data and newDataare shared between threads and everything declared inside the parallel region is private. Can reading from databy multiple threads cause performance issues?

I am using #critical for assigning a new value to an element of newData to avoid race conditions. Is this necessary, since I access every element of newDataonly once and never by multiple threads?

Also I am not sure about scheduling. Do I have to specify if I want a static or dynamic schedule? Can I use nowait since all threads are idependent of each other?

array *newData = new array;

omp_set_num_threads (threads);

#pragma omp parallel
{
    #pragma omp for
    for (int i = 0;  i < range; i++)
    {
        double middle = (*data)[i];
        double previous = (*data)[i-1];
        double next = (*data)[i+1];

        double new_value = (previous + middle + next) / 3.0;
        #pragma omp critical(assignment)
        (*newData)[i] = new_value;
    }
}

delete data;
data = newData;

I am aware that in the first and last iteration previous and next can not be read from data, in the real code this is taken care of but for this minimal example you get the idea of reading multiple times from data.

as a first step, always identify the bottleneck, which is... (left as an exercise) — Karoly Horvath, Sep 28 '16 at 15:10
Are you sure that this code is parallelizable? There are dependencies that could not be satisfied if executing the code in parallel. — Harald, Sep 28 '16 at 15:21
Since I only use datapoints from a previous timestep (`data`) to calculate datapoint in a new timestep `newData` I don't sees any dependencies that break parallelization per se. — phirus, Sep 28 '16 at 15:32
At first glance, the code looks fine. Just use a `schedule(static)` and get rid of the `critical`. And be sure to compile with maximum optimization enabled. — Gilles, Sep 28 '16 at 16:09
So the `critical` is not necessary? I was not sure about this. Thanks — phirus, Sep 28 '16 at 16:31
Please note that you can easily fill [an entire lecture](http://www3.cs.fau.de/Lehre/ArchSup/WS2011/slides.pdf) with optimizing this seemingly simple code. In practice most of the time you should just cleanly express the algorithm to the compiler and be done. If you optimize further, the most important thing is to measure! — Zulan, Sep 29 '16 at 10:47

score 5 · Answer 1 · answered Sep 29 '16 at 09:26

First of all, get rid of all unnecessary dependencies. #pragma omp critical(assignment) is not necessary because each index of (*newData) is only written to once per loop, so there's no race condition.

Your code could now look like this:

#pragma omp parallel for
for (int i = 0; i < range; i++)
   (*newData)[i] = ((*data)[i-1] + (*data)[i] + (*data)[i+1]) / 3.0;

Now we're looking for bottlenecks. The list of potential candidates I came up with is this:

Slow division
Cache thrashing
ILP (Instruction level parallelism)
Memory bandwith limitations
Hidden dependencies

So let's analyze them further.

Slow division: It takes some CPUs forever to calculate double/double. To know how long and what througput your CPU has, you have to look at its specs. Maybe replacing /3.0 with *0.3333.. might help, but maybe your compiler does this already. Using extended instruction sets (like SSE/AVX) you might shedule several divisions/multiplications at once.

Cache thrashing: Because your CPU has to load/store one cache line at a time there could be conflicts. Imagine if thread 1 tries to write to (*newdata)[1] and thread 2 to (*newdata)[2] and they are on the same cache line. Now one of them has to wait for the other. You could resolve this with #pragma omp parallel for schedule(static, 64).

ILP: CPUs can schedule multiple operations into a pipeline if the operations are independent. For this to happen you have to unroll your loop. This could look like this:

assert(range % 4 == 0);
#pragma omp parallel for
for (int i = 0; i < range/4; i++) {
   (*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) / 3.0;
   (*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) / 3.0;
   (*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) / 3.0;
   (*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) / 3.0;
}

Memory bandwith limitations: For your very simple loop think about this. How much memory do you have to load and how long will your CPU be busy processing it. You're loading about 1 cache line and computing some dereferences, some pointer addition, two additions and one division. Which limit you hit depends on your CPU specs. Now consider cache locality. Can you modify your code to make better use of the cache? If one thread gets i=3 in one loop-iteration, and i=7 in the next, you have to reload 3 (*data)'s. But if you would go from i=3 to i=4, you might not have to load anything, because (*data)[i+1] was in the cacheline previously loaded. You save some RAM bandwith. To make use of this, unroll the loop. Also using float instead of double increases this chance.

Hidden dependencies: Now this part I personally find very tricky. Sometimes your compiler isn't shure it can reuse some data, because it doesn't know it hasn't changed. Using const helps the compiler. But sometimes you need a restrict to give the compiler the right hint. But I don't understand this well enough to explain it.

So here is what I would try:

const double ONETHIRD = 1.0 / 3.0;
assert(range % 4 == 0);
#pragma omp parallel for schedule(static, 1024)
for (int i = 0; i < range/4; i++) {
   (*newData)[i*4+0] = ((*data)[i*4-1] + (*data)[i*4+0] + (*data)[i*4+1]) * ONETHIRD;
   (*newData)[i*4+1] = ((*data)[i*4+0] + (*data)[i*4+1] + (*data)[i*4+2]) * ONETHIRD;
   (*newData)[i*4+2] = ((*data)[i*4+1] + (*data)[i*4+2] + (*data)[i*4+3]) * ONETHIRD;
   (*newData)[i*4+3] = ((*data)[i*4+2] + (*data)[i*4+3] + (*data)[i*4+4]) * ONETHIRD;
}

And then benchmark. Benchmark some more, and benchmark some more. Only benchmarks will show you which tricks help.

PS: One more thing to consider. If you see your program hitting the memory bandwith hard. You could consider changing the algorithm. Maybe fuse two steps into one. Like going from b[i] := (a[i-1] + a[i] + a[i+1]) / 3.0 to d[i] := (n[i-1] + n[i] + n[i+1]) / 3.0 = (a[i-2] + 2.0 * a[i-1] + 3.0 * a[i] + 2.0 * a[i+1] + a[i+1]) / 3.0. I think the reason for this you will find out yourself.

Have fun optimizing ;-)

score 2 · Accepted Answer · edited May 23 '17 at 11:48

Reading an array by multiple threads usually does no harm.
You only need a critical section if multiple threads work on the exact same piece of data, here each thread accesses a different part of the array so you dont need it. Critical sections are very bad for performance so only use them if absolutely necessary. Often they can be replaced by atomic actions: openMP, atomic vs critical? Like a critical section, they dont make sense if each thread accesses different data.
For the scheduler its best to test them each and measure the performance as predictions about performance are often wrong. Also try different chunk sizes.
Some other things that might help:
- Measuring performance is often interferred by other tasks on your pc so take multiple measurements and take their minimum (except if the input is different each time, then take the average and do more measurements).
- Do you really need double precision? Floats are a lot faster.
edit: nowait is for multiple independent for loops: https://msdn.microsoft.com/en-us/library/ek5st0e3.aspx

score 0 · Answer 3 · answered Sep 28 '16 at 18:39

I assume you are trying to do some kind of convolution or median blur with 1D array. The short answer is: stick to default schedule strategy, and get rid of critical at all.

As I can tell, you are a quit newbie to parallelism, it's a little bit confusion to deal with OpenMP directives, like nowait/private/reduction/critical/atomic/single, etc. I think what you need is a well written textbook to clarify various concept. If you had a sound knowledge, a hour of learning OpenMP could be enough to deal with most daily programming.

How can I improve the perfomance of my OpenMP code?

3 Answers3

Linked