
I am planning to use OpenMP threads for an intensive computation. However, I could not achieve the performance I expected on my first attempt. I suspect there are several issues, but I am not sure yet. In general, I think the performance bottleneck comes from the fork and join model. Can you help me with a few things? First, in a routine cycle running on a consumer thread, there are two independent for loops and some additional functions. The functions are located at the end of the routine cycle and between the for loops, as shown below:

void routineFunction(short* xs, float* xf, float* yf, float* h)
{       
    // Casting
    #pragma omp parallel for
    for (int n = 0; n<1024*1024; n++) 
    {
        xf[n] = (float)xs[n];
    }

    memset(yf,0,1024*1024*sizeof( float ));
    // Filtering
    #pragma omp parallel for
    for (int n = 0; n<1024*1024-1024; n++)
    {
        for(int nn = 0; nn<1024; nn++)
        {   
            yf[n]+=xf[n+nn]*h[nn];
        }
    }
    status = DftiComputeBackward(hand, yf, yf); // Compute backward transform
}

Note: This code does not compile; I simplified it and removed details to make it more readable.

The OpenMP thread count is set to 8 dynamically. I watched the threads in use in the Windows Task Manager. Although the number of threads increased significantly, I did not observe any performance improvement. I have some guesses, but I would still like to discuss them with you before further implementations.

My questions are these.

  1. Does the fork and join model correspond to thread creation and destruction? Does it have the same cost for the software?

  2. Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?

  3. During the execution of routineFunction, do the OpenMP threads fork and join at each for loop? Or does the compiler let the second loop reuse the existing threads? If the two for loops cause two fork/join cycles, how should I rearrange the code? Is combining the two loops into a single loop a sensible way to save performance, or is opening a parallel region (#pragma omp parallel) and using #pragma omp for (rather than #pragma omp parallel for) the better choice for sharing the work? What worries me is that this forces static scheduling on me through the thread id and thread count (see the sketch after this list for what I mean). According to the document, on page 34, static scheduling can cause load imbalance. I am actually familiar with static scheduling from CUDA programming, but I would still like to avoid it if it causes any performance issue. I also read a Stack Overflow answer by Alexey Kukanov which, in its last paragraph, points out that smart OpenMP implementations do not join the master thread after a parallel region is completed. How can I use the busy-wait and sleep attributes of OpenMP to avoid joining the master thread after the first loop is completed?

  4. Is there another reason for the performance issue in the code?
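
For clarity, here is only a sketch of what I mean by sharing the work manually with thread ids, which hard-codes a static split (manualStaticSplit and its arguments are placeholders, not my real code):

#include <omp.h>

// Illustration only: N would be 1024*1024 in my case.
void manualStaticSplit(const short* xs, float* xf, int N)
{
    #pragma omp parallel
    {
        int tid  = omp_get_thread_num();   // this thread's id
        int nthr = omp_get_num_threads();  // team size

        // One contiguous block per thread: a hand-rolled static schedule.
        int chunk = (N + nthr - 1) / nthr;
        int begin = tid * chunk;
        int end   = (begin + chunk < N) ? (begin + chunk) : N;

        for (int n = begin; n < end; n++)
            xf[n] = (float)xs[n];
    }
    // The alternative is to let the runtime partition the iterations:
    //     #pragma omp parallel for schedule(static)
}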

Abdullah
  • Two questions: How do you measure the time? What percentage of your sequential wall clock time does the parallelized region correspond to? – Gilles Jan 13 '17 at 14:02
  • I used clock(); its time resolution is 1 millisecond. However, I ran routineFunction 100000 times and took the average. – Abdullah Jan 13 '17 at 14:10
  • See [this](http://stackoverflow.com/q/10673732/5239503) to understand why using `clock()` is just the worst idea ever when it comes to timing parallel efficiency. – Gilles Jan 13 '17 at 14:59
  • By chance, on Windows `clock()` does what the OP expects. But yes, use `omp_get_wtime()` instead. – Hristo Iliev Jan 14 '17 at 23:11

2 Answers


Alright, it's been a while since I did OpenMP stuff so hopefully I didn't mess any of this up... but here goes.

  • Forking and joining is the same thing as creating and destroying threads. How the cost compares to other threads (such as a C++11 thread) will be implementation dependent. I believe in general OpenMP threads might be slightly lighter-weight than C++11 threads, but I'm not 100% sure about that. You'd have to do some testing.

  • Currently, each time routineFunction is called you will fork for the first for loop, join, do the memset, fork for the second loop, join, and then call DftiComputeBackward.

  • You would be better off creating a parallel region as you stated. Not sure why the scheduling is an extra concern. It should be as easy as moving your memset to the top of the function, starting a parallel region using your noted command, and making sure each for loop is marked with #pragma omp for as you mentioned. You may need to put an explicit #pragma omp barrier in between the two for loops to make sure all threads finish the first for loop before starting the second... OpenMP has some implicit barriers but I forgot if #pragma omp for has one or not.

  • Make sure that the OpenMP compile flag is turned on for your compiler. If it isn't, the pragmas will be ignored; the code will compile, but nothing will be different (see the quick check after this list).

  • Your operations are prime for SIMD acceleration. You might want to see if your compiler supports auto-vectorization and if it is doing it. If not, I'd look into SIMD a bit, perhaps using intrinsics.

  • How much time does DftiComputeBackward take relative to this code?
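
Regarding the compile flag mentioned above, here is a minimal check (just a sketch) that relies only on the standard _OPENMP macro and omp_get_max_threads(); build it with /openmp (MSVC) or -fopenmp (GCC/Clang):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    // _OPENMP holds the supported spec date (yyyymm) when the flag is on.
    printf("OpenMP enabled (_OPENMP = %d), max threads = %d\n",
           _OPENMP, omp_get_max_threads());
#else
    printf("OpenMP is NOT enabled - the pragmas are being ignored.\n");
#endif
    return 0;
}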

RyanP
  • Yes, I have already used SIMD instructions (AVX 256). The speedup scaled quite linearly, but it is still not enough. Since I need an even faster program, I have to use multi-threading as well. DftiComputeBackward consumes 30% of the total time. – Abdullah Jan 16 '17 at 12:06

This is mostly memory-bound code. Its performance and scalability are limited by the amount of data the memory channel can transfer per unit time. xf and yf take 8 MiB in total, which fits in the L3 cache of most server-grade CPUs but not of most desktop or laptop CPUs. If two or three threads are already able to saturate the memory bandwidth, adding more threads is not going to bring additional performance. Also, casting short to float is a relatively expensive operation - 4 to 5 cycles on modern CPUs.
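
One way to verify the bandwidth-saturation hypothesis is to time only the filtering loop for increasing thread counts with omp_get_wtime() (a rough sketch; scalingTest is not part of your code and assumes the arrays are allocated and filled elsewhere):

#include <stdio.h>
#include <omp.h>

void scalingTest(const float* xf, float* yf, const float* h)
{
    for (int nthr = 1; nthr <= 8; nthr *= 2)
    {
        omp_set_num_threads(nthr);
        double t0 = omp_get_wtime();

        #pragma omp parallel for
        for (int n = 0; n < 1024*1024-1024; n++)
        {
            float acc = 0.0f;
            for (int nn = 0; nn < 1024; nn++)
                acc += xf[n+nn]*h[nn];
            yf[n] = acc;
        }

        printf("%d thread(s): %.3f s\n", nthr, omp_get_wtime() - t0);
    }
}

If the time stops improving after two or three threads, the loop is limited by the memory subsystem rather than by the number of cores.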

Does the fork and join model correspond to thread creation and destruction? Does it have the same cost for the software?

Once routineFunction is called by the consumer, do the OpenMP threads fork and join every time?

No. Basically all OpenMP runtimes, including that of MSVC++, implement parallel regions using thread pools, as this is the easiest way to satisfy the requirement of the OpenMP specification that thread-private variables retain their values between the different parallel regions. Only the very first parallel region pays the full cost of starting new threads. Subsequent regions reuse those threads, and an additional price is paid only if more threads are needed than in any of the previously executed parallel regions. There is still some overhead, but it is far lower than that of starting new threads each time.
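
To illustrate that requirement, here is a standalone sketch (not part of the code in question): a threadprivate variable keeps its value from one parallel region to the next, which only works because the same pooled threads are reused (the specification guarantees this when, among other things, the thread count stays the same and dynamic adjustment is disabled):

#include <stdio.h>
#include <omp.h>

int counter = 0;
#pragma omp threadprivate(counter)

int main(void)
{
    omp_set_dynamic(0);          // keep the team size constant between regions

    #pragma omp parallel         // first region: threads are actually created
    counter = omp_get_thread_num();

    #pragma omp parallel         // second region: the pooled threads are reused,
    {                            // so each one still sees the value it stored above
        printf("thread %d: counter = %d\n", omp_get_thread_num(), counter);
    }
    return 0;
}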

During the execution of routineFunction, do the OpenMP threads fork and join at each for loop? Or does the compiler let the second loop reuse the existing threads?

Yes, in your case two separate parallel regions are created. You can manually merge them into one:

#pragma omp parallel
{
    // Casting
    #pragma omp for
    for (int n = 0; n < 1024*1024; n++)
    {
        xf[n] = (float)xs[n];
    }
    // implicit barrier at the end of the for construct

    #pragma omp single
    {
        memset(yf, 0, 1024*1024*sizeof(float));
        //
        // Other code that was between the two parallel regions
        //
    }
    // implicit barrier at the end of the single construct

    // Filtering
    #pragma omp for
    for (int n = 0; n < 1024*1024-1024; n++)
    {
        for (int nn = 0; nn < 1024; nn++)
        {
            yf[n] += xf[n+nn]*h[nn];
        }
    }
}
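
As marked in the comments, both the for and the single constructs end with an implicit barrier. If you want to remove the first one, the casting loop can take a nowait clause; the barrier at the end of the single region is sufficient here because the memset only writes yf. A sketch of that variant (routineFunctionMerged is just an illustrative name):

#include <string.h>

void routineFunctionMerged(short* xs, float* xf, float* yf, float* h)
{
    #pragma omp parallel
    {
        #pragma omp for nowait               // no barrier needed after this loop
        for (int n = 0; n < 1024*1024; n++)
            xf[n] = (float)xs[n];

        #pragma omp single
        {
            memset(yf, 0, 1024*1024*sizeof(float));
        }
        // The implicit barrier at the end of 'single' still guarantees that
        // every thread has finished its share of the casting loop before any
        // thread starts filtering.

        // Filtering
        #pragma omp for
        for (int n = 0; n < 1024*1024-1024; n++)
            for (int nn = 0; nn < 1024; nn++)
                yf[n] += xf[n+nn]*h[nn];
    }
    // DftiComputeBackward(...) stays outside, executed by the calling thread.
}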

Is there another reason for the performance issue in the code?

It is memory-bound, or at least the two loops shown here are.

Hristo Iliev
  • I observed that routineFunction suffers mostly in the 2nd for loop, because it has more computations. I thought the inner loop of the 2nd one fits in the L1 cache, so why is it still memory-bound? By the way, I used a Xeon E5-4650 based processor; it has 8 cores and 16 threads, with 8x32 KB L1 (8-way) and 20 MB L3 (20-way). – Abdullah Jan 17 '17 at 13:38