OpenMP double for loop array with stored results

Question

I've spent time going over other posts, but I still can't get this simple program to work.

#include<iostream>
#include<cmath>
#include<omp.h>
using namespace std;

int main()
{
int threadnum =4;//want manual control
int steps=100000,cumulative=0, counter;
int a,b,c;
float dum1, dum2, dum3;
float pos[10000][3] = {0};
float non=0;
//RNG declared

#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
    for(int dummy=0;dummy<(10000/threadnum);dummy++)
    {
            dum1=0,dum2=0,dum3=0;
            a=0,b=0,c=0;
            for (counter=0;counter<steps;counter++)
            {
                dum1 = somefunct1()+rand();
                dum2=somefunct2()+rand();
                dum3 = somefunct3(dum1, dum2, ...);

                a += somefunct4(dum1,dum2,dum3, ...);
                b += somefunct5(dum1,dum2,dum3, ...);
                c += somefunct6(dum1,dum2,dum3, ...);

                cumulative++; //count number of loops executed
            }
            pos[dummy][0] = a;//saves results of second loop to array
            pos[dummy][1] = b;
            pos[dummy][2] = c;
            non+= pos[dummy][0];//holds the summed a values
        }
}
}

I've cut the program down to fit here. When I make changes (and I have tried many), the inner loop often does not execute the correct number of times, and cumulative ends up at something like 32,532,849 instead of 1 billion. The code above speeds up by about 2x, but it should scale much better.

I want the code to simply split the first (10000-iteration) for loop so that each thread runs a share of the iterations in parallel (if the split could be dynamic, that would be nice) and saves the result of each run of the second for loop to the results array. The steps of the second for loop depend on one another, so it cannot be split. Currently the order of the 'dummy' iterations does not matter (pos[345] can be swapped with pos[3456] as long as all three entries are swapped), but I will have to modify it later so the order does matter.
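A minimal sketch of the behaviour described above, not the original program: the outer loop is shared out dynamically, the inner loop stays serial, and each outer iteration writes only its own row. The function `run_outer`, the struct `RunTotals`, and the integer arithmetic in the inner loop are hypothetical stand-ins for `somefunct1` to `somefunct6`, and `long long` totals are used here instead of the question's `int` and `float`.

```cpp
#include <array>
#include <vector>

// Hypothetical container for the two reduced totals.
struct RunTotals {
    long long cumulative;  // number of inner-loop steps executed
    long long non;         // sum of the first column of pos
};

// Hypothetical stand-in for the program in the question.
RunTotals run_outer(std::vector<std::array<int, 3>>& pos, int steps) {
    const int n = static_cast<int>(pos.size());
    long long cumulative = 0;
    long long non = 0;

    // OpenMP splits the full range 0..n-1; schedule(dynamic) hands out
    // iterations to whichever thread is free. Each thread gets private
    // copies of the reduction variables, which are summed at the end.
    #pragma omp parallel for schedule(dynamic) reduction(+ : cumulative, non)
    for (int dummy = 0; dummy < n; ++dummy) {
        // Declared inside the loop body, so private to each thread.
        int a = dummy, b = dummy + 1, c = dummy + 2;

        // Serial chain: every step depends on the previous one.
        for (int counter = 0; counter < steps; ++counter) {
            a = (a * 31 + counter) % 1009;
            b = (b * 17 + a) % 1013;
            c = (c * 13 + b) % 1019;
            ++cumulative;
        }

        // Each iteration writes only its own row, so there is no race.
        pos[dummy][0] = a;
        pos[dummy][1] = b;
        pos[dummy][2] = c;
        non += a;
    }
    return {cumulative, non};
}
```

Calling `run_outer(pos, 100000)` on a vector of 10000 rows would correspond to the sizes in the question. Because every row depends only on its own index, the stored results should not depend on the number of threads in this sketch.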

The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of random-number calls and math functions in the inner loop. Is there overhead here that is causing a problem? I'm using the GNU compiler (GCC) 4.9.2 on Windows.

Any help would be greatly appreciated.

Edit: finally fixed. I moved the RNG declaration inside the parallel region, just before the first for loop. Now I get 3.75x scaling going to 4 threads and 5.72x scaling on 8 threads (hyperthreads). Not perfect, but I will take it. I still think there is an issue with thread locking and syncing.

......
float non=0;    
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
    //RNG declared
    #pragma omp for
    for(int dummy=0;dummy<10000;dummy++) //full range: omp for divides it among the threads
    {
....
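The question does not say which RNG is used, so the following is only a sketch of the idea in the edit, assuming a `std::mt19937` generator: declaring it inside the parallel region gives every thread its own generator state, so threads do not share or contend for one generator. The function name `fill_random` and the seeding scheme (base seed plus thread number) are assumptions made for illustration.

```cpp
#include <random>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Hypothetical helper: fills a vector with uniform random floats.
std::vector<float> fill_random(int n, unsigned base_seed) {
    std::vector<float> out(n);

    #pragma omp parallel
    {
        // Thread number, or 0 when compiled without OpenMP.
        int tid = 0;
#ifdef _OPENMP
        tid = omp_get_thread_num();
#endif
        // Declared inside the parallel region: one generator per thread.
        // Seeding by thread number is an assumption for this sketch.
        std::mt19937 rng(base_seed + static_cast<unsigned>(tid));
        std::uniform_real_distribution<float> dist(0.0f, 1.0f);

        // The loop iterations are divided among the existing threads.
        #pragma omp for
        for (int dummy = 0; dummy < n; ++dummy) {
            out[dummy] = dist(rng);
        }
    }
    return out;
}
```

With this seeding the values depend on how many threads run and on how iterations are assigned to them. If reproducible output matters, one alternative would be to seed a generator from the loop index inside the loop body instead.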

It looks like you're missing a `for` in your pragma directive. It should look like one of [these examples](http://stackoverflow.com/q/1448318/1560509) — memo1288, Jun 19 '15 at 18:21
I tried that. Inserting the `for` right after `parallel` creates an error. Inserting `#pragma omp for` just before the second loop only runs 10000/threadnum loops, and there is still no change in speedup. — I am tired, Jun 19 '15 at 19:03
Change `10000/threadnum` to `10000`, because OpenMP takes care of dividing the loop between the threads. Also, in the first option, you would remove the { between the pragma and the for. — memo1288, Jun 19 '15 at 19:08
Done that and it works. However the speedup is the exact same as the original code, 2x going from 1->4 threads. Something is causing a lot of overhead and I have no clue what that is. For performance purposes 1 thread gives 8N calculations, 2 threads gives 10N, 4 threads gives 15N and 8 threads (hyperthreading) gives 20N. The code however is completely independent and I should get a much better speedup. — I am tired, Jun 19 '15 at 19:47
If you go for the combined `#pragma omp parallel for ...`, you should remove the extra set of curly braces that surround the loop. The OpenMP `for` construct expects a loop to immediately follow it, not a block (i.e. `{ ... }`). — Hristo Iliev, Jun 20 '15 at 22:09
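The two forms discussed in these comments can be sketched side by side. This is an illustration only; the function names `visits_combined` and `visits_split` are hypothetical. In both forms the loop bound is the full range and OpenMP does the division, so each index should be visited exactly once.

```cpp
#include <vector>

// Combined form: the for loop follows the pragma directly, with no
// extra pair of braces in between.
std::vector<int> visits_combined(int n) {
    std::vector<int> hits(n, 0);
    #pragma omp parallel for
    for (int i = 0; i < n; ++i) {
        hits[i] += 1;  // each index is handled by exactly one thread
    }
    return hits;
}

// Split form: the parallel region is a block, and the work-sharing
// "omp for" inside it divides the loop among the existing threads.
std::vector<int> visits_split(int n) {
    std::vector<int> hits(n, 0);
    #pragma omp parallel
    {
        // Per-thread setup (for example an RNG) could go here.
        #pragma omp for
        for (int i = 0; i < n; ++i) {
            hits[i] += 1;
        }
    }
    return hits;
}
```

Without the `for` work-sharing directive, every thread in the region would run the whole loop itself, which is why the question's version needed the manual `10000/threadnum` bound and had all threads writing to the same rows.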
