I've spent time going over other posts but I still can't get this simple program to go.
#include<iostream>
#include<cmath>
#include<omp.h>
using namespace std;
int main()
{
int threadnum =4;//want manual control
int steps=100000,cumulative=0, counter;
int a,b,c;
float dum1, dum2, dum3;
float pos[10000][3] = {0};
float non=0;
//RNG declared
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
for(int dummy=0;dummy<(10000/threadnum);dummy++)
{
dum1=0,dum2=0,dum3=0;
a=0,b=0,c=0;
for (counter=0;counter<steps;counter++)
{
dum1 = somefunct1()+rand();
dum2=somefunct2()+rand();
dum3 = somefunct3(dum1, dum2, ...);
a += somefunct4(dum1,dum2,dum3, ...);
b += somefunct5(dum1,dum2,dum3, ...);
c += somefunct6(dum1,dum2,dum3, ...);
cumulative++; //count number of loops executed
}
pos[dummy][0] = a;//saves results of second loop to array
pos[dummy][1] = b;
pos[dummy][2] = c;
non+= pos[dummy][0];//holds the summed a values
}
}
}
I've cut down the program to get it to fit here. A lot of times if I make changes, and I've tried a lot, a lot of time the inner loop simply does not execute the correct number of times and I get cumulative equal to something like 32,532,849 instead of 1 billion. Scaling is about 2x for the code above but should be much higher.
I want the code to simply break the first 10000 iteration for loop so that each thread runs a certain number of iterations in parallel (if this could be dynamic that would be nice) and saves the results of each iteration of the second for loop to the results array. The second for loop is composed of dependents and cannot be broken. Currently the order of the 'dummy' iterations do not matter (can switch pos[345] with pos[3456] as long as all three indices are switches) but I will have to modify it later so it does matter.
The numerous variables and initializations in the inner loop are confusing me terribly. There are a lot of random calls and functions/math functions in the inner loop - is there overhead here that is causing a problem? I'm using GNU 4.9.2 on windows.
Any help would be greatly appreciated.
Edit: finally fixed. Moved the RNG declaration inside the first for loop. Now I get 3.75x scaling going to 4 threads and 5.72x scaling on 8 threads (hyperthreads). Not perfect but I will take it. I still think there is an issue with thread locking and syncing.
......
float non=0;
#pragma omp parallel private(dum1,dum2,dum3,counter,a,b,c) reduction (+: non, cumulative) num_threads(threadnum)
{
//RNG declared
#pragma omp for
for(int dummy=0;dummy<(10000/threadnum);dummy++)
{
....