NoseKnowsAll has correctly identified your problem.
I would like to explain more about why this problem happened. You could have done this with a square loop like this:
#pragma omp parallel for
for(int i=0; i<n; i++) {
    phys_vector sum = 0;
    for(int j=0; j<n; j++) {
        if(i==j) continue;
        //calculate deltaf
        sum += deltaf;
    }
    forces[i] = sum;
}
which uses n*(n-1) iterations and is easy to parallelize. But since force(i,j) = -force(j,i), we can do this in half the iterations, n*(n-1)/2, using a triangular loop (which is what you have done):
for(int i=0; i<n; i++) {
    phys_vector sum = 0;
    for(int j=i+1; j<n; j++) {
        //calculate deltaf
        sum += deltaf;
        forces[j] -= deltaf;
    }
    forces[i] += sum;
}
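For concreteness, here is a minimal, self-contained version of that serial triangular loop. The 2D phys_vector and the inverse-square force law below are hypothetical stand-ins, since the question's actual types and the //calculate deltaf step are not shown:
#include <cmath>
#include <cstdio>
#include <vector>

// hypothetical stand-in for the question's phys_vector type
struct phys_vector {
    double x = 0, y = 0;
    phys_vector& operator+=(const phys_vector& o) { x += o.x; y += o.y; return *this; }
    phys_vector& operator-=(const phys_vector& o) { x -= o.x; y -= o.y; return *this; }
};

int main() {
    // hypothetical bodies: positions and masses
    std::vector<phys_vector> pos  = {{0, 0}, {1, 0}, {0, 1}, {1, 1}};
    std::vector<double>      mass = {1, 2, 3, 4};
    int n = pos.size();
    std::vector<phys_vector> forces(n);

    for (int i = 0; i < n; i++) {
        phys_vector sum;
        for (int j = i + 1; j < n; j++) {
            // calculate deltaf: inverse-square attraction of body i towards body j
            double dx = pos[j].x - pos[i].x, dy = pos[j].y - pos[i].y;
            double r2 = dx*dx + dy*dy;
            double s  = mass[i] * mass[j] / (r2 * std::sqrt(r2));
            phys_vector deltaf = {s * dx, s * dy};
            sum       += deltaf;   // force on i from j
            forces[j] -= deltaf;   // Newton's third law: force on j from i
        }
        forces[i] += sum;          // forces[i] already holds -deltaf terms from earlier i
    }

    for (int i = 0; i < n; i++) printf("force[%d] = (%g, %g)\n", i, forces[i].x, forces[i].y);
}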
The problem is that this optimization makes the outer loop more difficult to parallelize. There are two issues: several threads now write to forces[j], and the iterations are no longer evenly distributed, i.e. the first thread runs over more iterations than the last thread.
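To see the second issue concretely, here is a small sketch (the values of n and nthreads are just example numbers, not from the question) that counts how many inner iterations each thread would get if the outer triangular loop were split into equal contiguous chunks of i, as a plain schedule(static) would do:
#include <cstdio>

int main() {
    int n = 1000, nthreads = 4;
    // contiguous chunks of the outer loop, as with schedule(static)
    for (int t = 0; t < nthreads; t++) {
        int begin = t * n / nthreads, end = (t + 1) * n / nthreads;
        long long work = 0;
        for (int i = begin; i < end; i++) work += n - 1 - i;  // inner iterations for this i
        printf("thread %d: %lld inner iterations\n", t, work);
    }
}
With these example numbers the first thread gets roughly seven times the work of the last one (218625 vs. 31125 inner iterations).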
The easy solution is to parallelize the inner loop:
#pragma omp parallel
for(int i=0; i<n; i++) {
    phys_vector sum = 0;
    #pragma omp for
    for(int j=i+1; j<n; j++) {
        //calculate deltaf
        sum += deltaf;
        forces[j] -= deltaf;
    }
    #pragma omp critical
    forces[i] += sum;
}
This uses n*nthreads critical operations out of a total of n*(n-1)/2 iterations, so the relative cost of the critical sections shrinks as n gets larger. You could use a private forces vector for each thread and merge them in a critical section, but I don't think this is necessary since the critical operations are on the outer loop and not the inner loop.
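For completeness, a sketch of what that per-thread alternative could look like (using the same forces_local idea as the fused versions below, and merging once per thread instead of once per outer iteration):
#pragma omp parallel
{
    std::vector<phys_vector> forces_local(n);   // private to each thread
    for(int i=0; i<n; i++) {
        #pragma omp for
        for(int j=i+1; j<n; j++) {
            //calculate deltaf
            forces_local[i] += deltaf;
            forces_local[j] -= deltaf;
        }
    }
    #pragma omp critical
    for(int i=0; i<n; i++) forces[i] += forces_local[i];
}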
Here is a solution which fuses the triangular loop so that each thread runs over the same number of iterations.
unsigned n = bodies.size();
unsigned r = n*(n-1)/2;
#pragma omp parallel
{
    std::vector<phys_vector> forces_local(bodies.size());
    #pragma omp for schedule(static)
    for(unsigned k=0; k<r; k++) {
        // map the flattened index k to the pair (i,j) with i > j
        unsigned i = (1 + sqrt(1.0+8.0*k))/2;
        unsigned j = k - i*(i-1)/2;
        //calculate deltaf
        forces_local[i] += deltaf;
        forces_local[j] -= deltaf;
    }
    #pragma omp critical
    for(unsigned i=0; i<n; i++) forces[i] += forces_local[i];
}
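If you want to convince yourself that the index formula hits every pair exactly once, a quick standalone check (my own test harness, separate from the solution) is:
#include <cmath>
#include <cstdio>
#include <set>
#include <utility>

int main() {
    unsigned n = 100;
    unsigned r = n*(n-1)/2;
    std::set<std::pair<unsigned,unsigned>> seen;
    for (unsigned k = 0; k < r; k++) {
        unsigned i = (1 + std::sqrt(1.0 + 8.0*k)) / 2;
        unsigned j = k - i*(i-1)/2;
        if (i >= n || j >= i) { printf("bad pair at k=%u: (%u,%u)\n", k, i, j); return 1; }
        seen.insert({i, j});
    }
    // r distinct valid pairs means every (i,j) with i > j was produced exactly once
    printf("%zu distinct pairs out of %u\n", seen.size(), r);
}
For n = 100 this reports 4950 distinct pairs; for very large n you would start to worry about the precision of sqrt, which is exactly the concern addressed next.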
I was unhappy with my previous method for fusing a triangle (because it needs to use floating point and the sqrt function) so I came up with a much simpler solution based on this answer.
This maps a triangle to a rectangle and vice versa. First I convert to a rectangle with width n but with n*(n-1)/2 elements (the same number as the triangle). Then I calculate the (row, column) values of the rectangle, and to map to a triangle (which skips the diagonal) I use the following formula:
//i is the row, j is the column of the rectangle
if(j<=i) {
    i = n - i - 2;
    j = n - j - 1;
}
Let's choose an example. Consider the following n=5 triangular loop pairs:
(0,1), (0,2), (0,3), (0,4)
(1,2), (1,3), (1,4)
(2,3), (2,4)
(3,4)
mapping this to a rectangle becomes
(3,4), (0,1), (0,2), (0,3), (0,4)
(2,4), (2,3), (1,2), (1,3), (1,4)
Triangle loops with an even n work the same way, though it might not be as obvious. For example, for n = 4:
(0,1), (0,2), (0,3)
(1,2), (1,3)
(2,3)
this becomes
(2,3), (0,1), (0,2), (0,3)
(1,2), (1,3)
This is not exactly a rectangle but the mapping works the same. I could have instead mapped it as
(0,1), (0,2), (0,3)
(2,3), (1,2), (1,3)
which is a rectangle but then I would need two formulas for odd and even triangle sizes.
Here is the new code using the rectangle-to-triangle mapping:
unsigned n = bodies.size();
#pragma omp parallel
{
    std::vector<phys_vector> forces_local(bodies.size());
    #pragma omp for schedule(static)
    for(unsigned k=0; k<n*(n-1)/2; k++) {
        unsigned i = k/n;     // row of the rectangle
        unsigned j = k%n;     // column of the rectangle
        if(j<=i) {            // fold the wrapped part back into the triangle
            i = n - i - 2;
            j = n - j - 1;
        }
        //calculate deltaf
        forces_local[i] += deltaf;
        forces_local[j] -= deltaf;
    }
    #pragma omp critical
    for(unsigned i=0; i<n; i++) forces[i] += forces_local[i];
}
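As a quick sanity check on the integer mapping, here is a small standalone sketch (my own test, separate from the solution) that prints the pairs generated for n = 4 and n = 5 in k order and confirms each pair below the diagonal appears exactly once; it reproduces the listings above:
#include <cstdio>
#include <set>
#include <utility>

int main() {
    unsigned sizes[] = {4, 5};
    for (unsigned n : sizes) {
        std::set<std::pair<unsigned,unsigned>> seen;
        printf("n = %u:", n);
        for (unsigned k = 0; k < n*(n-1)/2; k++) {
            unsigned i = k/n, j = k%n;      // (row, column) of the rectangle
            if (j <= i) {                   // fold back into the triangle
                i = n - i - 2;
                j = n - j - 1;
            }
            printf(" (%u,%u)", i, j);
            seen.insert({i, j});
        }
        printf("  -> %zu distinct pairs of %u\n", seen.size(), n*(n-1)/2);
    }
}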