The very idea of using a recursive algorithm for such a workload already seems strange to me, and parallelising it with OpenMP tasks seems stranger still... Why not tackle the problem with a more conventional approach?
So I decided to give it a go with a few methods that came to mind. To make the exercise meaningful, it was important that some actual work be done for the "symmetric calculation"; otherwise, simply iterating over all elements without exploiting the symmetry would certainly be the best option.
So I wrote a force() function computing something loosely related to the gravitational interaction between two bodies, based on their coordinates.
Then I tested 4 different methods of iterating over the particles:
1. A naïve triangular approach such as the one you proposed. Because of its intrinsic load imbalance, this one is parallelised with a schedule(auto) clause to let the run-time library take whatever decision it deems best for performance.
A "clever" traversal of the triangular domain, consisting in cutting it in half in the j
direction to allow for using 2 regular loops. Basically, it corresponds to to something like this:
   /|
  / |      __ __
 /  |  => |  // |
/___|     |//___|
3. A straightforward rectangular approach that simply ignores the symmetry. NB: this one, like your recursive approach, guarantees non-concurrent accesses to the force array.
4. A linearised method consisting of pre-computing the i and j indexes used to access the triangular domain, and iterating over the vectors containing these indexes.
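For what it's worth, it is not obvious at first sight that the folding used in method #2 really covers each (i, j) pair with i < j exactly once. Here is a small standalone brute-force check of that mapping (only a sanity check, completely independent from the timed benchmark below):

#include <iostream>
#include <vector>

int main() {
    for ( int n = 2; n <= 200; n++ ) {
        int m = n/2-1+n%2;                     // same row length as in loop2() below
        std::vector<int> count( n*n, 0 );
        for ( int i = 0; i < n; i++ ) {
            for ( int j = 0; j < m; j++ ) {
                int ii, jj;
                if ( j < i ) {                 // folded half of the rectangle
                    ii = n-1-i;
                    jj = n-1-j;
                }
                else {                         // untouched half of the triangle
                    ii = i;
                    jj = j+1;
                }
                count[ii*n+jj]++;
            }
        }
        if ( n%2 == 0 ) {                      // extra middle column for even n
            for ( int i = 0; i < n/2; i++ ) {
                count[i*n+n/2]++;
            }
        }
        for ( int i = 0; i < n; i++ ) {        // was every pair i < j seen exactly once?
            for ( int j = i+1; j < n; j++ ) {
                if ( count[i*n+j] != 1 ) {
                    std::cout << "mapping broken for n=" << n << "\n";
                    return 1;
                }
            }
        }
    }
    std::cout << "each pair visited exactly once for all tested sizes\n";
    return 0;
}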
Since accumulating the forces with an f[i] += fij; f[j] -= fij; approach would generate race conditions on the updates along the non-parallelised index (j, for example, in method #1), I've created a local per-thread force array, which is initialised to 0 upon entry to the parallel region. The computations are then done per thread on this "private" array, and the individual contributions are accumulated into the global force array with a critical construct after the parallel loop. This is a typical reduction pattern for arrays in OpenMP.
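As a side note, with a compiler supporting OpenMP 4.5 or later, the same pattern can be expressed more compactly with an array-section reduction, letting the run-time library manage the per-thread copies and the final accumulation. This is only a sketch of what method #1 could look like with it (the name loop1_reduction is purely illustrative, and f has to be exposed as a raw pointer for the array-section syntax):

// Sketch of method #1 with an OpenMP 4.5 array-section reduction instead of the
// explicit per-thread arrays and critical accumulation.
void loop1_reduction( int n ) {
    double *fp = f.data();  // reduction over the array section fp[0:n]
    #pragma omp parallel for schedule(auto) reduction(+: fp[:n])
    for ( int i = 0; i < n-1; i++ ) {
        for ( int j = i+1; j < n; j++ ) {
            double fij = force( i, j );
            fp[i] += fij;   // each thread updates a zero-initialised private copy
            fp[j] -= fij;   // the private copies are summed into f at the end of the loop
        }
    }
}

In the code and results below I kept the explicit per-thread arrays, since they do not require OpenMP 4.5 support.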
Here is the full code for this:
#include <iostream>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <omp.h>
std::vector< std::vector<double> > l_f;
std::vector<double> x, y, f;
std::vector<int> vi, vj;
int numth, tid;
#pragma omp threadprivate( tid )
double force( int i, int j ) {
double dx = x[i] - x[j];
double dy = y[i] - y[j];
double dist2 = dx*dx + dy*dy;
return dist2 * std::sqrt( dist2 );
}
void loop1( int n ) {
#pragma omp parallel
{
for ( int i = 0; i < n; i++ ) {
l_f[tid][i] = 0;
}
#pragma omp for schedule(auto) nowait
for ( int i = 0; i < n-1; i++ ) {
for ( int j = i+1; j < n; j++ ) {
double fij = force( i, j );
l_f[tid][i] += fij;
l_f[tid][j] -= fij;
}
}
#pragma omp critical
for ( int i = 0; i < n; i++ ) {
f[i] += l_f[tid][i];
}
}
}
void loop2( int n ) {
int m = n/2-1+n%2; // row length of the folded domain: n/2-1 for even n, (n-1)/2 for odd n
#pragma omp parallel
{
for ( int i = 0; i < n; i++ ) {
l_f[tid][i] = 0;
}
#pragma omp for schedule(static) nowait
for ( int i = 0; i < n; i++ ) {
for ( int j = 0; j < m; j++ ) {
int ii, jj;
if ( j < i ) { // folded half: remap to the mirrored pair so that ii < jj
ii = n-1-i;
jj = n-1-j;
}
else {
ii = i;
jj = j+1;
}
double fij = force( ii, jj );
l_f[tid][ii] += fij;
l_f[tid][jj] -= fij;
}
}
if ( n%2 == 0 ) { // for even n, the middle column j == n/2 is not covered above
#pragma omp for schedule(static) nowait
for ( int i = 0; i < n/2; i++ ) {
double fij = force( i, n/2 );
l_f[tid][i] += fij;
l_f[tid][n/2] -= fij;
}
}
#pragma omp critical
for ( int i = 0; i < n; i++ ) {
f[i] += l_f[tid][i];
}
}
}
void loop3( int n ) {
#pragma omp parallel for schedule(static)
for ( int i = 0; i < n; i++ ) {
for ( int j = 0; j < n; j++ ) {
if ( i < j ) {
f[i] += force( i, j );
}
else if ( i > j ) {
f[i] -= force( i, j );
}
}
}
}
void loop4( int n ) {
#pragma omp parallel
{
for ( int i = 0; i < n; i++ ) {
l_f[tid][i] = 0;
}
#pragma omp for schedule(static) nowait
for ( size_t k = 0; k < vi.size(); k++ ) {
int i = vi[k];
int j = vj[k];
double fij = force( i, j );
l_f[tid][i] += fij;
l_f[tid][j] -= fij;
}
#pragma omp critical
for ( int i = 0; i < n; i++ ) {
f[i] += l_f[tid][i];
}
}
}
int main( int argc, char *argv[] ) {
if ( argc != 2 ) {
std::cout << "need the dim as argument\n";
return 1;
}
int n = std::atoi( argv[1] );
// initialise data
f.resize( n );
x.resize( n );
y.resize( n );
for ( int i = 0; i < n; ++i ) {
x[i] = y[i] = i;
f[i] = 0;
}
// initialising linearised index vectors
for ( int i = 0; i < n-1; i++ ) {
for ( int j = i+1; j < n; j++ ) {
vi.push_back( i );
vj.push_back( j );
}
}
// initialising the local forces vectors
#pragma omp parallel
{
tid = omp_get_thread_num();
#pragma omp master
numth = omp_get_num_threads();
}
l_f.resize( numth );
for ( int i = 0; i < numth; i++ ) {
l_f[i].resize( n );
}
// testing all methods one after the other, with a warm up before each
int lmax = 10000;
loop1( n );
double tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop1( n );
}
double tend = omp_get_wtime();
std::cout << "Time for triangular loop is " << tend-tbeg << "s\n";
loop2( n );
tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop2( n );
}
tend = omp_get_wtime();
std::cout << "Time for mangled rectangular loop is " << tend-tbeg << "s\n";
loop3( n );
tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop3( n );
}
tend = omp_get_wtime();
std::cout << "Time for naive rectangular loop is " << tend-tbeg << "s\n";
loop4( n );
tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop4( n );
}
tend = omp_get_wtime();
std::cout << "Time for linearised loop is " << tend-tbeg << "s\n";
int ret = f[n-1]; // use the result so that the computations cannot be optimised away
return ret;
}
Now it becomes simple to evaluate their relative performance and scalability. All methods are timed over a loop of lmax iterations, after a first non-timed warm-up call.
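Before looking at timings, it is also worth checking that the four methods accumulate the same forces. Here is a minimal sketch of how this could be done on top of the code above (run_once and max_diff are hypothetical helpers added here for illustration, not part of the benchmark):

#include <algorithm>
#include <cmath>

// Run one of the loopN() functions on a zeroed force array and return a copy of the result.
std::vector<double> run_once( void (*loop)(int), int n ) {
    std::fill( f.begin(), f.end(), 0.0 );
    loop( n );
    return f;
}

// Largest absolute difference between the forces computed by two methods.
double max_diff( void (*la)(int), void (*lb)(int), int n ) {
    std::vector<double> fa = run_once( la, n );
    std::vector<double> fb = run_once( lb, n );
    double d = 0.0;
    for ( int i = 0; i < n; i++ ) {
        d = std::max( d, std::abs( fa[i] - fb[i] ) );
    }
    return d;
}

Calling for instance max_diff( loop1, loop2, n ) right after the initialisations in main() should return a value that is tiny compared to the forces themselves, since the methods only differ in the order of the floating-point additions.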
Compilation:
g++ -O3 -mtune=native -march=native -fopenmp tbf.cc -o tbf
Results on an 8-core IvyBridge CPU:
> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 9.21198s
Time for mangled rectangular loop is 10.1316s
Time for naive rectangular loop is 15.9408s
Time for linearised loop is 10.6449s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 6.84671s
Time for mangled rectangular loop is 5.13731s
Time for naive rectangular loop is 8.09542s
Time for linearised loop is 5.4654s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 4.03016s
Time for mangled rectangular loop is 2.90809s
Time for naive rectangular loop is 4.45373s
Time for linearised loop is 2.7733s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 2.31051s
Time for mangled rectangular loop is 2.05854s
Time for naive rectangular loop is 3.03463s
Time for linearised loop is 1.7106s
So in this case, method #4 seems the best option, with both good performance and very good scalability. Notice that the straightforward triangular approach isn't too bad, thanks to the good load-balancing job done by the schedule(auto) clause. But ultimately, I would encourage you to test with your own workload...
For reference, your initial code (modified to compute force() the exact same way as in the other tests, and run with the same numbers of OpenMP threads, but without the need for local force arrays and a final reduction, thanks to the recursive approach) gives this:
> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.32888s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.48718s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 10.962s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 13.2786s