The very idea of using a recursive algorithm for such a workload already seems strange to me, and parallelising it with OpenMP tasks seems stranger still... Why not tackle the problem with a more conventional approach?
So I decided to give it a go with a few methods that came to mind. To make the exercise meaningful, it was important that some actual work be done for the "symmetric calculation"; otherwise, simply iterating over all elements without exploiting the symmetry would certainly be the best option.
So I wrote a force() function computing something loosely related to the gravitational interaction between two bodies, based on their coordinates.
Then I tested 4 different methods of iterating over the particles:
1. A naïve triangular approach such as the one you proposed. Because of its intrinsic load imbalance, this one is parallelised with a schedule(auto) clause to let the run-time library take whatever decision it deems best for performance.
A "clever" traversal of the triangular domain, consisting in cutting it in half in the j
direction to allow for using 2 regular loops. Basically, it corresponds to to something like this:
   /|
  / |      __ __
 /  |  => |  // |
/___|     |//___|
3. A straightforward rectangular approach that simply ignores the symmetry. NB: this one, like your recursive approach, guarantees non-concurrent accesses to the force array.
4. A linearised method consisting of pre-computing the i and j indexes used to access the triangular domain, and iterating over the vectors containing these indexes.
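For what it's worth, it is not obvious at first sight that the folding used in method #2 really covers each (i, j) pair with i < j exactly once. Here is a small standalone brute-force check of that mapping (only a sanity check, completely independent from the timed benchmark below):

#include <iostream>
#include <vector>

int main() {
    for ( int n = 2; n <= 200; n++ ) {
        int m = n/2-1+n%2;                     // same row length as in loop2() below
        std::vector<int> count( n*n, 0 );
        for ( int i = 0; i < n; i++ ) {
            for ( int j = 0; j < m; j++ ) {
                int ii, jj;
                if ( j < i ) {                 // folded half of the rectangle
                    ii = n-1-i;
                    jj = n-1-j;
                }
                else {                         // untouched half of the triangle
                    ii = i;
                    jj = j+1;
                }
                count[ii*n+jj]++;
            }
        }
        if ( n%2 == 0 ) {                      // extra middle column for even n
            for ( int i = 0; i < n/2; i++ ) {
                count[i*n+n/2]++;
            }
        }
        for ( int i = 0; i < n; i++ ) {        // was every pair i < j seen exactly once?
            for ( int j = i+1; j < n; j++ ) {
                if ( count[i*n+j] != 1 ) {
                    std::cout << "mapping broken for n=" << n << "\n";
                    return 1;
                }
            }
        }
    }
    std::cout << "each pair visited exactly once for all tested sizes\n";
    return 0;
}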
Since accumulating the forces with an f[i] += fij; f[j] -= fij; approach would generate race conditions on the updates along the non-parallelised index (j, for example, in method #1), I've created a local per-thread force array, which is initialised to 0 upon entry to the parallel region. The computations are then done per thread on this "private" array, and the individual contributions are accumulated into the global force array with a critical construct after the parallel loop. This is a typical reduction pattern for arrays in OpenMP.
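As a side note, with a compiler supporting OpenMP 4.5 or later, the same pattern can be expressed more compactly with an array-section reduction, letting the run-time library manage the per-thread copies and the final accumulation. This is only a sketch of what method #1 could look like with it (the name loop1_reduction is purely illustrative, and f has to be exposed as a raw pointer for the array-section syntax):

// Sketch of method #1 with an OpenMP 4.5 array-section reduction instead of the
// explicit per-thread arrays and critical accumulation.
void loop1_reduction( int n ) {
    double *fp = f.data();  // reduction over the array section fp[0:n]
    #pragma omp parallel for schedule(auto) reduction(+: fp[:n])
    for ( int i = 0; i < n-1; i++ ) {
        for ( int j = i+1; j < n; j++ ) {
            double fij = force( i, j );
            fp[i] += fij;   // each thread updates a zero-initialised private copy
            fp[j] -= fij;   // the private copies are summed into f at the end of the loop
        }
    }
}

In the code and results below I kept the explicit per-thread arrays, since they do not require OpenMP 4.5 support.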
Here is the full code for this:
#include <iostream>
#include <vector>
#include <cstdlib>
#include <cmath>
#include <omp.h>
std::vector< std::vector<double> > l_f;
std::vector<double> x, y, f;
std::vector<int> vi, vj;
int numth, tid;
#pragma omp threadprivate( tid )
double force( int i, int j ) {
double dx = x[i] - x[j];
double dy = y[i] - y[j];
double dist2 = dx*dx + dy*dy;
return dist2 * std::sqrt( dist2 );
}
void loop1( int n ) {
#pragma omp parallel
{
for ( int i = 0; i < n; i++ ) {
l_f[tid][i] = 0;
}
#pragma omp for schedule(auto) nowait
for ( int i = 0; i < n-1; i++ ) {
for ( int j = i+1; j < n; j++ ) {
double fij = force( i, j );
l_f[tid][i] += fij;
l_f[tid][j] -= fij;
}
}
#pragma omp critical
for ( int i = 0; i < n; i++ ) {
f[i] += l_f[tid][i];
}
}
}
void loop2( int n ) {
int m = n/2-1+n%2; // row length of the folded domain: n/2-1 for even n, (n-1)/2 for odd n
#pragma omp parallel
{
for ( int i = 0; i < n; i++ ) {
l_f[tid][i] = 0;
}
#pragma omp for schedule(static) nowait
for ( int i = 0; i < n; i++ ) {
for ( int j = 0; j < m; j++ ) {
int ii, jj;
if ( j < i ) { // folded half: remap to the mirrored pair so that ii < jj
ii = n-1-i;
jj = n-1-j;
}
else {
ii = i;
jj = j+1;
}
double fij = force( ii, jj );
l_f[tid][ii] += fij;
l_f[tid][jj] -= fij;
}
}
if ( n%2 == 0 ) { // for even n, the middle column j == n/2 is not covered above
#pragma omp for schedule(static) nowait
for ( int i = 0; i < n/2; i++ ) {
double fij = force( i, n/2 );
l_f[tid][i] += fij;
l_f[tid][n/2] -= fij;
}
}
#pragma omp critical
for ( int i = 0; i < n; i++ ) {
f[i] += l_f[tid][i];
}
}
}
void loop3( int n ) {
#pragma omp parallel for schedule(static)
for ( int i = 0; i < n; i++ ) {
for ( int j = 0; j < n; j++ ) {
if ( i < j ) {
f[i] += force( i, j );
}
else if ( i > j ) {
f[i] -= force( i, j );
}
}
}
}
void loop4( int n ) {
#pragma omp parallel
{
for ( int i = 0; i < n; i++ ) {
l_f[tid][i] = 0;
}
#pragma omp for schedule(static) nowait
for ( size_t k = 0; k < vi.size(); k++ ) {
int i = vi[k];
int j = vj[k];
double fij = force( i, j );
l_f[tid][i] += fij;
l_f[tid][j] -= fij;
}
#pragma omp critical
for ( int i = 0; i < n; i++ ) {
f[i] += l_f[tid][i];
}
}
}
int main( int argc, char *argv[] ) {
if ( argc != 2 ) {
std::cout << "need the dim as argument\n";
return 1;
}
int n = std::atoi( argv[1] );
// initialise data
f.resize( n );
x.resize( n );
y.resize( n );
for ( int i = 0; i < n; ++i ) {
x[i] = y[i] = i;
f[i] = 0;
}
// initialising linearised index vectors
for ( int i = 0; i < n-1; i++ ) {
for ( int j = i+1; j < n; j++ ) {
vi.push_back( i );
vj.push_back( j );
}
}
// initialising the local forces vectors
#pragma omp parallel
{
tid = omp_get_thread_num();
#pragma omp master
numth = omp_get_num_threads();
}
l_f.resize( numth );
for ( int i = 0; i < numth; i++ ) {
l_f[i].resize( n );
}
// testing all methods one after the other, with a warm up before each
int lmax = 10000;
loop1( n );
double tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop1( n );
}
double tend = omp_get_wtime();
std::cout << "Time for triangular loop is " << tend-tbeg << "s\n";
loop2( n );
tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop2( n );
}
tend = omp_get_wtime();
std::cout << "Time for mangled rectangular loop is " << tend-tbeg << "s\n";
loop3( n );
tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop3( n );
}
tend = omp_get_wtime();
std::cout << "Time for naive rectangular loop is " << tend-tbeg << "s\n";
loop4( n );
tbeg = omp_get_wtime();
for ( int l = 0; l < lmax; l++ ) {
loop4( n );
}
tend = omp_get_wtime();
std::cout << "Time for linearised loop is " << tend-tbeg << "s\n";
int ret = f[n-1]; // use the result so that the computations cannot be optimised away
return ret;
}
Now it becomes simple to evaluate their relative performance and scalability. All methods are timed over a loop of lmax iterations, after a first non-timed warm-up call.
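Before looking at timings, it is also worth checking that the four methods accumulate the same forces. Here is a minimal sketch of how this could be done on top of the code above (run_once and max_diff are hypothetical helpers added here for illustration, not part of the benchmark):

#include <algorithm>
#include <cmath>

// Run one of the loopN() functions on a zeroed force array and return a copy of the result.
std::vector<double> run_once( void (*loop)(int), int n ) {
    std::fill( f.begin(), f.end(), 0.0 );
    loop( n );
    return f;
}

// Largest absolute difference between the forces computed by two methods.
double max_diff( void (*la)(int), void (*lb)(int), int n ) {
    std::vector<double> fa = run_once( la, n );
    std::vector<double> fb = run_once( lb, n );
    double d = 0.0;
    for ( int i = 0; i < n; i++ ) {
        d = std::max( d, std::abs( fa[i] - fb[i] ) );
    }
    return d;
}

Calling for instance max_diff( loop1, loop2, n ) right after the initialisations in main() should return a value that is tiny compared to the forces themselves, since the methods only differ in the order of the floating-point additions.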
Compilation:
g++ -O3 -mtune=native -march=native -fopenmp tbf.cc -o tbf
Results on an 8-core IvyBridge CPU:
> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 9.21198s
Time for mangled rectangular loop is 10.1316s
Time for naive rectangular loop is 15.9408s
Time for linearised loop is 10.6449s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 6.84671s
Time for mangled rectangular loop is 5.13731s
Time for naive rectangular loop is 8.09542s
Time for linearised loop is 5.4654s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 4.03016s
Time for mangled rectangular loop is 2.90809s
Time for naive rectangular loop is 4.45373s
Time for linearised loop is 2.7733s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./tbf 500
Time for triangular loop is 2.31051s
Time for mangled rectangular loop is 2.05854s
Time for naive rectangular loop is 3.03463s
Time for linearised loop is 1.7106s
So in this case, method #4 seems the best option, with both good performance and very good scalability. Notice that the straightforward triangular approach isn't too bad, thanks to the good load-balancing job done by the schedule(auto) clause. But ultimately, I would encourage you to test with your own workload...
For reference, your initial code (modified to compute force() the exact same way as in the other tests, and run with the same numbers of OpenMP threads, but without the need for local force arrays and a final reduction, thanks to the recursive approach) gives this:
> OMP_NUM_THREADS=1 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.32888s
> OMP_NUM_THREADS=2 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 9.48718s
> OMP_NUM_THREADS=4 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 10.962s
> OMP_NUM_THREADS=8 numactl -N 0 -m 0 ./recursive 500
Time for recursive method is 13.2786s