
I'm implementing a distributed version of the Barnes-Hut n-body simulation in Chapel. I've already implemented the sequential and shared-memory versions, which are available on my GitHub.

I'm following the algorithm outlined here (Chapter 7):

  1. Perform orthogonal recursive bisection (ORB) and distribute the bodies so that each process has an equal amount of work
  2. Construct locally essential tree on each process
  3. Compute forces and advance bodies

I have a pretty good idea of how to implement the algorithm in C/MPI: `MPI_Allreduce` for the bisection and simple message passing between processes for the body transfer. `MPI_Comm_split` is also a very handy function that lets me split the processes at each step of the ORB.

I'm having some trouble performing the ORB using the parallel/distributed constructs that Chapel provides. I need a way to sum (reduce) work across processes (locales in Chapel), to split the locales into groups, and to transfer bodies from one locale to another.

I would be grateful for any advice on how to implement this in Chapel. If a different approach is a better fit for Chapel, that would be great to hear about as well.

  • It may be helpful to distill this question down further. Do you have an excerpt of MPI code or pseudocode (even better) for what you would like to write in Chapel? – ben-albrecht Feb 26 '19 at 15:54
  • I found a finished C++ implementation of the ORB algorithm here: https://github.com/barkm/n-body/blob/master/tree/orb.cpp The `weight_frac` function (the fraction of work below the split) and lines 170-196 (sending bodies to the partner process) are the ones I find problematic. The `weight_frac` function also uses the custom `MPI_Comm` I mentioned (from `MPI_Comm_split`). – Rok Novosel Feb 26 '19 at 18:19
  • Thanks. The question is still pretty large in scope for a stack overflow Q, as it requires someone to read the linked chapter, grok the linked code, and consider how to port to Chapel. It might be better to split this off into separate question(s) about specific parallel construct(s) that you are having trouble with translating to Chapel, e.g. "What is the equivalent of `MPI_` in Chapel?" – ben-albrecht Mar 05 '19 at 17:45

1 Answer


After a lot of deadlocks and crashes I did manage to implement the algorithm in Chapel. It can be found here: https://github.com/novoselrok/parallel-algorithms/tree/75312981c4514b964d5efc59a45e5eb1b8bc41a6/nbody-bh/dm-chapel

I was not able to use much of the fancy parallel machinery Chapel provides; I relied only on block-distributed arrays with sync elements, and I essentially replicated the SPMD model.

In `main.chpl` I set up all of the necessary arrays that will be used to transfer data. Each array has a corresponding sync array used for synchronization. Then each worker is started with its share of the bodies and the previously mentioned arrays. `Worker.chpl` contains the bulk of the algorithm.
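To give a rough idea of the arrangement, here is a simplified sketch (the `Body` record, the `bodyExchange` arrays, and the `work` procedure are illustrative stand-ins, not the actual code from my repository):

use BlockDist;

record Body {                       // illustrative stand-in for the real record
  var pos, vel: 3*real;
  var mass: real;
}

const localeSpace = {0..#numLocales};
const localeDom = localeSpace dmapped Block(boundingBox=localeSpace);

// one data slot per locale plus a matching sync array for synchronization
var bodyExchange: [localeDom] Body;
var bodyExchange$: [localeDom] sync int;

proc work(rank: int, ref data: [] Body, ref data$: [] sync int) {
  // the worker (ORB, locally essential tree, force computation) goes here
  writeln("worker started on locale ", rank);
}

coforall loc in Locales do on loc {
  // every locale runs the SPMD-style worker on its share of the bodies
  work(here.id, bodyExchange, bodyExchange$);
}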

I replaced the `MPI_Comm_split` functionality with a custom function, `determineGroupPartners`, where I do the same thing manually. As for `MPI_Allreduce`, I found a nice little pattern I could use everywhere (a concrete usage sketch follows the overview further below):

use BlockDist;

var localeSpace = {0..#numLocales};
var localeDomain = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDomain] SomeType;
var arr$: [localeDomain] sync int; // stores the ranks of intended receivers
var rank = here.id;                // this runs on every locale (SPMD style)

for i in 0..#(numLocales-1) {   // numLocales-1 exchange rounds
    var intendedReceiver = (rank + i + 1) % numLocales;
    var partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // Consume the reset marker from the previous round (if any) so the slot is empty
    if (arr$[rank].isFull) {
        arr$[rank];
    }

    // Store my value
    arr[rank] = valueIWantToSend;
    arr$[rank] = intendedReceiver;

    // Wait until my partner designates me as its intended receiver
    while (arr$[partner].readFF() != rank) {}

    // Read partner value
    var partnerValue = arr[partner];

    // Do something with partner value

    arr$[partner]; // read (and empty) the partner's sync slot

    // Reset: this write blocks until arr$[rank] has been emptied by my receiver
    arr$[rank] = -1;
}

This is a somewhat complicated way of implementing FIFO channels (see Julia's RemoteChannel, which is where I got the inspiration for this "pattern"). Overview:

  • First, each locale calculates its intended receiver and its partner (the locale it will read a value from)

  • The locale checks whether its previous value has been consumed

  • The locale stores a new value and "locks" it by setting `arr$[rank]` to the intended receiver

  • The locale waits until its partner has stored a value and set the appropriate intended receiver

  • Once the locale is the intended receiver, it reads the partner's value and does some operation with it

  • The locale then empties/unlocks the slot by reading `arr$[partner]`

  • Finally, it resets its `arr$[rank]` by writing -1; because a sync write blocks until the variable is empty, this also ensures that the value the locale stored was read by its intended receiver
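To make the pattern concrete: getting `MPI_Allreduce`-style behavior only requires that the "do something with partner value" step accumulates into a running sum. Here is a self-contained sketch of that use, with a stand-in `myWork` value per locale and explicit sync methods (not the exact code from my repository):

use BlockDist;

config const myWork = 1.0;              // stand-in for the work computed on each locale

const localeSpace = {0..#numLocales};
const localeDom = localeSpace dmapped Block(boundingBox=localeSpace);

var arr: [localeDom] real;
var arr$: [localeDom] sync int;         // stores the rank of the intended receiver

coforall loc in Locales do on loc {
  const rank = here.id;
  var total = myWork;                   // start with my own contribution

  for i in 0..#(numLocales-1) {
    const intendedReceiver = (rank + i + 1) % numLocales;
    const partner = ((rank - (i + 1)) + numLocales) % numLocales;

    // consume the reset marker from the previous round, if any
    if arr$[rank].isFull then arr$[rank].readFE();

    // publish my value and address it to the intended receiver
    arr[rank] = myWork;
    arr$[rank].writeEF(intendedReceiver);

    // wait until my partner addresses its value to me
    while arr$[partner].readFF() != rank { }

    // the "allreduce" step: accumulate the partner's value
    total += arr[partner];

    arr$[partner].readFE();             // empty the partner's slot
    arr$[rank].writeEF(-1);             // blocks until my receiver has emptied my slot
  }

  writeln("locale ", rank, ": total work = ", total);
}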

I realize that this might be an overly complicated solution to this problem. There probably exists a better algorithm that fits Chapel's global view of parallel computation; the algorithm I implemented lends itself to the SPMD model of computation.
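For instance, when all that is needed is a global sum, Chapel's global view can express it directly, without the explicit exchange loop above. A sketch of the kind of thing I mean (the `workPerLocale` array and its values are illustrative, not code from my repository):

use BlockDist;

const localeSpace = {0..#numLocales};
const localeDom = localeSpace dmapped Block(boundingBox=localeSpace);
var workPerLocale: [localeDom] real;

coforall loc in Locales do on loc {
  workPerLocale[here.id] = here.id + 1.0;   // stand-in for locally computed work
}

// a single global-view reduction replaces the SPMD exchange pattern
const totalWork = + reduce workPerLocale;
writeln("total work = ", totalWork);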

That being said, I think Chapel still does a good job performance-wise. Here are the performance benchmarks against Julia and C/MPI. As the number of processes grows, the performance improves by quite a lot. I didn't have a chance to run the benchmark on a cluster with more than 4 nodes, but I think Chapel would end up with respectable results.

  • FWIW, here's a simple implementation of an MPI_Reduce and MPI_Allreduce on 1D arrays which I might consider more Chapel-idiomatic than the SPMD approach above: https://gist.github.com/ben-albrecht/e7d6fc1c9bd65e5042e6eec221d24d9a – ben-albrecht Apr 02 '19 at 12:41