
Recently, I came across Chapel. I liked the examples given in the tutorials, but many of them struck me as embarrassingly parallel. I work on scattering problems in many-body quantum physics, and a common problem can be reduced to the following:

  1. A tensor A of shape M x N x N is filled with the solution of a matrix equation for each of the M parameters 1..M.
  2. A subset of the tensor A is needed to compute a correction term for each of the parameters 1..M.

The first part of the problem is embarrassingly parallel.

My question is thus whether and how it is possible to transfer only the needed subset of the tensor A to each locale of a cluster while minimizing the necessary communication.

  • Hi @CKl — A few questions in hopes of making my answer as aligned with your scenario as possible: (1) Do you want `A` to be distributed across locales to begin with, or local to a single locale? (2) Is the work in step 2 roughly the same for each parameter in 1..M, or will it vary? (3) If the answer to Q1 is "distributed", do you want the computations in step 2 to act on their local chunk of A, or potentially on remote or distributed chunks? Thanks. – Brad Jun 08 '20 at 18:54
  • Hi @Brad, thanks for your comment. (1) Preferably, A is distributed across the locales, as it can easily grow to a couple of TBs. (2) It's about the same effort. (3) The second step needs the solutions that are located in remote chunks, but not all of them. – CKl Jun 08 '20 at 18:59
  • BTW, I'm not a SO guru, but your question might be more valuable to others if it were more specific to the topic being asked about, since there are many different kinds of complicated parallel computations that someone might want to know about. E.g., "Efficient transfer of sub-arrays in Chapel"? or "Efficient localization of ..."? – Brad Jun 08 '20 at 20:54
  • Thanks for your suggestion. Yes, the title should be more specific. – CKl Jun 08 '20 at 21:08

1 Answer


When Chapel is doing its job right, transfers of array slices between distributed and local arrays (say) should be performed efficiently. This means that you should be able to write such tensor-subset transfers using Chapel's array slicing notation.

For example, here's one way to write such a pattern:

// define a domain describing a 5 x 7 x 3 index set anchored at index (x,y,z)
const Slice = {x..#5, y..#7, z..#3};

// create a new array variable that stores the elements from distributed array 
// `myDistArray` locally
var myLocalArray = myDistArray[Slice];

The new variable myLocalArray will be an array whose elements are copies of the ones in myDistArray, as described by the indices in Slice. The domain of myLocalArray will be the slicing domain Slice; since Slice is a non-distributed domain, myLocalArray will also be a local / non-distributed array, and therefore won't incur any of the overheads of Chapel's distributed array notation when it's operated on from the current locale.
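
As a quick sanity check (these queries are just illustrative, reusing myDistArray, Slice, and myLocalArray from the snippet above and assuming myDistArray was declared over a Block-distributed domain containing Slice):

// illustrative checks only, reusing `myDistArray`, `Slice`, and `myLocalArray` from above
writeln(myLocalArray.domain == Slice);         // true: its domain is the slicing domain
writeln(myLocalArray.isDefaultRectangular());  // true: a plain, non-distributed array
writeln(myLocalArray.locale == here);          // true: allocated on the current locale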

To date, we have focused principally on optimizing such transfers for Block-distributed arrays. For example, for cases like the above example, when myDistArray is Block-distributed, I'm seeing a fixed number of communications between the locales as I vary the size of the slice (though the size of those communications would obviously vary depending on the number of elements that need to be transferred). Other cases and patterns are known to need more optimization work, so if you find a case that isn't performing / scaling as you'd expect, please file a Chapel GitHub issue against it to help alert us to your need and/or help you find a workaround.

So, sketching out the pattern you describe, I might imagine doing something like:

// create a local and distributed version of the complete tensor space
const LocTensorSpace = {1..M, 1..N, 1..N},
      TensorSpace = LocTensorSpace dmapped Block(LocTensorSpace);

// declare array A to store the result of step 1
var A: [TensorSpace] real;

// ...compute A here...

// declare a 1D distributed form of the parameter space to drive step 2    
const ParameterSpace = {1..M} dmapped Block({1..M});

// loop over the distributed parameter space; each locale will use all its cores
// to compute on its subset of {1..M} in parallel
forall m in ParameterSpace {
  // create a local domain to describe the indices you want from A
  const TensorSlice = { /* ...whatever indices you need here... */ };

  // copy those elements into a local array
  var locTensor = A[TensorSlice];

  // ...compute on locTensor here...
}

Some other things that seem related to me, but which I don't want to bog this question down with are:

  • If desired, TensorSpace / A could be declared such that only the 1..M dimension is distributed across locales and the {1..N, 1..N} planes are local (one way to do this is shown in the first sketch after this list)
  • There are also ways to query which indices of a distributed array a locale owns; combined with the previous point, this could be a way to reduce the amount of communication required, assuming there's a correspondence between the iterations of step 2 and the planes of A (see the second sketch below)
  • There are also ways to refer to a distributed array slice in-place and/or to give it a symbolic name rather than creating a local copy of it as suggested above (also shown in the second sketch below)
  • If desired/preferred, A could be declared as a 1D distributed array of 2D arrays, though this may not be as nice if you want to access 3D slices of the space (this is the approach taken in the first sketch below)
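
To make the first and last bullets concrete, here is a minimal sketch of the "1D distributed array of 2D arrays" idea (the name ParamSpace, the sizes, and the placeholder fill are my own; the actual matrix-equation solve is elided):

use BlockDist;

config const M = 10, N = 20;

// distribute only the parameter dimension; each A[m] is then a local N x N
// plane stored entirely on the locale that owns index m
const ParamSpace = {1..M} dmapped Block({1..M});
var A: [ParamSpace] [1..N, 1..N] real;

forall m in ParamSpace {
  // stand-in for solving the matrix equation for parameter m; this promoted
  // assignment fills the whole locally-stored plane
  A[m] = m: real;
}

// in step 2, reading a plane (possibly owned by another locale) copies it
// into a local array
var somePlane = A[M];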

(So feel free to ask follow-up questions if these are of interest)
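
To expand a bit on the locally-owned-indices query and the in-place slice reference from the second and third bullets, here is another minimal sketch (the coforall structure and the fill value are placeholders of mine):

use BlockDist;

config const M = 10, N = 20;

const TensorSpace = {1..M, 1..N, 1..N} dmapped Block({1..M, 1..N, 1..N});
var A: [TensorSpace] real;

coforall loc in Locales do on loc {
  // ask which block of A's indices this locale owns
  const myIndices = A.localSubdomain();
  writeln(here.id, " owns ", myIndices);

  // `ref` gives the slice a symbolic name without copying its elements;
  // reads and writes through it go to wherever those elements live
  ref myBlock = A[myIndices];
  myBlock = here.id;  // stays local, since this locale owns myIndices
}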

Finally, for the sake of posterity, here's the program I wrote up while putting this response together to make sure I'd get the behavior I expected in terms of the number of communications and getting a local array (this was with chpl version 1.23.0 pre-release (ad097333b1), though I'd expect the same behavior for recent releases of Chapel):

use BlockDist, CommDiagnostics;

config const M = 10, N=20;

const LocTensorSpace = {1..M, 1..N, 1..N},
      TensorSpace = LocTensorSpace dmapped Block(LocTensorSpace);

var A: [TensorSpace] real;

forall (i,j,k) in TensorSpace do
  A[i,j,k] = i + j / 100.0 + k / 100000.0;

config const xs = 5, ys = 7, zs = 3,                    // size of slice
             x = M/2-xs/2, y = N/2-ys/2, z = N/2-zs/2;  // origin of slice

const Slice = {x..#xs, y..#ys, z..#zs};

writeln("Copying a ", (xs,ys,zs), " slice of A from ", (x,y,z));

resetCommDiagnostics();
startCommDiagnostics();

var myLocArr = A[Slice];

stopCommDiagnostics();
writeln(getCommDiagnostics());

writeln(myLocArr);
writeln(myLocArr.isDefaultRectangular());
Brad
  • First of all, thank you for this effort. It's definitely a good lead for me. I will probably need some time to digest it, as I'm new to the language. To your points: 1. 1..M should be distributed over the locales, as A[i,:,:] is the solution of the matrix equation. 2. I can define which indices of the tensor each i in 1..M needs for the second step, so I think that should be doable. 3. This is desirable, as it would decrease the memory usage. 4. What would be the advantage of this? I think it would not hurt, as the indices needed will not necessarily be a slice. – CKl Jun 08 '20 at 21:26
  • Addressing points 1 and 4: The two main ways to distribute only the 1..M dimension are (a) to declare A as a 3D array distributed to a 3D array of locales that is degenerate in the other dimensions (see https://stackoverflow.com/questions/53369244/distribute-2d-array-row-wise-among-locales-in-chapel), or (b) to declare it as a distributed 1D array of local 2D arrays. The tradeoffs tend to be primarily in style and somewhat in performance. If it's most natural to think of the space as contiguous in 3D, go with the first approach; if as a collection of 2D things, use the second. – Brad Jun 08 '20 at 22:03
  • We might take this follow-up conversation to gitter or email, though (or new SO questions) as it'll likely become more complicated than SO comments support well (or encourage). – Brad Jun 08 '20 at 22:04
  • Agreed, I'll get something running first and will follow up with new questions. – CKl Jun 08 '20 at 22:08