
I am trying to improve my matrix multiplication speed. Are there other implementations I could try that would speed it up? Here are my results so far; I tried a size of 8192, but it took over 2 hours and my ssh connection timed out.

[screenshot of timing results for the sizes tested so far]

Here is my implementation:

use Random, Time;
var t : Timer;
t.start();

config const size = 10;
var grid : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid);
fillRandom(grid2);

//t.start();
forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid[i,k] * grid2[k,j];
        }
    }
}
t.stop();
writeln("Done!:");
writeln(t.elapsed(),"seconds");
writeln("Size of matrix was:", size);
t.clear();

I am comparing times to an MPI implementation in C++. Is there a way to distribute my matrices across my two locales, similar to MPI?

Bofo
  • There ought to be an option to send "heartbeats" over the ssh connection, so that research benchmarks that scale beyond hours can keep running. Also, the measured times would be fairer presented in either scientific notation (8.97355E-1) or a fixed format (20.6f), given the known resolution (seemingly no finer than a [us]) of the (monotonic) clock used. The clocked t{ .start() | .stop() } section ought to avoid all operations unrelated to the computation (here, definitely both of the fillRandom() ops), so as to compare apples to apples in the Problem-under-Review (the scaling, not the random generation / RAM-I/O during the fills). – user3666197 Dec 11 '19 at 07:17

3 Answers


This kind of nesting of the forall loops does not give the best performance in our current implementation. Your algorithm will execute faster if you iterate over a single 2-d domain that defines the iteration space of (i,j). Having a serial loop over k will avoid the data race on updating grid3[i,j]. For example:

....
const D2 = {1..size, 1..size};
forall (i,j) in D2 do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j];

To distribute your matrices, you could use for example the Block distribution - see the example in our online docs. When you distribute, you will of course need to be mindful of the additional communication between locales.
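For illustration only (this sketch is mine, not part of the original answer), a Block-distributed version of the arrays could be declared roughly like this, using the standard BlockDist module (see the primer linked in the comments); note that the remote grid[i,k] / grid2[k,j] reads make this naive form communication-heavy:

use BlockDist, Random;

config const size = 10;

// a 2-D index space, block-distributed across all available locales
const Space = {1..size, 1..size};
const D = Space dmapped Block(boundingBox=Space);

// arrays declared over D have their elements spread across the locales
var grid, grid2, grid3 : [D] real;

fillRandom(grid);
fillRandom(grid2);

// tasks now run on every locale that owns part of D, but grid[i,k] and
// grid2[k,j] may live on remote locales, so expect extra communication
forall (i,j) in D do
  for k in 1..size do
    grid3[i,j] += grid[i,k] * grid2[k,j];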

When testing performance, be sure to compile with --fast.
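For instance (the source file name below is just a placeholder):

chpl --fast matmul.chpl -o matmul
./matmul --size=2048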

Vass
  • P.S. See also our distribution primer: https://chapel-lang.org/docs/primers/distributions.html – Vass Dec 11 '19 at 03:50
  • I disagree with the comment that forall loops do not nest well, but agree that, for a shared-memory code, iterating over a 3D domain is probably preferable. For distributed memory, I think you're going to want to do more than use the Block distribution as suggested here. A complete algorithm change to something designed for distributed memory is likely to be necessary to get good performance. – Brad Dec 11 '19 at 18:51
  • Having revised your original solo-D3-do-{} iterator-domain into the forall-in-D2-for(k)-do-{} tandem iterator, you might notice **the performance has dramatically decreased** on a single-locale device for small matrix sizes (larger sizes remain untestable without @Brad's or other Cray Chapel team members' help to confirm this as the problem size scales further, due to the known ~60 [s] limit for running Chapel tasks on the TiO.RUN public platform). Measured execution times for both use cases, for the --size={128|256|512|640} values that fit, are posted **here**: https://stackoverflow.com/a/59338802 – user3666197 Dec 15 '19 at 00:55

In Chapel, forall loops do not automatically distribute work or data across distinct locales (think: compute nodes or memories). Instead, the forall loop invokes the parallel iterator associated with the thing over which you're iterating.

So, if you're iterating over something that is local to a single locale like a range (as in your code's use of 1..size) or a non-distributed domain or array (like grid in your code), all of the tasks used to implement the parallel loop will execute locally on the original locale. In contrast, if you are iterating over a distributed domain or array (e.g., one that is Block-distributed), or invoking a distributed iterator (e.g., one from the DistributedIters module), the tasks will be distributed across all the locales that the iterand targets.
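As a concrete illustration (my sketch, not from the original answer), the following contrasts the two cases; it assumes the standard BlockDist module and a multi-locale launch such as ./a.out -nl 2:

use BlockDist;

config const n = 4;

// iterating over a plain range: all tasks run where the loop was
// encountered (locale 0 here)
forall i in 1..n do
  writeln("range iteration ", i, " ran on locale ", here.id);

// iterating over a Block-distributed domain: each index is handled by a
// task on the locale that owns it
const D = {1..n, 1..n} dmapped Block(boundingBox={1..n, 1..n});
forall (i,j) in D do
  writeln("index (", i, ",", j, ") ran on locale ", here.id);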

As a result, any Chapel program that does not refer to other locales—whether explicitly via on-clauses or implicitly via abstractions that wrap on-clauses, like the distributed arrays and iterators mentioned above—will never use resources other than the initial locale's.
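For completeness, an explicit on-clause looks like this (a minimal sketch, assuming the program is launched on more than one locale):

// explicitly run a statement on the last locale of the launched set
on Locales[numLocales-1] do
  writeln("this statement ran on locale ", here.id, " of ", numLocales);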

I also wanted to offer a side note on distributed algorithms: Even if you were to update your program above to distribute the grid arrays and forall loops across multiple locales, the triply-nested loop approach is rarely an optimal matrix multiplication algorithm on distributed memory systems, as it doesn't optimize for locality very well. Better might be to investigate matrix multiplication algorithms designed for distributed memory (e.g., SUMMA).

Brad
  • Would you mind, Brad, kindly re-testing both code strategies, the original and Vass's, at growing scale using **`--size={ 128, 256, 512, 1024, 2048, 4096, 8192 }`**, and validating the performance-related surprise with and without **`--ccflags -O3`**? As of now, the TiO.run platform is limited to a block of 60 [s] (the dojo-coders platform even less, ~20 [s]), which is too little to collect any reasonable performance/size dependency mapping. **Thank you for your kind consideration in taking these steps and sharing the actual performance results.** – user3666197 Dec 13 '19 at 20:21
  • I don't personally have the time to do this, sorry. – Brad Dec 16 '19 at 20:11
  • No problem, never mind. It would all have been no more than one copy/paste (the Chapel code as-is, compile-ready below), launching a one-line shell script over the sizes, and then copy/pasting the reported rows of text. – user3666197 Dec 16 '19 at 20:37

Ex-post remark: the benchmarks below show the performance edge of Vass's proposed approach on the hosted ecosystem; it would be great to reproduce this on multi-locale silicon so as to prove it holds universally at larger scales...


Performance?
Benchmark! Always. No exceptions, no excuses; surprises are not rare (see the --fast --ccflags -O3 results below).

This is what makes Chapel so great. Many thanks to the Chapel team for developing and improving such a great computing tool for HPC over more than the last decade.

With all due love for true-[PARALLEL] efforts, performance is always a result of both the design practices and the underlying system hardware, never just a "bonus" granted by a syntax constructor.

Using the TiO.run online IDE for a demo test, the results on single-locale silicon are these (thousands separated by commas):

TiO.run platform uses   1 numLocales,
               having   2 physical CPU-cores accessible (numPU-s)
                 with   2 maxTaskPar parallelism limit

For grid{1,2,3}[ 128, 128] the tested forall sum-product took       3,124 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the tested forall sum-product took       2,183 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 128, 128] the Vass-proposed sum-product took       1,925 [us] excl. fillRandom()-ops

For grid{1,2,3}[ 256, 256] the tested forall sum-product took      28,593 [us] incl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the tested forall sum-product took      25,254 [us] excl. fillRandom()-ops
For grid{1,2,3}[ 256, 256] the Vass-proposed sum-product took      21,493 [us] excl. fillRandom()-ops

For grid{1,2,3}[1024,1024] the tested forall sum-product took   2,658,560 [us] incl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the tested forall sum-product took   2,604,783 [us] excl. fillRandom()-ops
For grid{1,2,3}[1024,1024] the Vass-proposed sum-product took   2,103,592 [us] excl. fillRandom()-ops

For grid{1,2,3}[2048,2048] the tested forall sum-product took  27,137,060 [us] incl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the tested forall sum-product took  26,945,871 [us] excl. fillRandom()-ops
For grid{1,2,3}[2048,2048] the Vass-proposed sum-product took  25,351,754 [us] excl. fillRandom()-ops

For grid{1,2,3}[2176,2176] the tested forall sum-product took  45,561,399 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  45,375,282 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  41,304,391 [us] excl. fillRandom()-ops

Re-compiled with --fast --ccflags -O3:

For grid{1,2,3}[2176,2176] the tested forall sum-product took  39,680,133 [us] incl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the tested forall sum-product took  39,494,035 [us] excl. fillRandom()-ops
For grid{1,2,3}[2176,2176] the Vass-proposed sum-product took  44,611,009 [us] excl. fillRandom()-ops

RESULTS:

On a single-locale device (a single virtualised processor with 2 cores, time-shared among all of the online IDE's public, time-constrained (< 60 [s]) workloads), the still sub-optimal original code (Vass's approach being a further ~20% faster) yields, for the 2048x2048 grid scale, results about ~4.17x faster than the Nvidia Jetson-hosted computing. This performance edge seems to grow further for larger memory layouts, as the scope of the memory I/O grows and CPU L1/L2/L3 cache pre-fetching (which ought to follow the ~O(n^3) scaling) further improves the CPU-based NUMA platforms' edge.

It would be great to see the achievable performance, and how closely it follows the ~O(n^3) scaling, when executed on multi-locale devices and native Cray NUMA cluster platforms.


use Random, Time, IO.FormattedIO;
var t1 : Timer;
var t2 : Timer;

t1.start(); //----------------------------------------------------------------[1]

config const size = 2048;

var grid1 : [1..size, 1..size] real;
var grid2 : [1..size, 1..size] real;
var grid3 : [1..size, 1..size] real;

fillRandom(grid1);
fillRandom(grid2);

t2.start(); //====================================[2]

forall i in 1..size {
    forall j in 1..size {
        forall k in 1..size {
            grid3[i,j] += grid1[i,k] * grid2[k,j];
        }
    }
}

t2.stop(); //=====================================[2]
t1.stop(); //-----------------------------------------------------------------[1]
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] incl. fillRandom()-ops\n",
                      size,
                      size,
                      t1.elapsed( TimeUnits.microseconds )
                      );
             writef( "For grid{1,2,3}[%4i,%4i] the tested forall sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
///////////////////////////////////////////////////////////////////////////////////
t1.clear();
t2.clear();

const D3 = {1..size, 1..size, 1..size};
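// NOTE (editorial): this is the 3-D iteration-space variant referred to above
//                   as "Vass-proposed" (apparently the originally posted form
//                   of that answer; see the comments above). Like the
//                   nested-forall original, it performs concurrent,
//                   unsynchronised updates of grid3[i,j] across the k-dimension.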

t2.start(); //====================================[3]

forall (i,j,k) in D3 do
  grid3[i,j] += grid1[i,k] * grid2[k,j];

t2.stop(); //=====================================[3]
             writef( "For grid{1,2,3}[%4i,%4i] the Vass-proposed sum-product took %12i [us] excl. fillRandom()-ops\n",
                      size,
                      size,
                      t2.elapsed( TimeUnits.microseconds )
                      );
//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\//\\
//       TiO.run platform uses   1 numLocales, having   2 physical CPU-cores accessible (numPU-s) with   2 maxTaskPar parallelism limit
writef( "TiO.run platform uses %3i numLocales, having %3i physical CPU-cores accessible (numPU-s) with %3i maxTaskPar parallelism limit\n",
                      numLocales,
                      here.numPUs( false, true ),
                      here.maxTaskPar
                      );

user3666197
  • @Brad I hope the following error message may be of some use to the Chapel team; let me copy it: `chpl: malloc.c:2385: sysmalloc: Assertion "(old_top == initial_top (av) && old_size == 0) || ((unsigned long) (old_size) >= MINSIZE && prev_inuse (old_top) && ((unsigned long) old_end & (pagesize - 1)) == 0)" failed.` (It appeared during one re-compilation of the code with **`--fast`** and **`--size=64`**.) Chances are the data type guessed from `config const size = 2` might have collided with the larger value: could an overflow of the auto-guessed data-type representation at --size=64 have made size negative? – user3666197 Dec 11 '19 at 13:26