
In Chapel, homogeneous tuples can be used as if they were small "vectors" (e.g., a = b + c * 3.0 + 5.0;).
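For concreteness, a small self-contained example (arbitrary values, just for illustration):

var b = ( 1.0, 2.0, 3.0 );
var c = ( 0.5, 0.5, 0.5 );

// element-wise tuple/scalar arithmetic, as described above
var a = b + c * 3.0 + 5.0;   // gives ( 7.5, 8.5, 9.5 )
writeln( a );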

However, because various math functions are not provided for tuples, I have written a norm() function in several ways and compared their performance. My code is something like this:

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
    var tmp = ( + reduce x**2 );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

config const nloops = 100000000;  // 1E+8

var res = 0.0;
for k in 1 .. nloops
{
    a[ 1 ] = (k % 5): real;

    res += norm_3tuple(     a );
 // res += norm_loop(       a );
 // res += norm_loop_param( a );
 // res += norm_reduce(     a );
}

writeln( "result = ", res );

I compiled the above code with chpl --fast test.chpl (Chapel v1.16 on OSX 10.11 with 4 cores, installed via Homebrew). norm_3tuple(), norm_loop(), and norm_loop_param() ran at almost the same speed (0.45 sec), while norm_reduce() was much slower (about 30 sec). Checking the output of the top command showed that norm_reduce() was using all 4 cores, while the other functions used only 1 core. So my questions are:

  • Is norm_reduce() slow because reduce works in parallel and the overhead of parallel execution is much greater than the net computational cost for this small tuple?
  • Given that we want to avoid reduce for 3-tuples, the other three routines run at essentially the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., thanks to loop unrolling enabled by the --fast option)?
  • In norm_loop_param(), I also tried using the param keyword for the loop variable, but it gave me little or no performance gain. If we are interested only in homogeneous tuples, is it unnecessary to attach param at all (as far as performance is concerned)?

I'm sorry for asking so many questions at once, and I would appreciate any advice/suggestions for the efficient treatment of small tuples. Thanks very much!

  • when you get your `norm()` function fixed, would you consider contributing it to [NumSuch](https://github.com/buddha314/numsuch)? I am collecting mathematical libraries for Chapel – Brian Dolan Nov 10 '17 at 22:18
  • Hi, thanks very much for pointing to your project (the goal of which seems very big... :-) I will try making contributions after getting some more practical experience (I'm still doing various trial-and-error to get good performance for my use cases...) – roygvib Nov 11 '17 at 01:28

2 Answers


Is norm_reduce() slow because reduce works in parallel and the overhead for parallel execution is much greater than the net computational cost for this small tuple?

I believe you are correct that this is what's going on. Reductions are executed in parallel, and Chapel currently doesn't attempt to do any intelligent throttling to squash this parallelism when the work may not warrant it (as in this case), so I think you're suffering from too much task overhead to do almost no work other than coordinating with the other tasks (though I am surprised that the magnitude of the difference is so large... but I also find I have little intuition for such things). In the future, we'd hope that the compiler would serialize such small reductions in order to avoid these overheads.
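To make that overhead concrete, here is a rough sketch (a hypothetical illustration, not Chapel's actual runtime implementation) of the kind of task structure a parallel reduce implies; for a 3-element tuple, the per-task arithmetic is trivial compared to creating and joining the tasks:

proc norm_reduce_sketch( x: 3*real ): real
{
    const numTasks = here.maxTaskPar;             // e.g., 4 on a 4-core machine
    var partials: [1..numTasks] real;

    coforall tid in 1..numTasks {                 // task creation + join overhead
        var mySum = 0.0;
        for i in tid .. x.size by numTasks do     // this task's share of the 3 elements
            mySum += x[i]**2;
        partials[tid] = mySum;
    }

    var tmp = 0.0;
    for p in partials do                          // combine the partial results serially
        tmp += p;
    return sqrt( tmp );
}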

Given that we want to avoid reduce for 3-tuples, the other three routines run essentially with the same speed. Does this mean that explicit for-loops have negligible cost for 3-tuples (e.g., via loop unrolling enabled by --fast option)?

The Chapel compiler doesn't unroll the explicit for loop in norm_loop() (and you can verify this by inspecting the code generated with the --savec flag), but it could be that the back-end compiler is. Or that the for-loop really doesn't cost that much compared to the unrolled loop of norm_loop_param(). I suspect you'd need to inspect the generated assembly to determine which is the case. But I also expect that back-end C compilers would do decently with the code we generate -- e.g., it's easy for it to see that it's a 3-iteration loop.
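For reference, here is roughly what the for param loop in norm_loop_param() unfolds into at compile time (an illustrative sketch only; the code actually generated can be inspected via --savec):

proc norm_loop_unrolled( x: 3*real ): real
{
    var tmp = 0.0;
    tmp += x[1]**2;   // param iteration i == 1
    tmp += x[2]**2;   // param iteration i == 2
    tmp += x[3]**2;   // param iteration i == 3
    return sqrt( tmp );
}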

In norm_loop_param(), I have also tried using param keyword for the loop variable, but this gave me little or no performance gain. If we are interested in homogeneous tuples only, is it not necessary to attach param at all (for performance)?

This is hard to give a definitive answer to since I think it's mostly a question about how good the back-end C compiler is.

  • Thanks very much for the various info. Because I'm used to Fortran programming, I wanted to use whole-array notation (where possible) to avoid explicit loops (e.g., sum(x**2)), which is as efficient as explicit loops there. However, the pitfall here is that whole-array operations and reductions in Chapel are always done in parallel, which necessarily leads to substantial overhead. So I guess one of the "performance tips" for those coming from Fortran would be to avoid ".." (or reductions) for short sections of large data ... :-) – roygvib Nov 11 '17 at 20:11
  • At present, I think that's reasonable advice. Examples like this could be good motivation for us to do a better job of de-parallelizing such cases. They also relate, somewhat, to work we have planned related to improving vectorization. – Brad Nov 11 '17 at 20:24
  • @Brad Would you mind having a look and perhaps explaining the **adverse effect** of about **2x slower code-execution** once **`--fast`** was used? ( A ready-to-re-run, reproducible testing code is linked below. Thanks. ) – user3666197 Nov 11 '17 at 22:52
  • @user3666197 : When I run the code from your TIO link on my local workstation, I see execution times of similar magnitude whether or not ``--fast`` is used. More importantly, I see variations of > 10x from run to run. That suggests to me that the code in question is simply too short to time in this manner and draw conclusions from. IIRC, the TIO machine is a shared resource and therefore may not be the best system to do time-based experiments on... – Brad Nov 11 '17 at 23:18
  • @user3666197: Also, note that calling `start()`, `stop()`, `elapsed()`, `start()`, `stop()`, `elapsed()` on a single timer variable does not reset it, but causes it to continue on from where it left off. The paradigm to think of is starting then stopping a stopwatch, reading it, then pressing start again. If you want a fresh timing, you should clear it between trials. This likely doesn't affect your experiment (since the shared timer was used for the two fast serial cases), but could affect future ones. – Brad Nov 11 '17 at 23:22
  • @Brad -- There was no intention to let any timer instance run more than one segment of a test ( **no "shared" timing** )... yet I admit that the manual edit-revisions ( using a multi-line commented section ) for mechanical test-selection are neither high-tech nor pleasant to read and re-test ( I just saved a state-full IDE for practical reasons ). As for the **`--fast`** issue, I agree both on the noise and on the shared platform ( that was the very reason to re-run the tests ~10x under the same setup, so as to observe the potential noise impact on the artifact under test ). – user3666197 Nov 11 '17 at 23:38
  • @Brad The strange effects of **`--fast`** on performance ( both the adverse ~2x slowdown and the almost neutral ~1x, which is still an issue ) remain rather strange even in 1E+4-times re-runs of loop-scaled experiments ( Ref. the **Section: "SCALING"** but *not* the **Section: "A third surprise..."** below ), where platform / OS noise is not the root cause of the observed performance, and thus they seem to remain a yet-to-be-inspected / explained performance issue. – user3666197 Nov 11 '17 at 23:51
  • @user3666197: I could not tell what program you were running in the "Scaling" section. The code I saw that timed different loop styles didn't have a way of controlling the number of trials, and the code that I saw which permitted the number of trials to be controlled seemed to only test one loop style (?). – Brad Nov 12 '17 at 14:54
  • Actually the code was linked ( though a further-evolved TiO.IDE state, inside the "A third surprise" section below ); thanks for the remark, Brad, the cleaned code is now linked directly from "Scaling" for a full reference. Anyway, enjoy the Sunday! – user3666197 Nov 12 '17 at 16:05
  • Thanks, I see the code in question now. When I run this on my workstation, I see --fast outperforming default compilations as I'd expect: normal: [LOOP] norm_3tuple(): 1.66564e+06 [us] -- result = 4.30918e+08 @ 100000000 loops. fast: [LOOP] norm_3tuple(): 9.98234e+05 [us] -- result = 4.30918e+08 @ 100000000 loops. – Brad Nov 13 '17 at 05:30

Ex-post remark: actually there was a third remarkable performance surprise at the end...


Performance?
Benchmark! ... always, no exceptions, no excuse

This is what makes Chapel so great. Thanks a lot to the Chapel team for developing and improving such a great computing tool for HPC over more than the last decade.

With all due love for true-[PARALLEL] efforts, performance is always a result of both the design practices and the underlying system hardware, never just a "bonus" granted by a syntax constructor.

The norm_reduce() processing systematically spends several milliseconds just to set up all the concurrency-enabled reduce machinery, only to generate and return a single x**2 product per element to the queue of results for a deferred summation by the central +-reduction engine. Quite a lot of overhead for an operation worth just a couple of CPU clocks, isn't it?

For the reasons why, one may kindly review the costs of process-scheduling details and my updated criticism of the original formulation of Amdahl's Law.


Code benchmarking has actually delivered two surprises at once:

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.RUN
                                        3.74166
[SEQ]       norm_loop():    0.0 [us] -- 3.74166
[SEQ] norm_loop_param():    0.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 5677.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    0.0 [us] -- 3.74166
[SEQ] norm_loop_param():    1.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 5818.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    1.0 [us] -- 3.74166
[SEQ] norm_loop_param():    2.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 4886.0 [us] -- 3.74166

The first was reported in the original post; the second was observed once the Chapel runs were compiled with the --fast switch:

+++++++++++++++++++++++++++++++++++++++++++++++ <TiO.IDE>.+CompilerFLAG( "--fast" ).RUN
                                        3.74166
[SEQ]       norm_loop():    1.0 [us] -- 3.74166
[SEQ] norm_loop_param():    2.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 7769.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    0.0 [us] -- 3.74166
[SEQ] norm_loop_param():    0.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 9109.0 [us] -- 3.74166

                                        3.74166
[SEQ]       norm_loop():    1.0 [us] -- 3.74166
[SEQ] norm_loop_param():    1.0 [us] -- 3.74166
[PAR]:    norm_reduce(): 8807.0 [us] -- 3.74166

As always, the SuperComputing 2017 HPC community promotes [ Reproducibility ] for every aspect published in technical papers or benchmarking tests.

These results were collected on the sponsor-provided Try-It-Online platform, and all interested enthusiasts are welcome to re-run the Chapel code and post their localhost / cluster performance details, so as to better document the hardware-dependent variability of the times observed above ( for further experimentation with the ready-to-run, timing-decorated code, use this link to a state-full snapshot of the TiO.IDE ).

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_SEQ: Timer;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_PAR: Timer;
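/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ // NB: a Timer keeps accumulating elapsed time across start()/stop() pairs unless .clear() is called between trials ( cf. Brad's comment above )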

proc norm_3tuple( x: 3*real ): real
{
    return sqrt( x[1]**2 + x[2]**2 + x[3]**2 );
}

proc norm_loop( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write(                          "[SEQ]       norm_loop(): ",
                                                                       aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_loop_param( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.start();
    var tmp = 0.0;
    for param i in 1 .. x.size do
        tmp += x[i]**2;
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_SEQ.stop(); write(                          "[SEQ] norm_loop_param(): ",
                                                                       aStopWATCH_SEQ.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

proc norm_reduce( x ): real
{
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.start();
    var tmp = ( + reduce x**2 );
/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_PAR.stop(); write(                          "[PAR]:    norm_reduce(): ",
                                                                       aStopWATCH_PAR.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
    return sqrt( tmp );
}

//.........................................................

var a = ( 1.0, 2.0, 3.0 );

// consistency check
writeln( norm_3tuple(     a ) );
writeln( norm_loop(       a ) );
writeln( norm_loop_param( a ) );
writeln( norm_reduce(     a ) );

Scaling:

 [LOOP] norm_3tuple():       45829.0 [us] -- result = 4.30918e+06 @   1000000 loops.
 [LOOP] norm_3tuple():      241680   [us] -- result = 4.30918e+07 @  10000000 loops.
 [LOOP] norm_3tuple():     2387080   [us] -- result = 4.30918e+08 @ 100000000 loops.

[LOOP]  norm_loop():         72160.0 [us] -- result = 4.30918e+06 @   1000000 loops.
[LOOP]  norm_loop():        755959   [us] -- result = 4.30918e+07 @  10000000 loops.
[LOOP]  norm_loop():       7783740   [us] -- result = 4.30918e+08 @ 100000000 loops.

[LOOP]  norm_loop_param():   34102.0 [us] -- result = 4.30918e+06 @   1000000 loops.
[LOOP]  norm_loop_param():  365510   [us] -- result = 4.30918e+07 @  10000000 loops.
[LOOP]  norm_loop_param(): 3480310   [us] -- result = 4.30918e+08 @ 100000000 loops.

-------------------------------------------------------------------------1000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():     5851380   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     5884600   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6163690   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6029860   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6083730   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6132720   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6012620   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6379020   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     5923550   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     6144660   [us] -- result = 4309.18     @      1000 loops.
[LOOP]  norm_reduce():     8098380   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6215470   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5831670   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6124580   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6092740   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5811260   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5880400   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5898520   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6591110   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     5876570   [us] -- result = 4309.18     @      1000 loops. [--fast]
[LOOP]  norm_reduce():     6034180   [us] -- result = 4309.18     @      1000 loops. [--fast]


-------------------------------------------------------------------------2000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    12434700   [us] -- result = 8618.36     @      2000 loops.


-------------------------------------------------------------------------3000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    17807600   [us] -- result = 12927.5     @      3000 loops.


-------------------------------------------------------------------------4000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    23844300   [us] -- result = 17236.7     @      4000 loops.


-------------------------------------------------------------------------5000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    30557700   [us] -- result = 21545.9     @      5000 loops.
[LOOP]  norm_reduce():    30523700   [us] -- result = 21545.9     @      5000 loops.
[LOOP]  norm_reduce():    29404200   [us] -- result = 21545.9     @      5000 loops.
[LOOP]  norm_reduce():    29268600   [us] -- result = 21545.9     @      5000 loops. [--fast]
[LOOP]  norm_reduce():    29009500   [us] -- result = 21545.9     @      5000 loops. [--fast]
[LOOP]  norm_reduce():    30388800   [us] -- result = 21545.9     @      5000 loops. [--fast]


-------------------------------------------------------------------------6000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    37070600   [us] -- result = 25855.1     @      6000 loops.


-------------------------------------------------------------------------7000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    42789200   [us] -- result = 30164.3     @      7000 loops.


---------------------------------------------------------------------8000--------{--fast}---------------------------------------------------------------------
[LOOP]  norm_reduce():    50572700   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49944300   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49365600   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():   ~60+                                                                 // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP]  norm_reduce():    50099900   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49445500   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    49783800   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    48533400   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    48966600   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    47564700   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    47087400   [us] -- result = 34473.4     @      8000 loops.
[LOOP]  norm_reduce():    47624300   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():   ~60+                                                        [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP]  norm_reduce():   ~60+                                                        [--fast] // exceeded the 60 seconds limit and was terminated [Exit code: 124]
[LOOP]  norm_reduce():    46887700   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46571800   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46794700   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46862600   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    47348700   [us] -- result = 34473.4     @      8000 loops. [--fast]
[LOOP]  norm_reduce():    46669500   [us] -- result = 34473.4     @      8000 loops. [--fast]

A third surprise appeared when moving into a forall ... do { ... }:

While the [SEQ] nloops-ed code was badly hurt by the associated add-on overheads, a slight re-formulation of the problem shows that very different performance levels are achievable even on a single-CPU platform ( the performance gain should be all the greater on multi-CPU code execution ), and it also shows the very effect the --fast compiler switch generates here:

/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ use Time;
/* ---------------------------------------SETUP-SECTION-UNDER-TEST--*/ var aStopWATCH_LOOP: Timer;

config const nloops = 100000000;  // 1E+8    
       var   res: atomic real;
             res.write( 0.0 );
//------------------------------------------------------------------// PRE-COMPUTE:
var A1:    [1 .. nloops] real;                                      // pre-compute a tuple-element value
forall k in 1 .. nloops do                                          // pre-compute a tuple-element value
    A1[k] = (k % 5): real;                                          // pre-compute a tuple-element value to a ( k % 5 ), ex-post typecast to real

/* ---------------------------------------------SECTION-UNDER-TEST--*/  aStopWATCH_LOOP.start();
forall i in 1 .. nloops do
{               //  a[1] = (  i % 5 ): real;                        // pre-compute'd
   res.add( norm_reduce( ( A1[i],            a[1], a[2] ) ) );      //     atomic.add()
// res +=   norm_reduce( ( (  i % 5 ): real, a[1], a[2] ) );        // non-atomic
//:49: note: The shadow variable 'res' is constant due to forall intents in this loop

}/* ---------------------------------------------SECTION-UNDER-TEST--*/ aStopWATCH_LOOP.stop(); write(
  "forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: ",     aStopWATCH_LOOP.elapsed( Time.TimeUnits.microseconds ), " [us] -- " );
/* 
   --------------------------------------------------------------------------------------------------------{-nloops-}-------{--fast}-------------
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:     7911.0 [us] -- result =     320.196 @       100 loops. 
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:     8055.0 [us] -- result =    3201.96  @      1000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:     8002.0 [us] -- result =   32019.6   @     10000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:    80685.0 [us] -- result = 3.20196e+05 @    100000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:   842948   [us] -- result = 3.20196e+06 @   1000000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  8005300   [us] -- result = 3.20196e+07 @  10000000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40358900   [us] -- result = 1.60098e+08 @  50000000 loops.
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }: 40671200   [us] -- result = 1.60098e+08 @  50000000 loops.

   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  2195000   [us] -- result = 1.60098e+08 @  50000000 loops. [--fast]

   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4518790   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  6178440   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4755940   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4405480   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4509170   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4736110   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4653610   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4397990   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
   forall .. do { res.add( norm_reduce( aPreComputedTUPLE ) ) }:  4655240   [us] -- result = 3.20196e+08 @ 100000000 loops. [--fast]
  */
  • Thanks very much for trying the timing comparison on another environment (TIO). I didn't expect the overhead (for short data) to be so large, but it is probably natural because the parallel machinery in Chapel is designed for large-scale calculations... But to make the coding more convenient, I guess it may be useful to have a separate, single-locale "light-weight" sum() + whole-array notation that essentially transforms the desired array operations into explicit loops (in a way similar to sum() in other serial array languages); see the sketch after these comments. – roygvib Nov 11 '17 at 20:21
  • You may notice the overhead is **not** related to locale-based parallel computing, but rather to an **almost constant add-on process-setup overhead introduced by the `( + reduce ... )` syntax constructor**, whose indeed immense **costs are devastatingly high and cannot be justified by processing** just a single `x ** 2` operation ( even if any acceleration would help, there is actually none here ). [BTW] have you tried to re-run these tests on your 4-CPU localhost, so as to also compare the localhost O/S noise levels? – user3666197 Nov 11 '17 at 21:29
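For completeness, here is a minimal sketch (hypothetical helper name, not part of any Chapel library) of the serial, "light-weight" sum() idea suggested in roygvib's comment above, written in the same param-unrolled style as norm_loop_param() so that it never spawns tasks:

proc sumSq( x ): real
{
    var s = 0.0;
    for param i in 1 .. x.size do   // unrolled at compile time
        s += x[i]**2;
    return s;
}

var v = ( 1.0, 2.0, 3.0 );
writeln( sqrt( sumSq( v ) ) );      // 3.74166 -- same value as norm_reduce( v ), but strictly serial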