
I have systems with a large number of cores, as well as a cluster. For a particular task for which no serial implementation is available, I can only benchmark the time taken by runs on different input sizes. I see that even when the data size is increased by a factor of 10, the completion time grows by less than a factor of 10 while using identical resources. I would like to know how to measure this performance, as it does not appear to fall under the typical definitions of strong/weak scaling. It appears to be related to efficiency, but I am not certain. From what I could gather about the three:

  1. Strong scaling (Amdahl's law): speedup = 1 / ( s + p / N ) = T( 1 ) / T( N )
  2. Weak scaling (Gustafson's law): scaled speedup = s + p × N
  3. Efficiency: efficiency = speedup / N

As I don't have a speedup value (no serial implementation exists) and N is constant, I can only think of forming ratios of efficiencies using strong scaling. Is such a parameter used in CS?
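For concreteness, here is a minimal sketch (plain Python, with made-up timings purely for illustration) of the three formulas above and of the only ratio I can actually form without a serial baseline:

    # Minimal sketch, plain Python; timings below are made up for illustration.

    def strong_scaling_speedup(t_1: float, t_n: float) -> float:
        """Amdahl-style speedup: S = T(1) / T(N) -- needs the serial time T(1)."""
        return t_1 / t_n

    def weak_scaling_speedup(s: float, p: float, n: int) -> float:
        """Gustafson-style scaled speedup: S = s + p * N, with s + p = 1."""
        return s + p * n

    def parallel_efficiency(speedup: float, n: int) -> float:
        """Efficiency: E = S / N."""
        return speedup / n

    # Without T(1), none of the above can be formed. What I *can* measure is
    # T(d) for different data sizes d on identical resources:
    t_small = 280.0    # [s] run on  10% of the data (made-up value)
    t_large = 1600.0   # [s] run on 100% of the data (made-up value)

    print(f"data grew 10x, time grew only {t_large / t_small:.1f}x")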

  • One can always "orchestrate" a flow of process-execution so as to reach a pure-`[SERIAL]` schedule, i.e. `one`-after-`another`-after-`yet-another`-... ( even, if needed, using explicit, ad-hoc added inter-platform SIG/SYNC signalling, knowing its injected *add-on costs* in `[ns]`, which are non-negligible if the "useful work" of the System-under-Test is small in scale ). Anyway, trying to imagine a SuT where being LARGER introduces FASTER processing sounds interesting. Would you add more details on both process and platform? A single-instruction QPU ( QUBO Quantum annealer ) is one such ⚛ – user3666197 Jan 12 '22 at 14:21
  • Apache Spark workloads on 250-500 GB of data. Benchmarking was done with the 100% and 10% data sets. Jobs run between 250-3000 s depending on their type and size. I can force the number of executors to 1 with 1 executor core, but that would be wrong, as theoretically only an optimally written serial job should serve as the baseline. – Quiescent Jan 12 '22 at 15:47

1 Answer


Apache Spark workloads on 250-500 GB of data. Benchmarking was done with the 100% and 10% data sets. Jobs run between 250-3000 s depending on their type and size. I can force the number of executors to 1 with 1 executor core, but that would be wrong, as theoretically only an optimally written serial job should serve as the baseline.
– Quiescent


Thanks for this note. It gives the problem the grounding needed to answer it:

Q: ... "Is such a parameter used in CS?"

The answer to the question about the observations on the problem depicted above has nothing to do with the DATA-size per se. The DATA-sizing is important, yet the core understanding relates to the internal functioning of the distributed processing, where the overheads matter:

SMALL RDD-DATA 

      +-------------------E-2-E ( RDD/DAG Spark-wide distribution
      |s+------+o         |                        & recollection
      |e|      | v       s|               Turn-Around-Time )
      |t| DATA |  e     d |
      |u|1x    |   r   a  |
      |p+------+    h e   |
      +-------------------+
      |                   |
      |                   |
      |123456789.123456789|

Whereas :

LARGER RDD-DATA

      +--------:------:------:------:-------------------E-2-E ( RDD/DAG Spark-wide TAT )
      |s+------:------:------:------:------+o         + |
      |e|      :      :      :      :      | v       s v|
      |t| DATA : DATA : DATA : DATA : DATA |  e     d  a|
      |u|1x    :2x    :3x    :4x    :5x    |   r   a   r|
      |p+------:------:------:------:------+    h e    .|
      +--------:------:------:------:-------------------+
      |                                                 |
      |                   |                             |
      |123456789.123456789|                             |
      |                                                 |
      |123456789.123456789.123456789.123456789.123456789|

The E-2-E for the larger DATA ( the Spark-wide TAT ) is not a simple multiple of the E-2-E originally observed for the "small" DATA: the Setup & Termination overheads stay about the same, ~ const., while the DATA-size variable part need not grow proportionally, yet it may grow. The timelines therefore show an E-2-E of about ~ 50 TimeUNITs for 5-times more DATA, and not, for obvious reasons, 5-times the ~ 20 TimeUNITs seen during the E-2-E TAT of the "small"-DATA use-case, as not all of the system-wide overheads accumulate and scale with the DATA size.
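The same point as a minimal numeric sketch ( plain Python; the two constants are invented solely to reproduce the timelines drawn above, they are not measured Spark figures ):

    # Constant-overhead model of the ASCII timelines above; constants invented
    # to reproduce them, not measured Spark figures:
    #     E-2-E TAT = ( Setup & Termination overheads ~ const. )
    #               + ( DATA-size variable part )

    OVERHEAD = 12.5   # [TimeUNITs] setup, DAG-scheduling, recollection, termination
    PER_UNIT = 7.5    # [TimeUNITs] per 1x of DATA processed

    def e2e_tat(data_multiple: float) -> float:
        """End-to-end turn-around time for a multiple of the baseline DATA size."""
        return OVERHEAD + PER_UNIT * data_multiple

    for m in (1, 5, 10):
        print(f"{m:>2}x DATA -> {e2e_tat(m):5.1f} TimeUNITs"
              f" ( {e2e_tat(m) / e2e_tat(1):.2f}x the 1x time, not {m}x )")
    # 1x -> 20.0, 5x -> 50.0 ( not 100.0 ), 10x -> 87.5 ( not 200.0 )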

For further reading on Amdahl's argument & the Gustafson/Barsis-promoted scaling, feel free to continue here.

  • Thank you @user3666197 for the detailed response. I am aware of the overhead issue. Indeed, in some other calculations I observed that as the data size increases, the scalability approaches linear from the otherwise super-linear scalability seen on smaller data sizes. As Spark does lazy computation and simultaneous operations that involve probable parallel I/O operations with stage-aware scheduling, it might be difficult to pin down the individual contributions. What I need here, and am looking for, is a measure with which to discuss the performance and to demonstrate that the algorithm is scalable. – Quiescent Jan 12 '22 at 17:45
  • The ALAP-trick, with its ability to completely exclude massive parts of the computing if, on a "later" & "upper" level of the processing, some parts of the ( otherwise distributed ) processing need not take place at all, IMHO collapses our chances to compare apples to apples when trying to make scale-related comparisons. If we cannot "glue together" each and every resource ( naturally distributed in the Spark ecosystem ) in its kind & footprint, we simply have no way to "run" the "same" problem on other, non-distributed resource orchestrations ( and so have no fair baseline to measure against ). – user3666197 Jan 12 '22 at 17:58
  • This is a common "lie" ( a knowingly skewed argumentation ) about *superlinearity*, because there is no monolithic CPU with that many cores, with that many & that fast L1 / L2 / L3 cache hierarchies on each of them, and with that many mem-I/O hardware channels working in a hardware-parallel fashion, as are physically present in the whole footprint of the SuperComputer ( which gets benchmarked against ). So we can only compare apples to oranges ( observing only the side-effects of otherwise unfair advantages, as the scale & composition was not replicated 1:1 in a fair way on the serial-processing hardware ). – user3666197 Jan 12 '22 at 18:04
  • Agreed, ALAP is a black box with surprises that are sometimes not welcome. In this case the computation is embarrassingly parallel with no data shuffling; caching was both enabled and guaranteed with the given resources. I was in any case not going to emphasize 'superlinearity', as that was not the focus, and I noticed that even starting the Yarn process takes a non-negligible amount of time, especially when compared with the completion times of the smaller jobs. :) – Quiescent Jan 12 '22 at 18:27