
Amdahl’s law states that the speedup of the entire system is

an_old_time / a_new_time

where a_new_time can be represented as ( 1 - f ) + f / s’, where f is the fraction of the system that is enhanced by some modification, and s’ is the factor by which that fraction is sped up. However, after solving this equation for s’, it seems there are many cases in which s’ comes out negative, which makes no physical sense.

Taking the case that s = 2 (a 100% increase in the speed of the entire system) and f = 0.1 (10% of the system is affected by some speed enhancement s’), we solve for s’ by setting an_old_time = 1, which gives s’ = f / ( f + 1 / s - 1 ).

Plugging in the values for f and s, we find that:
s’ = 0.1 / ( 0.1 + 0.5 - 1 ) = 0.1 / ( -0.4 ) = -0.25
which means that the s’ value is negative.
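
For reference, a minimal Python sketch of the same computation (the names f, S and s_prime are just the symbols from the formula above):

    # solve 1/S = ( 1 - f ) + f / s'  for s'
    f = 0.1           # fraction of the system affected by the enhancement
    S = 2.0           # desired overall speedup, an_old_time / a_new_time
    s_prime = f / ( f + 1.0 / S - 1.0 )
    print( s_prime )  # -> -0.25, i.e. a negative "enhancement"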

How can this be possible, and what is the physical meaning of this? Also, how can I avoid negative s’ values when answering questions like these?

rjc810
  • It's impossible to have S = 2 with F = 0.1 because, as S’ approaches infinity, S approaches 1 / (1 - F), which in this case is 1.111..., so S cannot be larger than that. – Hadi Brais Dec 07 '20 at 11:33

1 Answer


Amdahl's Law, also known as Amdahl's argument, is used to find the maximum expected improvement to an overall process when only a part of the process is improved.


               1                 | where S is the maximum theoretical Speedup achievable
S =  __________________________; |       s is the pure-[SERIAL]-section fraction
                ( 1 - s )        | ( 1 - s )  a True-[PARALLEL]-section fraction
     s    +     _________        |       N is the number of processes doing the [PAR.]-part
                    N            |

Because, by construction, s + ( 1 - s ) == 1 and s can be anything from < 0.0 .. 1.0 >, there is no chance to get negative values here.
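
A minimal sanity check in plain Python ( the names are just the symbols above ): the speedup stays positive for any s in < 0.0 .. 1.0 > and saturates at 1 / s. In the question's notation the enhanced fraction is f == ( 1 - s ), so with f = 0.1 the overall speedup can never exceed 1 / ( 1 - f ) = 1.111..., which is exactly why demanding S = 2 forces a negative s’:

    def amdahl_speedup( s, N ):
        # classical Amdahl speedup: s = [SERIAL] fraction, N = [PAR.] resources
        return 1.0 / ( s + ( 1.0 - s ) / N )

    for N in ( 1, 2, 4, 1024, 10**9 ):        # s = 0.9 <=> enhanced fraction f = 0.1
        print( N, amdahl_speedup( 0.9, N ) )  # grows towards, but never beyond, 1 / 0.9

    f = 0.1
    print( "S_max =", 1.0 / ( 1.0 - f ) )     # ~1.111..., so S = 2 is unreachable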


The full context of the Amdahl's argument & the contemporary criticism, adding all principal add-on overhead factors & a better handling of an atomicity-of-work

It is often applied in the field of parallel computing to predict the theoretical maximum speedup achievable by using multiple processors. The law is named after Dr. Gene M. AMDAHL ( IBM Corporation ) and was presented at the AFIPS Spring Joint Computer Conference in 1967.

His paper extended prior work, cited by Amdahl himself as "... one of the most thorough analyses of relative computer capabilities currently published ...", published in Sep 1966 by prof. Kenneth E. KNIGHT, Stanford School of Business Administration. The paper keeps a general view on process improvement.


Fig.1:

                                                   a SPEEDUP
                                                     BETWEEN
                                                   a <PROCESS_B>-[SEQ.B]-[PAR.B:N]
 [START]                                             and
    [T0]                                  [T0+tsA] a <PROCESS_A>-[SEQ.A]-ONLY
       |                                         |
       v                                         v
       |                                         |
PROCESS:<SEQ.A>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>|
       |                                         |
       +-----------------------------------------+
       |                                         |
    [T0]         [T0+tsB]             [T0+tsB+tpB] 
       |                |                        |   
       v                v                        v
       |________________|R.0: ____.____.____.____|
       |                |R.1? ____.____|         :
       |                |R.2? ____|    :         :
       |                |R.3? ____|    :         :
       |                |R.4?     :    :         :
       |                |R.5?     :    :         :
       |                |R.6?     :    :         :
       |                |R.7?     :    :         :
       |                |         :    :         :
PROCESS:<SEQ.B>>>>>>>>>>|<PAR.B:4>:    :         :
       |                |<PAR.B:2>:>>>>:         :
                        |<PAR.B:1>:>>>>:>>>>>>>>>: ~~ <PAR.B:1> == [SEQ]
                                  :    :         :
                                  :    :         [FINISH] using 1 PAR-RESOURCE
                                  :    [FINISH]        if using 2 PAR-RESOURCEs
                                  [FINISH]             if using 4 PAR-RESOURCEs

( Execution time flows from left to right, from [T0] .. to [T0 + tsB + tpB].
The sketched order of the [SEQ] and [PAR] sections was chosen just for illustrative purposes here; it could, in principle, be the opposite, as the ordering of the process-flow sections' durations is commutative. )


The speedup of a { program | process }, coming from using multiple processors in parallel computing, was derived to be ( maybe to the surprise of the audience ) principally limited by the very fraction of time that is consumed by the non-improved part of the processing, typically the sequential fraction of the program, still executed in a pure [SERIAL] process-scheduling manner ( be it due to not having been parallelised per se, or due to being non-parallelisable by nature ).

For example, if a program needs 20 hours using a single processor core, and a particular portion of the program that takes one hour to execute cannot be parallelized ( being processed in a pure-[SERIAL] process-scheduling manner ), while the remaining 19 hours (95%) of execution time can be parallelized ( using a true-[PARALLEL], not a "just"-[CONCURRENT], process-scheduling ), then the minimum achievable execution time clearly cannot be less than that critical one hour, regardless of how many processors are devoted to the parallelized execution of the rest of this program.

Hence the achievable Speedup is principally limited to 20x, even if an infinite number of processors were used for the [PARALLEL]-fraction of the process.
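
A minimal numerical sketch of this very example ( the hour figures are the ones from the paragraph above ):

    serial_hours   =  1.0   # the non-parallelizable, pure-[SERIAL] hour
    parallel_hours = 19.0   # the parallelizable 95 % of the 20-hour job

    for N in ( 1, 2, 19, 1000, 1000000 ):
        total = serial_hours + parallel_hours / N
        print( N, total, 20.0 / total )
    # the total never drops below 1 hour, so the speedup never exceeds 20x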

See also:

  CRI UNICOS has a useful command amlaw(1) which does simple
  number crunching on Amdahl's Law.

On a CRI system type: man amlaw.

                       1         1
     S =  lim    ------------ = ---
          P->oo        1-s       s
                  s +  ---
                        P

 S = speedup which can be achieved with P processors
 s (small sigma) = proportion of a calculation which is serial
 1-s = parallelizable portion

Speedup_overall = 1 / ( ( 1 - Fraction_enhanced ) + ( Fraction_enhanced / Speedup_enhanced ) )

Articles to parallel@ctc.com (Administrative: bigrigg@ctc.com)
Archive: http://www.hensa.ac.uk/parallel/internet/usenet/comp.parallel
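
In the question's notation Fraction_enhanced is f and Speedup_enhanced is s’, and this form agrees with the [SERIAL]-fraction form above once s = 1 - f. A short, purely illustrative sketch:

    def speedup_enhanced_form( f, s_prime ):
        # Amdahl's law in the Fraction_enhanced / Speedup_enhanced form
        return 1.0 / ( ( 1.0 - f ) + f / s_prime )

    def speedup_serial_form( s, N ):
        # the same law written with the [SERIAL] fraction s and N resources
        return 1.0 / ( s + ( 1.0 - s ) / N )

    f, s_prime = 0.1, 4.0
    print( speedup_enhanced_form( f, s_prime ) )      # ~1.0811
    print( speedup_serial_form( 1.0 - f, s_prime ) )  # identical value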


Criticism:

While Amdahl formulated a process-oriented speedup comparison, many educators keep repeating the formula as if it were postulated for multiprocessing process rearrangement, without also taking into account the following cardinal issues:

  • atomicity of processing ( some parts of the processing are not further divisible, even if more processing-resources are available and free to the process-scheduler -- ref. the resources-bound, further indivisible, atomic processing-section in Fig. 1 above )
  • add-on overheads, which are principally present and associated with any new process creation, its re-distribution by the scheduler, inter-process communication, re-collection of processing results, and remote-process resources' release and termination ( their proportional dependence on N is not widely confirmed; ref. Dr. J. L. Gustafson, Jack Dongarra, et al., who claimed approaches with better than linear scaling in N )

Both of these groups of factors have to be incorporated in the overhead-strict, resources-aware re-formulation of Amdahl's Law, if it is to serve well for comparing apples to apples in contemporary parallel-computing realms. Any use of an overhead-naive formula yields only a dogmatic result, which is by far not what Dr. Gene M. Amdahl formulated in his paper ( ref. above ), and comparing apples to oranges has never brought anything positive to scientific discourse in any rigorous domain.


Overhead-strict re-formulation of the Amdahl's Law speedup S:

               1
S =  __________________________; where s, ( 1 - s ), N were defined above
                ( 1 - s )            pSO:= [PAR]-Setup-Overhead     add-on
     s  + pSO + _________ + pTO      pTO:= [PAR]-Terminate-Overhead add-on
                    N               

Overhead-strict and resources-aware re-formulation:

                           1                         where s, ( 1 - s ), N
S =  ______________________________________________ ;      pSO, pTO
                    / ( 1 - s )           \                were defined above
     s  + pSO + max|  _________ , atomicP  |  + pTO        atomicP:= further indivisible duration of atomic-process-block
                    \     N               /
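
A minimal sketch of both re-formulations in plain Python ( pSO, pTO and atomicP are expressed in the same normalised time units as s and ( 1 - s ); the sample numbers are purely illustrative ):

    def speedup_overhead_strict( s, N, pSO, pTO ):
        # overhead-strict Amdahl: setup + termination overheads added
        return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )

    def speedup_resources_aware( s, N, pSO, pTO, atomicP ):
        # the [PAR.]-part can never shrink below the atomic block atomicP
        return 1.0 / ( s + pSO + max( ( 1.0 - s ) / N, atomicP ) + pTO )

    s, pSO, pTO, atomicP = 0.05, 0.01, 0.01, 0.002
    for N in ( 1, 4, 64, 4096 ):
        print( N,
               round( speedup_overhead_strict( s, N, pSO, pTO ), 3 ),
               round( speedup_resources_aware( s, N, pSO, pTO, atomicP ), 3 ) )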

Interactive Tool for a maximum effective speedup:

For the reasons described above, one picture might be worth a million words here. Try this, where a fully interactive tool for using the overhead-strict Amdahl's Law is cross-linked.

user3666197
  • Another possible problem with applying Amdahl's law: caches and pipelined OoO exec mean that optimizations to one step of a process can impact other parts running later on the same CPU. (Not to mention others running at the same time, competing for shared resources like memory bandwidth.) So applying Amdahl's law to profiler output (showing time spent per function) might lead you to believe that speeding up one function a lot would leave other function times unchanged. Not true if the speedup involves touching more memory, or if the function is so short that OoO exec is significant. – Peter Cordes Dec 10 '20 at 14:21
  • @PeterCordes ***obviously so,*** these (non-extensible) side-effects of the process-execution (most often NUMA) ecosystems are, IMHO, a core reason for *an evangelisation* of the so-called *superlinear speedup* in test-cases that are not a level playing field (so finally comparing the incomparable, *arguing apples to oranges*). If caches "improve" re-uses of data (not needing to mem-I/O 'em from a NUMA RAM-storage at TimeDOMAIN costs of less than 1/300), that is, no doubt, good from the point of view of HPC-budgets, yet such views do not justify not having the same "advantage" in comparing [SERIAL] runs – user3666197 Dec 10 '20 at 15:37
  • @PeterCordes the beauty of the Amdahl's argument is in the interpretation known as ***The Law of Diminishing Returns*** - no matter how many more processing entities we add, there is less and less performance / productivity / economy-of-returns "improvement" to justify doing so. The Amdahl's **glass ceiling** is PRINCIPAL, no matter whether we try to scale up some machine-code, or a flow of mass-vaccination orchestrated on a shopping-mall parking arena, or a flow of mass-production of nano-material respirators on a few, more, or many mask-production lines. **Amdahl rules...** – user3666197 Dec 10 '20 at 15:44
  • @PeterCordes The OoO execution is not an argument per se; it is a known trick ( only on some CPU micro-architectures, where the silicon design was driven in that direction ) for **hardware latency-masking**, be it for a reason of the known design constraints of memory-I/O access times, or for a reason of having only a limited number of ALUs available, or not usable at that particular moment of the CPU-uop in the pipeline, so the OoO execution trick is a way to circumvent ( and for sure only to some limited extent ) these barriers ( knowingly not mentioning the core thermal & power limits ) – user3666197 Dec 10 '20 at 15:51
  • The fact remains that Amdahl's law is formulated on the assumption that the total time is the sum of times for steps, and changing one step doesn't change times for other steps. This isn't always true, especially when your "steps" are very fine-grained. e.g. for separate asm instructions, there's no single "cost" for a given instruction you can add up to get total cost for a program, except on very old simple CPUs before caches and pipelining. – Peter Cordes Dec 10 '20 at 17:33
  • @PeterCordes ***obviously so***, the Amdahl's argument is a **_system-level_** organisation principle, where adding more parallel or improved resources to execute the whole *System_under_Test (SuT)* has its principal "economy" limit in the said law of diminishing returns. One can keep adding more & more arguments, on OoO execution, on compiler-injected machine-code substitutions (AVX & alike tricks, which circumvent or skew the CPU-uops actually put into the Fetch-Decode-Execute pipelines), so any attempt to test micro-scale effects at the CPU-uops level goes into apples-to-oranges in instruction mixes – user3666197 Dec 10 '20 at 19:20