
I have the following question in my assignment. I know that I need to use Amdahl's law, but I don't know which part of the code maps to which part of the formula.

Here is the question:

How much will the following code speed up if we run it simultaneously on 8 threads?

#include <stdio.h>
#include <omp.h> // OpenMP library

int main() {
  int i = 0, j = 0;

  // serial part: 1000 iterations
  for (i = 0; i < 1000; i++) {
    i * i;
  }

  // parallel part: 2000 iterations shared across the threads
  #pragma omp parallel for
  for (j = 0; j < 2000; j++) {
    j * j;
  }

  return 0;
}

Any help is appreciated!

user229044
  • I would expect it to be massively slower. The code does nothing and could probably be optimized away to such an extent that it would be swamped by overheads :( – Martin James Dec 04 '21 at 07:38
  • I mean, even if the loops were not optimized away, 3000 multiplications is just too trivial to try to run in parallel. – Martin James Dec 04 '21 at 07:41

2 Answers


Amdahl's Law is:

S = 1/((1 - p) + p/s)

Here S is the overall speedup, s is the speedup of the part of the task that benefits from improved system resources, and p is the proportion of the total execution time that this part originally occupied.

Your program's serial run time is proportional to 3000, the total number of loop iterations, assuming each iteration costs about the same.

The part of the program you wish to parallelize accounts for 2/3 of that run time. Since you plan to run it on 8 threads, the theoretical speed-up of that part is s = 8.

We plug the numbers into the equation:

1/((1 - 2/3) + (2/3)/8) = 1/(1/3 + 1/12) = 1/(5/12) = 2.4

This indicates a 2.4x upper bound on the speed-up.
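
As a quick sanity check, here is a minimal C sketch of the same calculation ( the helper name amdahl_speedup is mine, purely for illustration ):

#include <stdio.h>

/* Amdahl's Law: S = 1/((1 - p) + p/s) */
static double amdahl_speedup(double p, double s) {
    return 1.0 / ((1.0 - p) + p / s);
}

int main(void) {
    /* p = 2000/3000 of the run time is parallelizable, s = 8 threads */
    printf("speed-up bound = %.2fx\n", amdahl_speedup(2000.0 / 3000.0, 8.0));
    return 0;
}

This prints "speed-up bound = 2.40x", matching the hand calculation above.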

Richard
  • You might like to read the original Gene M. AMDAHL's paper, to see the actual argument ( https://web.archive.org/web/20201031200446/http://wotug.org/parallel/internet/usenet/comp.parallel/FAQ/ -- section "FAQ part 20: IBM and Amdahl", where the paper is at the bottom of the text ). Alan KARP's prize ( and its winners ) is also a delightful part of the history :o) – user3666197 Jan 08 '22 at 19:28
  • If I read correctly, S(s) is the speedup, shown as a shorter end-to-end latency of the process – user3666197 Mar 16 '22 at 16:32
  • If I read correctly, s is the number of lines of execution the parallelizable part was run / executed on – user3666197 Mar 16 '22 at 16:33
  • If I read correctly, p is the parallelizable fraction of the original latency ( process duration ) – user3666197 Mar 16 '22 at 16:34
  • all that meaning, ( 1 - p ) is p* - the non-parallelisable fraction of S(1) – user3666197 Mar 16 '22 at 16:35
  • all assuming that ( ( 1 - p ) + p ) == 1 == S(1) == ( 1 / 1 ) is the one-normalised latency ( the duration of the original process-flow, where no attempts to parallelise exist ) – user3666197 Mar 16 '22 at 16:38
  • which is an assumption that DOES NOT hold. There are parts, added to the original instruction mix, that become needed now to prepare & orchestrate the s-many lines-of-execution ( for just the said p-fraction of the original instruction mix ) which were not present in S(1). We now also have to account for these new ( add-on overhead cost ) instructions - { allocate & prepare LoCE-resources | spawn LoCE | operate all LoCE | recollect results from all LoCE | release all LoCE-resources } - none of which were in S(1) ... – user3666197 Mar 16 '22 at 16:46

If in doubt, give the interactive tool linked below a try and judge afterwards.

TL;DR

@Richard's view is based on Amdahl's argument, yet it does not fit, nor work for, your original question:

"How to apply Amdahl's law on a given piece of code?"

This question deserves a two-fold answer:

a)
The original, overhead-naive, resources-agnostic formulation of Amdahl's law ( if applied to a given piece of code ) answers only the principal limit of the "Speedup" an idealised process-flow is subject to, if operated in an "improved" organisation ( using more, parallel, mutually independent lines of execution that allow some parts of the original process to get organised better, possibly processing them in parallel, and thus improve the overall end-to-end process duration ). So Amdahl's argument was not related to CPUs, CPU-cores, tools for spawning more threads et al; Dr. Gene M. AMDAHL expressed that original formula for general process-flow orchestrations, re-using and acknowledging with admiration the prior work of prof. Kenneth E. KNIGHT, Stanford School of Business Administration, published in Sep-1966.

b)
No matter how tempting the other question was, Amdahl's law does not answer how fast the code will actually run; it only states a principal Speedup limit - a limit one will never achieve, not even under the most abstract and extremely idealised conditions ( zero latency, zero jitter, zero add-on overhead times, zero add-on data-SER/DES overhead times, zero SER-scheduling of batches of PAR-work-segments, and many more not named here ).

Somewhere in 2017 there were published both a criticism of the weaknesses of using the original Amdahl's argument in contemporary contexts and an extended formulation of the original Amdahl's argument, to better reflect those naive-usage weaknesses. After helping for about three years to indeed "Learn more...", as was explicitly written on the click-through link, it was "redacted".
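
The extended formulation itself is not reproduced here; purely to illustrate its general shape ( the overhead terms and their names below are an assumption for illustration, not the published formula ), an overhead-aware variant of the bound can be written in LaTeX as:

% sketch of an overhead-aware Amdahl-style bound ( assumed shape, not the published formula )
% o_setup(s), o_term(s): add-on setup / termination overhead fractions paid for
% orchestrating the s lines of execution, normalised to the S(1) duration
S(s) = \frac{1}{ ( 1 - p ) + o_{setup}( s ) + \frac{p}{s} + o_{term}( s ) }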

There is also a visual GUI-tool one may interact and play with, changing the parameters and visually seeing their immediate impact on the resulting principal-Speedup ceiling. This may help you test and see the hard impacts way better than by just reading the rest of this answer.


Your second question:

"How much will the following code speed up if we run it simultaneously on 8 threads?"

is practical and common in real-world problems, yet Amdahl's law, even in its overhead-strict, resources-aware and atomicity-of-work-aware re-formulated version, does not directly answer it.

We can ( and shall ) do our professional duty and profile the key phases of the real-hardware process-flow, if we aim at a chance to seriously answer this second question, no matter how fuzzy and jitter-dependent our observations might get ( scale, background workloads, CPU-cores' thermal-throttling effects and other inter-related dependencies always matter - "less" on small scales, but an ill-performed or missing add-on overhead analysis may get our HPC-scheduled processing killed just for running out of the HPC-quota ). A minimal measurement sketch for the first item follows the list:

  • what is the add-on overhead cost of ( re )-instantiating a thread or a pool-of-threads ( sometimes even a whole Python-interpreter process, incl. n-many ( re )-replications of its internal state and all of its current data-structures into new RAM allocations, sometimes thus starting O/S resources' suffocation & swap-thrashing ) ... in [ns]

  • what are the add-on overhead costs associated with data ( parameters ) interchange ... in [ns]

  • what are the resources that potentially block the achievable level of concurrent / parallel processing ( sharing, false-sharing, I/O limits, ... ), causing independent barriers to the processing-orchestration even if we have "infinitely"-many free CPU-cores ... this reduces the effective s in the denominator, which decides the achievable effects one may expect from the actual co-existence of independently flowing process-( code )-executions ( we can claim that having 6 Ferrari cars we can move "data" in full parallel, a ( PAR / 6 ) improvement over going one after another in a pure [SERIAL] fashion, yet if the road from start to end crosses a bridge with only 2 lanes, the net effect degrades to only a ( PAR / 2 ) "improvement" on the PAR-section - still not speaking about the other overhead costs of loading and unloading our "data" onto and from our sextet of Ferrari sport-wagons )
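
As a hedged sketch of the first of these measurements ( the loop size, the 8-thread team and omp_get_wtime() as the timer are choices made here for illustration, not a definitive methodology ), one can time the very same loop serially and under the OpenMP parallel-for, spawn / join overheads included:

#include <stdio.h>
#include <omp.h>

#define N 2000

int main(void) {
    long sum = 0, psum = 0;
    long j;

    /* time the serial version of the loop */
    double t0 = omp_get_wtime();
    for (j = 0; j < N; j++) {
        sum += j * j;
    }
    double t_serial = omp_get_wtime() - t0;

    /* time the same loop as an OpenMP parallel-for; the measured time also
       contains the thread-team spawn / join and scheduling overhead costs */
    t0 = omp_get_wtime();
    #pragma omp parallel for reduction(+:psum) num_threads(8)
    for (j = 0; j < N; j++) {
        psum += j * j;
    }
    double t_parallel = omp_get_wtime() - t0;

    /* print the sums so the compiler cannot optimise the loops away */
    printf("sums: %ld %ld\n", sum, psum);
    printf("serial:   %.9f s\nparallel: %.9f s\nmeasured speedup: %.3fx\n",
           t_serial, t_parallel, t_serial / t_parallel);
    return 0;
}

At this trivial scale the measured speedup is typically below 1.0x - exactly the add-on overhead effect the list above warns about.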

More thoughts on the recursive-use in the real-world realm are here; these are always device- and instruction-mix specific ( architecture & cache details matter an awful lot ).

user3666197