
Can someone please explain to me why the sequential version of the π-approximation was faster than the parallel one?

I can't figure it out.

I'm playing around with a very well-known π-approximation example. I pick random points in the unit square ( ( 0, 0 ) to ( 1, 1 ) ) and count how many of the random points fall inside the unit circle. The fraction should approximate the value of π / 4.

import java.util.stream.LongStream;

public class PIEstimation {
    final static int NUM_SAMPLES = 100000000;

    public static void main(String[] args) {
        sequentialVersion();
        parallelVersion();
        System.out.println("               Real PI:= " + Math.PI);
    }

    public static void sequentialVersion() {
        final long start = System.nanoTime();

        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            }).count();

        final long duration = (System.nanoTime() - start) / 1_000_000;

        System.out.println("Sequential Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }

    public static void parallelVersion() {
        final long start = System.nanoTime();

        final long count = LongStream
            .rangeClosed(1, NUM_SAMPLES)
            .parallel()
            .filter(e -> {
                double x = Math.random();
                double y = Math.random();
                return x * x + y * y < 1;
            }).count();

        final long duration = (System.nanoTime() - start) / 1_000_000;

        System.out.println("  Parallel Version: PI ~ " + 4.0 * (count / (double) NUM_SAMPLES) + " calculated in "
            + duration + " msecs");
    }

}

The results:

Sequential Version: PI ~ 3.14176568 calculated in  4893 msecs
  Parallel Version: PI ~ 3.1417546  calculated in 12044 msecs
               Real PI:= 3.141592653589793
MLeiria

  • `Math.random()` scales poorly under heavy contention. Try replacing `Math.random()` with `ThreadLocalRandom.current().nextDouble()`. – Misha Sep 13 '17 at 18:14
  • Possible duplicate of [Java 8's streams: why parallel stream is slower?](https://stackoverflow.com/questions/23170832/java-8s-streams-why-parallel-stream-is-slower) – Jacob G. Sep 13 '17 at 18:15
  • @JacobG. Not a duplicate as far as I can see. Answer + explanation forthcoming. – Stuart Marks Sep 13 '17 at 20:19
  • @Misha Yup, that was it! – Stuart Marks Sep 13 '17 at 22:25
  • Experimentation did not prove the speedup ( URL to platform with both the code and results below in 1st Section, 2nd paragraph ). Besides the blocking-nature of the "monopolistic-RNG-authority" described below, there are also other process- and resources-management related overheads to take into account. Plus, if the outer `.count()` method is not JIT-translated into a progressive / incremental count-processing inside all `.parallel()` code-executions, but is just a post-processing method on a final stream assembled by `.filter()`, the speedup achievable from `.parallel()` is lost ( ref. test data ) – user3666197 Sep 14 '17 at 03:49
  • Same question here [1](https://stackoverflow.com/questions/44294483/why-does-this-parallel-stream-run-15x-slower) and here [2](https://stackoverflow.com/questions/41380318/why-is-parallel-stream-slower) – Stefan Zobel Sep 14 '17 at 07:31
  • Well, with all due respect, **none were the same** -- after **theoretical explanations below** were ( almost ) rejected, the thorough experimentation has also proven **the `LongStream`-approach still spends almost double time on `.parallel().filter().count()` even when the first principal performance-blocker was removed** ( and a `ThreadLocalRandom` source was used, so as to avoid paying the costs of backpropagated feedbacks to a `Math.random()` "centralised"-source-of-randomness ( ref. below for details ) ). **The 1st, logical, step does not explain the whole root cause**. Experiments confirm this. – user3666197 Sep 14 '17 at 14:48
  • The root cause is the Amdahl's Law applied on the actual process-execution scheduling -- it **explains both** the adverse impact from the performance-side-effect of the `[SEQ]`-blocker ( if left inside `[PAR]`-section ( explained in detail below ) ) and **also** a poor speedup, if `[PAR]`-section is so "short", that all the `[PAR]`-overheads get hardly paid, if covered at all, from benefits of a parallelisation on small to moderate amount of `N`-s ( explained below ). So, **the overhead-strict Amdahl's Law is a principal part of a professional design of any high performance code. `Q.E.D.`** – user3666197 Sep 14 '17 at 15:47

2 Answers


I get even worse results running in parallel on my machine (3.0 GHz Intel Core i7, two cores, four threads):

sequential: PI ~ 3.14175124 calculated in  4952 msecs
  parallel: PI ~ 3.14167776 calculated in 21320 msecs

I suspect the main reason is that Math.random() is thread-safe, and so it synchronizes around every call. Since there are multiple threads all trying to get random numbers at the same time, they're all contending for the same lock. This adds a tremendous amount of overhead. Note that the specification for Math.random() says the following:

This method is properly synchronized to allow correct use by more than one thread. However, if many threads need to generate pseudorandom numbers at a great rate, it may reduce contention for each thread to have its own pseudorandom-number generator.

To avoid lock contention, use ThreadLocalRandom instead:

long count = LongStream.rangeClosed(1, NUM_SAMPLES)
                       .parallel()
                       .filter(e -> {
                           ThreadLocalRandom cur = ThreadLocalRandom.current();
                           double x = cur.nextDouble();
                           double y = cur.nextDouble();
                           return x * x + y * y < 1;
                       })
                       .count();

This gives the following results:

sequential2: PI ~ 3.14169156 calculated in 1171 msecs
  parallel2: PI ~ 3.14166796 calculated in  648 msecs

which is a 1.8x speedup, not too bad for a two-core machine. Note that the ThreadLocalRandom version is also faster when run sequentially, probably because there's no lock overhead at all.

Aside: Normally for benchmarks I'd suggest using JMH. However, this benchmark seems to run long enough that it gives a reasonable indication of relative speeds. For more precise results, though, I do recommend using JMH.
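Pending a proper JMH setup, a crude warm-up-then-measure harness at least lets the JIT compile the hot loop before timing it. This is only an illustrative sketch (the `CrudeBench` class and `timeMillis` helper are made-up names, not part of the answer's code), and it is no substitute for JMH:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.function.LongSupplier;

public class CrudeBench {

    // Crude stand-in for JMH: run a few warm-up rounds so the JIT can
    // compile the hot loop, then time a single measured run.
    static long timeMillis(LongSupplier task) {
        for (int i = 0; i < 3; i++) task.getAsLong();   // warm-up rounds, results discarded
        long start = System.nanoTime();
        long result = task.getAsLong();                 // measured run
        long ms = (System.nanoTime() - start) / 1_000_000;
        System.out.println("result = " + result + " in " + ms + " ms");
        return ms;
    }

    public static void main(String[] args) {
        // Time a small dart-throwing loop as an example workload.
        timeMillis(() -> {
            ThreadLocalRandom rng = ThreadLocalRandom.current();
            long hits = 0;
            for (int i = 0; i < 1_000_000; i++) {
                double x = rng.nextDouble();
                double y = rng.nextDouble();
                if (x * x + y * y < 1) hits++;
            }
            return hits;
        });
    }
}
```

JMH handles dead-code elimination, forking, and statistical aggregation that this sketch ignores, which is why it remains the recommendation for precise results.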

UPDATE

Here are additional results (requested by user3666197 in comments), using a NUM_SAMPLES value of 1_000_000_000 compared to the original 100_000_000. I've copied the results from above for easy comparison.

NUM_SAMPLES = 100_000_000

sequential:  PI ~ 3.14175124 calculated in    4952 msecs
parallel:    PI ~ 3.14167776 calculated in   21320 msecs
sequential2: PI ~ 3.14169156 calculated in    1171 msecs
parallel2:   PI ~ 3.14166796 calculated in     648 msecs

NUM_SAMPLES = 1_000_000_000

sequential:  PI ~ 3.141572896 calculated in  47730 msecs
parallel:    PI ~ 3.141543836 calculated in 228969 msecs
sequential2: PI ~ 3.1414865   calculated in  12843 msecs
parallel2:   PI ~ 3.141635704 calculated in   7953 msecs

The sequential and parallel results are (mostly) the same code as in the question, and sequential2 and parallel2 are using my modified ThreadLocalRandom code. The new timings are overall roughly 10x longer, as one would expect. The longer parallel2 run isn't quite as fast as one would expect, though it's not totally out of line, showing about a 1.6x speedup on a two-core machine.

Stuart Marks
  • Stuart, are you sure the JIT-compiler is implementing also the final ( outer ) `.count()` on transformed `LongStream` incrementally in parallel? Did some testing of several implementation approaches, but on a hosted platform, that restricts my control of the number of CPU / threads to use -- **could you kindly re-run the code** as is URL-referred in the first Section, at the end of the paragraph two ( of my answer ) **on your host, with some higher `NUM_SAMPLES`** so as to get outside a noise level ~ `1E+10`, **and post the outputs, to see the realistic differences?** Thank you in advance. – user3666197 Sep 14 '17 at 03:27
  • ( the tests for just `1E+9` samples result on the above cited platform about **twice slower** to run a `LongStream` with `.parallel().filter().count()` with `ThreadLocalRandom` than even a pure `[SEQ]` version running a trivial naive `while(){...}` or a split-task `.parallel()` with distributed summations and just top-level `.map( while(){...} ).sum()` ) – user3666197 Sep 14 '17 at 03:31
  • @user3666197 I reran my code with a larger NUM_SAMPLES value and have posted the results. Sorry, I don't have time to run and analyze your code. – Stuart Marks Sep 14 '17 at 05:25
  • @user3666197 Also, anything can be happening in a hosted environment, so it's probably quite counterproductive to run performance benchmarks in that environment. I don't think it's worth worrying about trying to explain those results. – Stuart Marks Sep 14 '17 at 05:31
  • Thx Stuart for update, but sorry, you **have missed the point.** The point is to **also run the non-`LongStream` (trivial `while(){...}`-based) approaches on the very same platform as you ran the previous tests**. I respect your time, but not running the B-part of the A/B-test scenario ( by just re-running A-part ) is rather a surprise from a professional. **The tests were pre-setup just to use the URL/copy/paste/run, so I hope you did not feel any discomfort/pressure in being asked to re-run the A/B-tests on your platform, not to "escape" from hosted, but to make tests coherent with yours.** – user3666197 Sep 14 '17 at 06:21
  • As demonstrated before, the **non**-`LongStream` approaches **were way faster** ~ 2x at 1E+9 samples **than your fastest** `parallel2` tests executed on your `localhost`-dualCore-4HT i7. The solution in `PAR3/4wTLR PI ~ 3.141657656 calculated in 7489[ms]`, which were tested to have been run on not more than just 2-threads, instead of your 4-threads on your device, used for the said test. **That is the point to re-run it on 4-threads on your device**, so as the **performance then becomes rigorously comparable to your numbers about your fastest `parallel2 ... 7953 msecs`** – user3666197 Sep 14 '17 at 06:36
  • As a side note, since the likelihood of being inside the circle is higher than being outside, there’s the simple optimization of counting the misses instead of the hits. – Holger Sep 14 '17 at 16:43
  • @Holger this principal observation fails to work in cases, where `LongStream.rangeClosed()` with ad-hoc `.filter()` was being used for generated representation of the randomness ( i.e. the process has to generate both the misses and hits, so as to be able to count 'em ). **There is no such randomness-generator, that would *"know-about-how"* or *"allow-to-take-a-shortcut"* to *"skip"* all the non-counted samples so as to somehow accelerate the counting** ( and if it were, it would not be a reasonable source of randomness, would it? ) – user3666197 Sep 19 '17 at 17:18
  • @user3666197: you can’t skip the generation of random numbers, but counting them is an operation too (an addition), which will be skipped for elements not matching the predicate. In cases where the JVM fails to inline the entire pipeline, it might bear even additional overhead. Not big compared to the random number generation, but the difference is measurable, and it only requires changing a `<` to a `>=` and prepending a `4-` to the final result calculation… – Holger Sep 20 '17 at 06:34
  • @Holger this nano-difference is clear ( also might have presented an argument, the stream-related ( allocation+management+`.count()` ) overheads might get proportionally smaller ~ (PI/4) : (1-PI/4) ~ 785:215 ~ 3.65 x, **but one may still notice, the core speed-loss was in using `LongStream` ( with its nest of methods ) compared to a plain `while(){...}`** Mirrored-counting >= 1. does not enjoy more than about ~200 [ms] speedup to original-counting (@1E+9) **whereas `while(){...}` code is still about twice faster** than the `.rangeClosed().filter().count()` ( 2x time lost on void abstractions ) – user3666197 Sep 20 '17 at 14:29
  • So **if The Motivation** was performance, a cautious design practice shall validate a use of the just-enough tools without adding any unreasonable steps ( the overhead-strict Amdahl Law formulation shows the cruel penalties if not obeying this design practice -- **resulting in slower parallel code-execution, compared to ordinary serial code-execution**. The reason is in the very add-on overhead cost, not the nature of the parallel execution. Detailed experimental data confirm this in-vivo ( re-test-able below ). **Q.E.D.** ) – user3666197 Sep 20 '17 at 14:35
  • @user3666197: note that [this comment](https://stackoverflow.com/questions/46201094/46207700?noredirect=1#comment79411932_46207700) was clearly marked “*As a side note*”, not claiming to address the “Stream vs loop” issue in any way. I know that there are performance issues with the Stream API combined with the HotSpot JVM, mostly stemming from ancient, never-revised defaults, most notably the hardcoded maximum inline threshold of nine nested method invocations, which will be exceeded more than often with the Stream API. – Holger Sep 20 '17 at 15:10

Prologue:

If The Motivation for going parallel was indeed higher performance, a cautious design practice should validate the use of just-enough tools, without adding any unreasonable steps
( the overhead-strict Amdahl's Law formulation shows the cruel penalties of not obeying this design practice -- resulting in a slower parallel code-execution, compared to an ordinary serial code-execution. The reason is the very add-on overhead cost, not the nature of the parallel execution. Detailed experimental data confirm this in-vivo ( re-testable below ). Q.E.D. )

                              PAR:  sTLR PI ~ 3.141674092  calculated in 14386[ms] | log10 DIFF ~ -4.0891707129672366
                              PAR:  sTLR PI ~ 3.141483408  calculated in 14007[ms] | log10 DIFF ~ -3.9615960863226483
                              PAR:  sTLR>PI ~ 3.141480016  calculated in 14249[ms] | log10 DIFF ~ -3.948316651088744
                              PAR:  sTLR>PI ~ 3.141588188  calculated in 13967[ms] | log10 DIFF ~ -5.350121173515822
                                    |  ||                                  |
 LongStream.rangeClose()------------+  ||                                  |
 ThreadLocalRandom src-of-randomness---+|                                  |
 { _: original | >: mirrored }-counting-+                                  |   
                                                                           |
                              PAR3/2wTLR PI ~ 3.141519024  calculated in  7394[ms] | log10 DIFF ~ -4.132947619067715
                              PAR3/2wTLR PI ~ 3.141737748  calculated in  7741[ms] | log10 DIFF ~ -3.8383493185270883
                              PAR3/2wTLR>PI ~ 3.141537576  calculated in  6751[ms] | log10 DIFF ~ -4.491034048285749
                              PAR3/2wTLR>PI ~ 3.14154848   calculated in  7272[ms] | log10 DIFF ~ -4.35483730610719
                                   ||  ||                                  |
 Amdahl-Law-fair-2-blks/2-threads--+|  ||                                  |
 while(){...}-----------------------+  ||                                  |
 ThreadLocalRandom src-of-randomness---+|                                  |
 { _: original | >: mirrored }-counting-+                                  |   

Why may a Sequential version get faster than a Parallel one?

It depends a lot on implementation details.

The platform's code-execution timings were observed to bear a noise of about +/- 100 [ms] for NUM_SAMPLES ~ 1E+8 dart-throws. You may Ctrl+C / Ctrl+V / RUN all the code-approaches listed here, so as to test the different implementations and to re-test them all after adding any better alterations. N.b.: Running all the tests makes sense if and only if all of them are run on the very same platform ( be it your localhost or your preferred TEST-infrastructure -- which one does not matter per se, but staying on the same platform does matter if the observed results are to be comparable to each other: apples to apples, not apples to oranges ). The tests self-document their performance; just run 'em.

On all parallelisation-ready platforms there is a risk of resource-contention if the process is heavily dependent on a source-of-randomness ( we call it twice per dart-shot in the PI-boundary experiment ) and the platform implements this peculiar magic-box hardwired onto some hardware element, because of the cryptographically required pure-[SERIAL] ( thus non-parallelisable ) nature of such a source-of-randomness. In that case, even an infinitely parallelisable process will stand in a pure [SEQ]-queue to get its next cryptographically-robust random number. If a platform can generate massively-parallel random numbers, it typically instantiates as many "independent" sources-of-randomness as necessary, so as to deliver in-stream cryptographically-robust sequences of random numbers -- but this requires some approach other than "shared"-sequence-pooling from Math via calls to its .random() method.
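The "many independent sources-of-randomness" idea can be sketched with `SplittableRandom`: pre-split one root generator sequentially into one thread-confined generator per block, so no worker ever touches a shared, synchronized source. This is my illustration, not code from the thread; `SplitRngSketch` and `estimatePi` are made-up names:

```java
import java.util.SplittableRandom;
import java.util.stream.IntStream;

public class SplitRngSketch {

    // One independent generator per block, split up front from a single
    // root; split() itself is NOT thread-safe, hence the sequential loop.
    static double estimatePi(int blocks, int samplesPerBlock, long seed) {
        SplittableRandom root = new SplittableRandom(seed);
        SplittableRandom[] rngs = new SplittableRandom[blocks];
        for (int b = 0; b < blocks; b++) {
            rngs[b] = root.split();
        }
        long hits = IntStream.range(0, blocks)
                .parallel()
                .mapToLong(b -> {
                    SplittableRandom rng = rngs[b];   // thread-confined during the run
                    long h = 0;
                    for (int i = 0; i < samplesPerBlock; i++) {
                        double x = rng.nextDouble();
                        double y = rng.nextDouble();
                        if (x * x + y * y < 1) h++;
                    }
                    return h;
                })
                .sum();
        return 4.0 * hits / ((double) blocks * samplesPerBlock);
    }

    public static void main(String[] args) {
        System.out.println("PI ~ " + estimatePi(4, 25_000_000, 42L));
    }
}
```

Note these generators are statistically independent but not cryptographically robust; a cryptographic source would still face the [SEQ]-queue issue described above.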

Why was the LongStream version almost twice as slow as a naive loop?

First, let me note that using any high-level abstraction is the first risk of losing control over efficient parallelisation. Yes, the LongStream methods are pretty attractive for implementing such a scheme in a few SLOCs, but here the dumb-level code setup shows that you systematically lose about ~ 1110 - 1200 [ms] to the stream-based processing:

----< extended tests >----
( the tests for just 1E+9 samples result on the above cited platform
  about
  twice slower to run
     a LongStream with .parallel().filter().count()
     with ThreadLocalRandom
  than even
     a pure [SEQ] version, running a trivial naive while(){...}
  or a split-task .parallel() with distributed summations
     and just top-level .map( while(){...} ).sum()
  )
                          1E+9
                              SEQ   wTLR PI ~ 3.141553264 calculated in  7386[ms] | log10 DIFF ~ -4.404618541952148
                              SEQ   wTLR PI ~ 3.1414793   calculated in  7578[ms] | log10 DIFF ~ -3.9455647216537932
                              SEQ   wTLR PI ~ 3.141631852 calculated in  6913[ms] | log10 DIFF ~ -4.40673154655876
                              PAR:  sTLR PI ~ 3.141556252 calculated in 14293[ms] | log10 DIFF ~ -4.438879648677778
                              PAR:  sTLR PI ~ 3.141610744 calculated in 13919[ms] | log10 DIFF ~ -4.742551585240871
                                    |  |                                  |
 LongStream.rangeClose()------------+  |                                  |
 ThreadLocalRandom src-of-randomness---+                                  |
                                                                          |
                              PAR3/2wTLR PI ~ 3.141519396 calculated in  6944[ms] | log10 DIFF ~ -4.135147373916102
                                   ||  |                                  |
 Amdahl-Law-fair-2-blks/2-threads--+|  |                                  |
 while(){...}-----------------------+  |                                  |
 ThreadLocalRandom src-of-randomness---+                                  |
                                                                          |
                              PAR3/4wTLR PI ~ 3.141657656 calculated in  7489[ms] | log10 DIFF ~ -4.187070539970178
                                   ||  |                             vs 14xxx
 Amdahl-Law-fair-4-blks/2-threads--+|  | ~~~~~~~~~~~~~ /4+ threads???
 while(){...}-----------------------+  |
 ThreadLocalRandom src-of-randomness---+

                          1E+9
                              SEQ   wTLR PI ~ 3.141670832 calculated in 7142[ms] | log10 DIFF ~ -4.106913165387672
                              SEQ   wTLR PI ~ 3.141631852 calculated in 6913[ms] | log10 DIFF ~ -4.40673154655876
                              PAR3/4wTLR PI ~ 3.141499044 calculated in 7127[ms] | log10 DIFF ~ -4.028679657876861
                              PAR3/4wTLR PI ~ 3.141521608 calculated in 7420[ms] | log10 DIFF ~ -4.148462876047807
                              PAR3/2wTLR PI ~ 3.141519396 calculated in 6944[ms] | log10 DIFF ~ -4.135147373916102
                              PAR3/2wTLR PI ~ 3.141602576 calculated in 6954[ms] | log10 DIFF ~ -5.0033828225658254
                              Real       PI = 3.141592653589793

                              SEQ   wTLR PI ~ 3.141622264 calculated in 6967[ms] | log10 DIFF ~ -4.52855557608415
                              SEQ   wTLR PI ~ 3.141507096 calculated in 6894[ms] | log10 DIFF ~ -4.067741458256534
                              PAR3/4wTLR PI ~ 3.141655584 calculated in 7106[ms] | log10 DIFF ~ -4.2011394373291715
                              PAR3/4wTLR PI ~ 3.141564284 calculated in 7386[ms] | log10 DIFF ~ -4.547146943793993
                              PAR3/2wTLR PI ~ 3.141547824 calculated in 6919[ms] | log10 DIFF ~ -4.348435235065685
                              PAR3/2wTLR PI ~ 3.141512504 calculated in 6959[ms] | log10 DIFF ~ -4.096098696030472
                              Real       PI = 3.141592653589793

 -----< initial tests >-----
                             stream SEQ: PI ~ 3.14148516 calculated in 2946 msecs
                              naive SEQ2 PI ~ 3.14149424 calculated in 1830 msecs

                             stream PAR: PI ~ 3.14171496 calculated in 2853 msecs
                             stream PAR2 PI ~ 3.1414986  calculated in 3017 msecs
                              naive PAR3 PI ~ 3.14157392 calculated in 1842 msecs
                                    Real PI = 3.141592653589793

                             stream SEQ: PI ~ 3.1416566  calculated in 2977 msecs
                              naive SEQ2 PI ~ 3.14167044 calculated in 1868 msecs
                             stream PAR: PI ~ 3.1415044  calculated in 2617 msecs
                             stream PAR2 PI ~ 3.14159492 calculated in 3022 msecs
                              naive PAR3 PI ~ 3.14154072 calculated in 1811 msecs
                                    Real PI = 3.141592653589793
 -----< extended tests >-----
                              SEQ:   MrndPI ~ 3.14184796 calculated in 3003[ms] | log10 DIFF ~ -3.5929382808376515
                              SEQ:   MrndPI ~ 3.14152884 calculated in 2769[ms] | log10 DIFF ~ -4.195086823728298
                              SEQ2  wMrndPI ~ 3.14192708 calculated in 1895[ms] | log10 DIFF ~ -3.475699432925146
                              SEQ2  wMrndPI ~ 3.14171308 calculated in 1846[ms] | log10 DIFF ~ -3.9192782593456794
                              SEQ   wTLR PI ~ 3.14165348 calculated in  739[ms] | log10 DIFF ~ -4.215907813544407
                              SEQ   wTLR PI ~ 3.14136852 calculated in  749[ms] | log10 DIFF ~ -3.649493053020102
                              PAR:  sMrndPI ~ 3.14179224 calculated in 2857[ms] | log10 DIFF ~ -3.699869033054068
                              PAR:  sMrndPI ~ 3.141472   calculated in 2652[ms] | log10 DIFF ~ -3.9184597520458992
                              PAR2  sMrndPI ~ 3.14140116 calculated in 2940[ms] | log10 DIFF ~ -3.717845759366814
                              PAR2  sMrndPI ~ 3.14157204 calculated in 2930[ms] | log10 DIFF ~ -4.685846370584897
                              PAR3/1sTLR PI ~ 3.14143636 calculated in 2044[ms] | log10 DIFF ~ -3.8060588337186982
                              PAR3/1sTLR PI ~ 3.14131304 calculated in 2042[ms] | log10 DIFF ~ -3.5534417248121435
                              PAR3/4wMrndPI ~ 3.14142288 calculated in 1832[ms] | log10 DIFF ~ -3.770129868268526
                              PAR3/4wMrndPI ~ 3.1417362  calculated in 1785[ms] | log10 DIFF ~ -3.843007663821216
                              PAR3/4wTLR PI ~ 3.14152796 calculated in  746[ms] | log10 DIFF ~ -4.189138749553116
                              PAR3/4wTLR PI ~ 3.14163412 calculated in  748[ms] | log10 DIFF ~ -4.382303560363832
                              PAR3/2wTLR PI ~ 3.14163616 calculated in  757[ms] | log10 DIFF ~ -4.361446749657272
                              PAR3/2wTLR PI ~ 3.14137776 calculated in  732[ms] | log10 DIFF ~ -3.6677765391807546
                              Real       PI = 3.141592653589793

                          1E+8
                              SEQ   wTLR PI ~ 3.14173396 calculated in  710[ms] | log10 DIFF ~ -3.8498381364228056
                              SEQ   wTLR PI ~ 3.14174612 calculated in  714[ms] | log10 DIFF ~ -3.813986665516408
                              PAR3/4wMrndPI ~ 3.14170104 calculated in 2022[ms] | log10 DIFF ~ -3.9650251674482195
                              PAR3/4wMrndPI ~ 3.1418928  calculated in 1725[ms] | log10 DIFF ~ -3.522666846499994
                              PAR3/4wTLR PI ~ 3.14170568 calculated in  734[ms] | log10 DIFF ~ -3.9468200656590717
                              PAR3/4wTLR PI ~ 3.14148048 calculated in  749[ms] | log10 DIFF ~ -3.9501093815571333
                              PAR3/2wTLR PI ~ 3.14175252 calculated in  709[ms] | log10 DIFF ~ -3.7962427769936427
                              PAR3/2wTLR PI ~ 3.14175248 calculated in  741[ms] | log10 DIFF ~ -3.796351454937919
                              Real       PI = 3.141592653589793
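For reference, here is a minimal reconstruction of what a split-task, plain-loop, PAR3-style approach might look like: a trivial naive loop per block, run in parallel, with the partial hit-counts summed at the top level. This is an assumption on my part ( the original PAR3 code lives behind the URL-referred platform ); `NaiveParPi` and its method names are hypothetical:

```java
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.IntStream;

public class NaiveParPi {

    // A plain loop per block: no per-element stream plumbing, no .filter().
    static long countHits(long samples) {
        ThreadLocalRandom rng = ThreadLocalRandom.current();
        long hits = 0;
        for (long i = 0; i < samples; i++) {
            double x = rng.nextDouble();
            double y = rng.nextDouble();
            if (x * x + y * y < 1) hits++;
        }
        return hits;
    }

    // Split the sample budget into a few blocks, run one loop per block
    // in parallel, and sum the partial hit-counts at the end.
    static double estimatePi(long numSamples, int blocks) {
        long perBlock = numSamples / blocks;
        long hits = IntStream.range(0, blocks)
                .parallel()
                .mapToLong(b -> countHits(perBlock))
                .sum();
        return 4.0 * hits / (perBlock * (double) blocks);
    }

    public static void main(String[] args) {
        System.out.println("PI ~ " + estimatePi(100_000_000L, 4));
    }
}
```

The stream here only distributes a handful of coarse blocks, so its per-element overhead is negligible compared to the `rangeClosed().filter().count()` pipeline that touches every single sample.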

What must one always know before making a [SEQ] | [PAR] decision?

Such a decision is neither a toss of a coin nor an easy & straightforward step. We hear many times about the immense powers of parallel computing, but Mother Nature does not let us gain without pain.

Gene AMDAHL formulated a law which demonstrates the speedups possible for a process if more processors get harnessed.

The story is not trivial even in the simplified Amdahl's Law formulation ( details follow below ).


Given a process P, we may assume that:

  • for every P there is always a part scheduled and executed in a [SERIAL] way ( let's call it the [SEQ]-part ), and similarly
  • for every P there may be one or more parts of P that may get executed in a [PARALLEL] fashion ( as resources and mutual-(mis)-coordination permit ).

Amdahl put such a pure-[SERIAL] process-execution into comparison with a possible [SEQ]+[PAR] process-execution, using the former part as-is and gaining any possible speedup only from the latter part, by splitting it among N processors. The speedup S is:

            1
S =  ________________; where
          ( 1 - s )        s := duration a[SEQ]-part took
     s  + _________     (1-s):= duration a[PAR]-part took before split
              N            N := number of [PAR]-processors to split the task onto

Even this simplified formulation immediately warns, that:

  • The more time the [SEQ]-part takes, the less speedup the [PAR]-part may deliver by splitting onto any amount of N > 1 processors

So the first step in one's decision making is to isolate and benchmark the [SEQ]-part and the [PAR]-part, so as to get a first approximation of what levels of speedup are principally attainable ( even with an infinite amount of processors ).
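The simplified formulation above can be evaluated directly. A tiny sketch ( class and method names are illustrative, not from the thread ), showing how a 10% [SEQ]-part caps the speedup at 10x no matter how many processors are added:

```java
public class AmdahlSimple {

    // Simplified Amdahl's Law speedup:
    //   s = fraction of the process that is [SEQ] (serial)
    //   N = number of [PAR]-processors the (1 - s) part is split onto
    static double speedup(double s, int N) {
        return 1.0 / (s + (1.0 - s) / N);
    }

    public static void main(String[] args) {
        // Two processors on a 10%-serial process: well under 2x.
        System.out.println(speedup(0.10, 2));          // ~1.82
        // Even a million processors cannot beat 1/s = 10x.
        System.out.println(speedup(0.10, 1_000_000));  // just under 10.0
    }
}
```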

Next comes the hard part of the truth:

There is no such thing as a free lunch. Mother Nature steps in; there is no gain without pain.

One can always spend some reasonable amount of science on re-factoring the [SEQ]-part, which yields immediate effects on the overall process-duration. The theory of [?TIME][?SPACE]-complexity ( ? == { c-onst | p-olynomial | exp-onential } ) principally introduces the potential costs and savings achievable in these efforts. Benchmarks, meaning REALISTIC benchmarks, will tell the difference. What works nice and smooth at in-cache sizes will choke and progress much slower at beyond-cache-size scales, etc. So always design REALISTIC benchmarks at scales that remain relevant for the target process-scaling; otherwise poor decisions come from the table of such a scaling-naive experimenter.

Next, one can harness some tools, so as to get the [PAR]-part executed faster on some available sub-set of processors.

Given that the [SEQ]-part was squashed close to the silicon limits, and the [PAR]-part duration remains a reasonable portion of the whole original process-duration, the [PAR]-tools may indeed help --- BUT AT A COST. One always has to pay some cost for "arranging" all the toys so as to start executing in [PAR]. Similarly, at the end of the [PAR]-scheduled code-execution, there is another step: somehow collect the results, deactivate / terminate the individual [PAR]-executions, and let the results be delivered back into the [SEQ]-part of the code, where the ( hopefully ) "accelerated" results get further used.

All this is called overhead. The overhead, let's call it o, remains hidden from one's eyes in the simplified ( overhead-naive ) Amdahl's Law formulation, but the real-world execution will always demonstrate this add-on part in benchmarking, as the code cannot "skip" it. There is no magic instruction for a direct "jump"-from-[SEQ]-into-[PAR] and a "jump"-from-[PAR]-back-to-[SEQ]; such things simply do not exist...

Overhead-strict formulation of the Amdahl's Law speedup S:

               1
S =  __________________________; where s, ( 1 - s ), N were defined above
                ( 1 - s )            pSO:= [PAR]-Setup-Overhead     add-on
     s  + pSO + _________ + pTO      pTO:= [PAR]-Terminate-Overhead add-on
                    N               

Reading both formulations, one need not be a mathematical genius to realise that it is quite enough to have to pay sufficiently expensive costs of ( pSO + pTO ) for the [PARALLEL]-version to get slower, not faster, than the pure [SERIAL]-version of a process that was naively expected to get theoretically accelerated, even if an infinite amount of processors is harnessed.

Q.E.D.
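The overhead-strict formulation can be made concrete with a few numbers. A tiny illustrative sketch ( the class and method names, and the 0.40 / 0.20 overhead fractions, are my arbitrary assumptions, not measured values ):

```java
public class AmdahlOverhead {

    // Overhead-strict Amdahl's Law speedup:
    //   s          = [SEQ]-fraction of the original process
    //   pSO / pTO  = [PAR]-Setup / [PAR]-Terminate overheads,
    //                as fractions of the original process duration
    //   N          = number of [PAR]-processors
    static double speedup(double s, double pSO, double pTO, int N) {
        return 1.0 / (s + pSO + (1.0 - s) / N + pTO);
    }

    public static void main(String[] args) {
        // With zero overhead, 2 processors on a 5%-serial process help:
        System.out.println(speedup(0.05, 0.0, 0.0, 2));    // ~1.90
        // Add 60% worth of setup + terminate overhead and the "parallel"
        // version becomes slower than the serial one ( S < 1 ):
        System.out.println(speedup(0.05, 0.40, 0.20, 2));  // ~0.89
    }
}
```

This is exactly the trap the measured `Math.random()` runs above fell into: the contention behaved as a large add-on overhead, pushing S below 1.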

In the Epilogue section there is an easy-to-experiment, interactive + animated toy-tool with sliders for both o and p == ( 1 - s ), to evaluate the effects of the overhead-strict Amdahl's Law for one's individual [SEQ]-[PAR] process-compositions. Feel free to click through and experiment on your own.

This will help you test when the principal [PAR]-overheads start to get at least compensated by an N-CPU-[PARALLEL]-execution, and when any such efforts remain wasted in principally worse performance than a [SERIAL]-process execution, which is always, in this sense, an overhead-free and cheap competitor.

user3666197
  • I could understand, there might have been some rationale to delete the first comment from Eugene. **But what better was created by deleting** my response to not only Eugene, but also to all others, who might have found themselves having some doubts yet after possibly just fast-reading through the rather long and complex to follow answer? **What new benefit did such an executed Right-to-Delete actually create?** – user3666197 Sep 13 '17 at 20:42
  • If you want to add things to your answer, use the [edit] link, not the comments. It would have made no sense to keep a reply to Eugene when his comment was being deleted. In general, it makes no sense to have information buried in the comments. – Cody Gray - on strike Sep 19 '17 at 16:20
  • @CodyGray With all due respect, you did not answer the question. Your expressed opinion was registered, but not answering any of the questions above. Is there any technical way to re-read the such deleted comments after an execution of some administrative power has censored such text? – user3666197 Sep 19 '17 at 17:04
  • No. Comments are temporary and exist only to suggest clarifications or ask for more information. They aren't archived anywhere, and they are regularly cleaned up when they become obsolete. Again, there is no censorship here. If you want to express your opinions, put them in your answer. – Cody Gray - on strike Sep 19 '17 at 17:06
  • @CodyGray Understood. Thanks. ( FYI, have seen many examples of censorship, even deleting reasonable & merit-focused parts of the text from Answers either, so may be, your expressed belief is somewhat disconnected ( as far as I can observe it in person ) from the reality. If that happens to me, the law of symmetry means it is in general highly probable it happens to many other members here ). – user3666197 Sep 19 '17 at 17:08