Why?
Because trying to use tools in cases where those tools principally cannot and DO NOT cover their costs of entry simply does not pay off.
I love python.
I pray educators better explained the costs of tools, otherwise we get lost in these wished-for [PARALLEL]-schedules.
A few facts:
No.0: With a lot of simplification, python intentionally uses the GIL to [SERIAL]-ise access to variables and thus avoid any potential collision from [CONCURRENT] modifications, paying the add-on costs of this GIL-stepped dancing in extra time ( a minimal threading sketch of this effect follows right after this list )
No.1: [PARALLEL]-code execution is way harder than "just"-[CONCURRENT]-code execution ( read more )
No.2: A [SERIAL]-process has to pay extra costs, if trying to split work onto [CONCURRENT]-workers
No.3: If a process does inter-worker communication, immense extra costs per data exchange are paid
No.4: If the hardware has too few resources for the [CONCURRENT] processes, the results get even worse
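( A minimal, illustrative sketch of fact No.0, not code from the question: two CPU-bound threads in standard CPython do not finish any sooner than the same two calls run one after another, because the GIL lets only one thread execute python bytecode at any moment; absolute timings are machine-dependent: )
>>> import threading, time
>>> def burn( n = 10**7 ):                        # a CPU-bound, pure-python workload
...     s = 0
...     for i in xrange( n ):                     # py3: range()
...         s += i * i
...     return s
...
>>> t0 = time.time(); _ = burn(); _ = burn(); print time.time() - t0      # two calls, back to back
>>> t0 = time.time()
>>> T1 = threading.Thread( target = burn ); T2 = threading.Thread( target = burn )
>>> T1.start(); T2.start(); T1.join(); T2.join()
>>> print time.time() - t0                        # typically no faster ( often slower ) than the pair above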
To get a feel for what can be done in standard python 2.7.13:
Efficiency comes from better using the silicon, not from bulldozing syntax-constructors into territories where they are legal, but where their performance has adverse effects on the end-to-end speed of the experiment-under-test:
You pay about 8 ~ 11 [ms] just to iteratively assemble a zero-filled list array1 of 100000 items
>>> from zmq import Stopwatch
>>> aClk = Stopwatch()
>>> aClk.start();array1 = [ 0 for i in xrange( 100000 ) ];aClk.stop()
9751L
10146L
10625L
9942L
10346L
9359L
10473L
9171L
8328L
( the Stopwatch().stop() method yields [us] elapsed since .start() )
while the memory-efficient, vectorisable, GIL-free approach can do the same about +230x ~ +450x faster:
>>> import numpy as np
>>>
>>> aClk.start();arrayNP = np.zeros( 100000 );aClk.stop()
15L
22L
21L
23L
19L
22L
>>> aClk.start();arrayNP = np.zeros( 100000, dtype = np.int );aClk.stop()
43L
47L
42L
44L
47L
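( For readers without pyzmq installed, the same kind of measurement can be reproduced with the standard-library timeit module; a minimal sketch, not part of the timings above, which reports [s], not [us], and whose absolute figures will of course differ per machine: )
>>> import timeit
>>> min( timeit.repeat( "[ 0 for i in xrange( 100000 ) ]", repeat = 10, number = 1 ) )                    # plain-python list build, best-of-10, in [s]
>>> min( timeit.repeat( "np.zeros( 100000 )", setup = "import numpy as np", repeat = 10, number = 1 ) )   # numpy zero-vector, best-of-10, in [s]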
So, using the proper tools just starts the story of performance:
>>> def test_SERIAL_python( nLOOPs = 100000 ):
... aClk.start()
... for i in xrange( nLOOPs ): # py3 range() ~ xrange() in py27
... array1[i] = i**2 # your loop-code
... _ = aClk.stop()
... return _
While a naive [SERIAL]-iterative implementation works, you pay immense costs for opting to do so, ~ 70 [ms] for a 100000-D vector:
>>> test_SERIAL_python( nLOOPs = 100000 )
70318L
69211L
77825L
70943L
74834L
73079L
Using a more suitable / appropriate tool costs just ~ 0.2 [ms], i.e. ~ +350x FASTER:
>>> aClk.start();arrayNP[:] = arrayNP[:]**2;aClk.stop()
189L
171L
173L
187L
183L
188L
193L
and with another trick, a.k.a. an in-place modus operandi:
>>> aClk.start();arrayNP[:] *= arrayNP[:];aClk.stop()
138L
139L
136L
137L
136L
136L
137L
This yields ~ +514x SPEEDUP, just from using the appropriate tool.
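( A further, hedged variant, not measured above: numpy's ufunc out= parameter performs the same in-place squaring without the slicing notation and may be expected to land in the same ballpark: )
>>> aClk.start(); _ = np.square( arrayNP, out = arrayNP ); aClk.stop()    # in-place square via the ufunc out= parameter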
The art of performance is not in following marketing-sounding claims about parallelising ( at-any-cost ), but in using know-how based methods that pay the least costs for the biggest achievable speedups.
For "small"-problems, typical costs of distributing "thin"-work-packages are indeed hard to get covered by any potentially achievable speedups, so "problem-size" actually limits one's choice of methods, that could reach positive gain ( speedups of 0.9 or even << 1.0 are so often reported here, on StackOverflow, that you need not feel lost or alone in this sort of surprise ).
Epilogue
The number of processors counts.
The number of cores counts.
But cache-sizes + NUMA-irregularities count more than that.
Smart, vectorised, HPC-cured, GIL-free libraries matter ( numpy et al - thanks a lot, Travis OLIPHANT & al ... Great Salute to his team ... )
An overhead-strict Amdahl's Law (re-)formulation explains why even many-N-CPU parallelised code execution may ( and indeed often does ) suffer from speedups << 1.0
The overhead-strict formulation of the Amdahl's Law speedup S includes the very costs of the paid [PAR]-Setup + [PAR]-Terminate Overheads, explicitly:
                          1
S =  _______________________________________ ;  where s, ( 1 - s ), N were defined above
                   ( 1 - s )                    pSO:= [PAR]-Setup-Overhead     add-on
     s  +  pSO  +  _________  +  pTO            pTO:= [PAR]-Terminate-Overhead add-on
                       N
( an interactive, animated tool for 2D visualisation of the effects of these performance constraints is cited here )
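( A small, illustrative helper, using the s, N, pSO, pTO symbols from the formula above with placeholder inputs, not measured values, shows how fast the overhead add-ons push S below 1.0 for "thin" work-packages: )
>>> def speedup_overhead_strict( s, N, pSO, pTO ):
...     """ overhead-strict Amdahl's Law: s, pSO, pTO expressed as fractions of the original [SERIAL] run-time """
...     return 1.0 / ( s + pSO + ( 1.0 - s ) / N + pTO )
...
>>> speedup_overhead_strict( s = 0.1, N = 4, pSO = 0.0, pTO = 0.0 )   # classical Amdahl       ~ 3.08x
>>> speedup_overhead_strict( s = 0.1, N = 4, pSO = 0.2, pTO = 0.1 )   # add-on overheads paid  ~ 1.60x
>>> speedup_overhead_strict( s = 0.1, N = 4, pSO = 1.0, pTO = 0.5 )   # "thin" work-packages   ~ 0.55x << 1.0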