I have a model.predict() method and 65536 rows of data, which takes about 7 seconds to run. I wanted to speed this up using the joblib.parallel_backend tooling, following this example.
This is my code:
import numpy as np
from joblib import load, parallel_backend
from time import perf_counter as time  # time.clock was removed in Python 3.8
from urllib.request import urlopen
NN_model=load(urlopen("http://clima-dods.ictp.it/Users/tompkins/CRM/nnet_3var.jl"))
npt=65536
t=np.random.uniform(low=-1,high=1,size=npt)
u=np.random.uniform(low=-1,high=1,size=npt)
q=np.random.uniform(low=-1,high=1,size=npt)
X=np.column_stack((u,t,q))
t0=time()
out1=NN_model.predict(X)
t1=time()
print("serial",t1-t0)
with parallel_backend('threading', n_jobs=-1):
    out2=NN_model.predict(X)
t2=time()
print("parallel",t2-t1)
And these are my timings:
serial 6.481805
parallel 6.389198
I know from past experience that very small tasks are not sped up by shared-memory parallelism because of the overhead, as the answer posted here also notes, but that is not the case here: the job takes 7 seconds, which should far exceed any overhead. In fact, I traced the load on the machine and it appears to be running serially.
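As a sanity check on that, one can ask the OS which cores the process is actually allowed to run on (a minimal sketch, assuming Linux, where os.sched_getaffinity is available):
import os

# The set of CPU cores this process may be scheduled on; a single-element
# set would mean an inherited affinity mask is forcing serial execution.
print("allowed cores:", os.sched_getaffinity(0))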
What am I doing wrong with the joblib specification? How can I use threading on my desktop to parallelize this task with joblib (or an alternative)?
Edit 1
From the post below, I wondered whether joblib tries to parallelize the model itself, rather than dividing the rows of data into ncore batches and distributing one to each core. I therefore decided that I might need to do this division manually, farming out the data "chunks" to each core. I have now tried using Parallel and delayed instead, chunking the data as per this post:
from joblib import Parallel, delayed

ncore = 8
nchunk = int(npt / ncore)

parallel = Parallel(n_jobs=ncore)
results = parallel(delayed(NN_model.predict)(X[i*nchunk:(i+1)*nchunk, :])
                   for i in range(ncore))
This now runs ncore instances on my machine, but they all run at 1/ncore efficiency (as if something were gating them?) and the wall-clock time is still not improved...
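Since the model and the data do not have to be pickled for threads, I also wondered about a thread-based version of the same chunking (a minimal sketch using joblib's prefer="threads" option; it can only help if the numpy/BLAS calls inside predict release the GIL):
from joblib import Parallel, delayed

# Thread-based chunked prediction: no pickling of NN_model or X;
# useful only if predict() spends its time in GIL-releasing BLAS calls.
results = Parallel(n_jobs=ncore, prefer="threads")(
    delayed(NN_model.predict)(X[i*nchunk:(i+1)*nchunk, :])
    for i in range(ncore)
)
out = np.vstack(results).flatten()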
Edit 2
As an alternative, I have now also tried dividing the dataset manually using the multiprocessing package:
import os
import multiprocessing

def predict_chunk(Xchunk):
    results = NN_model.predict(Xchunk)
    return results

pool = multiprocessing.Pool(processes=ncore)
# widen the affinity mask to cores 0..ncore-1
os.system('taskset -cp 0-%d %s' % (ncore - 1, os.getpid()))
stats = pool.starmap(predict_chunk, ([X[i*nchunk:(i+1)*nchunk, :]] for i in range(ncore)))
res = np.vstack(stats).flatten()
pool.close()
pool.join()
Apart from the overhead of dividing up the input data and restacking the results, the problem should be embarrassingly parallel. Then I recalled earlier posts and wondered whether the slow performance was due to the task-affinity issue upon importing numpy reported here, which is why I added the os.system command. That doesn't seem to help, though: I still see each of the 8 cores at around 12% CPU load, and the overall timing is now slightly slower than the serial solution because of the aforementioned overhead.
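For reference, the same pinning can be done from Python without shelling out to taskset (a sketch using os.sched_setaffinity, which is Linux-only; note it has to run before the pool is created for the forked workers to inherit the widened mask):
import os

# Clear any inherited affinity mask so this process (and children forked
# afterwards) may use cores 0..ncore-1.
os.sched_setaffinity(0, range(ncore))
print("allowed cores:", os.sched_getaffinity(0))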
Edit 3
I've now tried to use ray instead:
import ray

@ray.remote
def predict_chunk(Xchunk, start, end):
    results = NN_model.predict(Xchunk[start:end, :])
    return results

ray.init(num_cpus=ncore)
data_id = ray.put(X)
stats = ray.get([predict_chunk.remote(data_id, i*nchunk, (i+1)*nchunk) for i in range(ncore)])
res = np.vstack(stats).flatten()
Again, this creates 8 subprocesses, but they are all running on a single CPU, so the parallel version is slower than the serial one.
I'm almost certain this is related to the affinity issue referred to above, but the solutions don't seem to be working.
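One way to test that directly is to query the affinity mask from inside the workers themselves (a sketch with ray; the same idea works with a multiprocessing pool):
import os
import ray

@ray.remote
def worker_affinity():
    # Cores this worker may run on; a single-element set in every worker
    # would confirm that the children inherited a restricted mask.
    return os.sched_getaffinity(0)

print(ray.get([worker_affinity.remote() for _ in range(ncore)]))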
This is a summary of the architecture:
Linux hp6g4-clima-5.xxxx.it 4.15.0-124-generic #127-Ubuntu SMP Fri Nov 6 10:54:43 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux