Python joblib - Running parallel code within parallel code

Question

I am working on a project that I intend to make more efficient by using a parallel function with joblib with shared memory.

However, I also intend to conduct a parametric study on the program by running the process a large number of times with different parameters (i.e. without shared memory).

I was wondering if this was doable in Python/joblib.

Edit : _2020-06-19

As another user mentioned, I should, I will clarify what in my code I wanted to parallelized. Essentially I have a 3D numpy array representing some physical space, which I populate with a large number of truncated Gaussians (only affecting a finite number of elements). Full vectorization was not found to particularly speed up the code due to the bottleneck being memory access, and I wanted to try parallelization, since I iterate over all ith-Gaussian centers and add its contributions to the overall field. (These loops will share variables to an extent)

The idea of parallel code within parallel code coming up is that I will also want to run a large number of such processes simultaneously using cluster accessed online in order to conduct a parametric study of the overall performance of the project as a whole in regards to an unspecified metric. Thus these loops would be fully independent.

A modified excerpt of the inner loop is posted here. Unfortunately it does not seem to speed up performance, and in the case where I do not split up the list of Gaussian centers into two arrays for each core it is even worse, and I am currently investigating this.

import numpy as np
import time
from joblib import Parallel, delayed, parallel_backend
from extra_fns import *

time.perf_counter()
nj = 2
set_par = True
split_var = True

# define 3d grid
nd = 3
nx = 250
ny = 250
nz = 250
x = np.linspace(0, 1, nx)
y = np.linspace(0, 1, ny)
z = np.linspace(0, 1, nz)

# positions of gaussians in space
pgrid = np.linspace(0.05, 0.95 , 20)
Xp, Yp, Zp = np.meshgrid(pgrid,pgrid,pgrid)
xp = Xp.ravel()
yp = Yp.ravel()
zp = Zp.ravel()
Np = np.size(xp)
s = np.ones(Np) # intensity of each gaussian
# compact gaussian representation
sigma = x[1]-x[0]
max_dist = sigma*(-2*np.log(10e-3))

# 3D domain: 
I = np.zeros((ny, nx, nz))
dx = x[1] - x[0]
dy = y[1] - y[0]
dz = z[1] - z[0]


dix = np.ceil(max_dist/dx)
diy = np.ceil(max_dist/dy)
diz = np.ceil(max_dist/dz)

def run_test(set_par, split_var, xp, yp, zp, s):
    def add_loc_gaussian(i):
        ix = round((xp[i] - x[0]) / dx)
        iy = round((yp[i] - y[0]) / dy)
        iz = round((zp[i] - z[0]) / dz)
        iix = np.arange(max(0, ix - dix), min(nx, ix + dix), 1, dtype=int)
        iiy = np.arange(max(0, iy - diy), min(ny, iy + diy), 1, dtype=int)
        iiz = np.arange(max(0, iz - diz), min(nz, iz + diz), 1, dtype=int)
        ddx = dx * iix - xp[i]
        ddy = dy * iiy - yp[i]
        ddz = dz * iiz - zp[i]
        gx = np.exp(-1 / (2 * sigma ** 2) * ddx ** 2)
        gy = np.exp(-1 / (2 * sigma ** 2) * ddy ** 2)
        gz = np.exp(-1 / (2 * sigma ** 2) * ddz ** 2)
        gx = gx[np.newaxis,:, np.newaxis]
        gy = gy[:,np.newaxis, np.newaxis]
        gz = gz[np.newaxis, np.newaxis, :]
        I[np.ix_(iiy, iix, iiz)] += s[i] * gy*gx*gz

    if set_par and split_var: # case 1
        mp = int(Np/nj) # hard code this test fn for two cores
        xp_list = [xp[:mp],xp[mp:]]
        yp_list = [yp[:mp],yp[mp:]]
        zp_list = [zp[:mp],zp[mp:]]
        sp_list = [s[:mp],s[mp:]]

        def core_loop(j):
            xpt = xp_list[j]
            ypt = yp_list[j]
            zpt = zp_list[j]
            spt = sp_list[j]

            def add_loc_gaussian_s(i):
                ix = round((xpt[i] - x[0]) / dx)
                iy = round((ypt[i] - y[0]) / dy)
                iz = round((zpt[i] - z[0]) / dz)
                iix = np.arange(max(0, ix - dix), min(nx, ix + dix), 1, dtype=int)
                iiy = np.arange(max(0, iy - diy), min(ny, iy + diy), 1, dtype=int)
                iiz = np.arange(max(0, iz - diz), min(nz, iz + diz), 1, dtype=int)
                ddx = dx * iix - xpt[i]
                ddy = dy * iiy - ypt[i]
                ddz = dz * iiz - zpt[i]
                gx = np.exp(-1 / (2 * sigma ** 2) * ddx ** 2)
                gy = np.exp(-1 / (2 * sigma ** 2) * ddy ** 2)
                gz = np.exp(-1 / (2 * sigma ** 2) * ddz ** 2)
                gx = gx[np.newaxis, :, np.newaxis]
                gy = gy[:, np.newaxis, np.newaxis]
                gz = gz[np.newaxis, np.newaxis, :]
                I[np.ix_(iiy, iix, iiz)] += spt[i] * gy * gx * gz

            for i in range(np.size(xpt)):
                add_loc_gaussian_s(i)

        Parallel(n_jobs=2, require='sharedmem')(delayed(core_loop)(i) for i in range(2))

    elif set_par: # case 2
        Parallel(n_jobs=nj, require='sharedmem')(delayed(add_loc_gaussian)(i) for i in range(Np))

    else: # case 3
        for i in range(0,Np):
            add_loc_gaussian(i)

run_test(set_par, split_var, xp, yp, zp, s)
print("Time taken: {} s".format(time.perf_counter()))

It's just a simple conceptual question. There are other similar questions such as https://stackoverflow.com/questions/26707177/parallel-within-parallel-code, and I am simply asking the same question in regards to specifically Python's capabilities. — George, Jun 17 '20 at 20:17

user3666197 · Answer 1 · 2020-06-18T17:09:12.257

"... doable in Python/joblib ..."

There is no problem with conceptual intent, yet ...

"... I intend to make more efficient ..."

this is the hardest part of the story.

Why ?

CPU microoperation NOP ( do nothing ) takes ~ 0.1 [ns] in 2020/2H.

CPU microoperations take about ~ 0.3 [ns] ADD/SUB, ~ 10 [ns] DIV in 2020/2H.

CPU can have more than one core and CISC architectures can operate a pair of hardware threads on each of such CPU core.

CPU can evolve, will evolve, yet will not do any magic leapfrog "jumps" beyond the reality of constraints put into the game by the laws of physics. Never.

CPU can be scheduled by the O/S scheduler to interleave many more software threads ( streams of code-execution ) as this interleaved code-executions generates for us, slow, with about 25-Hz cortex visual sampling comprehension, using not more than only one (voice) or two-handed input "devices", an illusion of multi-tasking operating system, yet all such work is just sufficiently enough (no warranties for non Real-Time (HRT) operating systems) put into a few pairs of CPU-core threads.

CPU can achieve the most efficient processing, if the computing tasks are not wildly interleaved. The less the better.

CPU will in such "compact" work-stream orchestration remain within those about ~ 0.3 ~ 10 [ns] per uop (a CPU hardware machine instruction) and will best compute, if not going anywhere else for data but into its own hardware registers ( L1 cache "costs" ~ 0.5 [ns] to fetch data from, whereas L2 is ~ 8x more "expensive", L3 is ~ 40x more expensive and RAM can go anywhere from ~70 .. 3++ [ns] to fetch data from ). Interleaved process-executions are thus paying a lot of overhead costs just to re-instate data many times pre-fetched from expensive RAM into less expensive L3, L2 and L1 cache storage ( just re-paying the costs of ~ 300 ~ 350 [ns] each time a piece of data is to get re-fetched since interleaved process-execution does not retain once pre-fetched data after scheduler removed this thread from CPU-core so as to make a space-time for executing another one in the scheduler queue ).

CPU can do its best if not waiting for data from RAM ( memory channels and I/O-bottlenecks are known HPC-efficiency / CPU-starvation enemies for ages ).

These hardware overheads are not "enough", you will have to pay way more :

Python/joblib.Parallel()delayed() constructor is trivial to type, not so for fine-tuning the performance towards a maximum efficiency.

Using a default value of njobs ( or any naive manual setting ) may and will most often decrease the actual efficiency of the processing way under the CPU-hardware performance limits.

There are non-zero add-on costs, that the joblib-spawned processes have to pay. In all cases these pay CPU-hardware add-on costs for each data item re-fetched back from RAM again into the (now O/S scheduler re-selected) CPU-core L3/L2/L1-cache (those ~ 3++ [ns] hundreds nanoseconds) plus it "shares" a weakly prioritised share of the CPU-core code-execution time ( ref. O/S parametrisation and scheduler properties for details on settings for maximum performance / efficiency )
and
last-but-not-least
there are immense ( in scale ~ hundreds of [us] if not [ms] ) add-on costs for the process-instantiation, process-to-process call-signature parameters' transfer ( read SER/DES costs ( often the pickle.dumps() / pickle.loads() ) on parameters' data-transformation + Process-2-Process compressed-data communication exchange_{...( time, time, time... )...}, process-results' data transfer back (if present), i.e. again the SER/DES-pipeline + P2P-communication costs _{...( time, time, time... )...} plus the process-termination add-on costs.

Doing all this anywhere near the CPU-hardware top performance capabilities ceilings is always hard, the more in the relaxed & diverse ecosystem where Python-GIL-lock restricted code-execution + joblib.Parallel()-spawned processes + Cython-ised modules (where you need not have the comfort of controlling/tuning the actual number of their spawned sub-processes, do you?) coexist and this already "efficiency"-tuning wild mix is let being operated inside an ordinary, COTS-grade user-MMI-focused O/S.

If all above was easy to "swallow", the add-on costs of sharing come next :

While the shared-variables protocols exist, I would take whatever measures it takes to avoid paying immense code-execution add-on costs for "using" them.

Who would pay the immense costs to "rent" a Rolls-Royce just for a lazy ride to school at 9:00 AM and to return sometimes in the late afternoon? Doable, yet an immensely expensive "strategy". There are for sure ways to avoid shared-variables and zero-sharing is a must for any HPC-grade software that aims to top-performance with efficiency in mind.

"I am working on a project..."

Paying the bill only after `XY-[man*months]` of burnt efforts may be too late :

Analyze the processing-strategy and all the actual add-on costs a priori the decision.

Late surprises are the most expensive ones.

Even the original, overhead-naive & atomicity-of-work ignoring Amdahl's Law shows, that there is a principal limit - a law of diminishing returns - that you can never circumvent. And that was the add-on costs ignoring optimistic model.

The reality works against your will to get improved performance. The more if macroscopic (multi-processing related) add-on costs take (and they very soon do) dominance. Adding shared-variable communication protocols makes the efficiency many orders of magnitude worse (not only ~2 orders of magnitude latency costs added for cache/RAM re-fetches, but process-to-process re-synchronisation costs "block" free-flow of the most efficient CPU-core camped processing, as dependencies on other off-CPU-core processes arise barrier-like blocking states, when shared-variable's state is checked/re-propagated so as to maintain a system-wide consistency ... at the costs of loosing time and efficiency going to turn wreck havoc ... just for the ease of syntax-sugar of Python shared-variable "comfort".

Yes, the real life is cruel ...

_{But who has ever told us it isn't?

Good luck with managing the Project safe & right!

Learning the Art of Life, going the hard-way forwards, is always enriching us, isn't it?

:o)}

May like to read more on this,
perhaps with code-examples :

If interested in further readings on joblib & revised-Amdahl's Law impacts, feel free to dive in. The Devil is hidden in the details. As always.

I see, thank you for the detailed response, I'll look further into the resources you provided as I wasn't sure how complicated this would be when I first dove in. In regards to the "why", I modified my submission to give details about what each parallel loop was required for. — George, Jun 18 '20 at 22:59

Python joblib - Running parallel code within parallel code

Edit : 2020-06-19

1 Answers1