Q : How are they handled by scikit-learn?
So, let's start with the documentation, as-is in 2019/Q4:
n_jobs : int or None, optional ( default = None )

    Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
WHAT ALTERNATIVES DO WE HAVE HERE & WHICH IS THE BEST ONE ?
A)VOID parallelism at all
B)LOCK CPU instead of enhancing the performance
C)USTOM setup CPU-core mapping for maximum performance
D)ISTRIBUTE workloads across a cluster of Dask-nodes
OPTION A : A)VOID parallelism at all
So, one can explicitly avoid parallelism when processing a hierarchical composition of a

RandomizedSearchCV( Pipeline( [ ( … ), ( …, SGDClassifier( n_jobs=13, … ) ), ] ), …, n_jobs=13, … )
for the CPU-intensive, yet MEM-bound processing of the tasks, by explicitly selecting the "threading"-backend inside a context-manager:

from joblib import parallel_backend              # also ref.'d via sklearn.utils.parallel_backend

with parallel_backend( 'threading' ):
     grid2.fit( … )
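For context, a minimal, self-contained construction of a grid2-like object might look as follows ( a sketch only - the estimator choice, the parameter grid and the synthetic data below are illustrative placeholders for the original's elided "…" details ):

from joblib                  import parallel_backend
from sklearn.datasets        import make_classification
from sklearn.linear_model    import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler

X, y  = make_classification( n_samples=10000, n_features=20, n_classes=3, n_informative=5 )

grid2 = RandomizedSearchCV( Pipeline( [ ( 'scaler', StandardScaler() ),
                                        ( 'clf',    SGDClassifier( n_jobs=13 ) ),
                                        ] ),
                            { 'clf__alpha': [ 1e-4, 1e-3, 1e-2 ] },  # an illustrative grid
                            n_jobs=13
                            )

with parallel_backend( 'threading' ):            # all threads will but queue for the GIL-lock
     grid2.fit( X, y )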
Here, however many threads might have got instantiated, all of them wait for one central GIL-lock. All wait, but one executes. This is the known GIL-lock-introduced re-[SERIAL]-isation of any thread-based code-execution into an even more expensive, just interleaved, pure-[SERIAL] python code-execution. Except for those cases ( not this one ) where this is principally a latency-masking trick ( making a bit better use of the time an I/O-bound task spends in NOPs, waiting for an I/O-operation to finish and yield its result(s) ), this will not help in getting your python-based ML-pipeline any faster, but the very opposite.
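A quick way to observe this effect is a timing sketch like the one below ( absolute timings will vary per machine; the cpu_bound() toy-function is a hypothetical stand-in for any GIL-holding, pure-python work ):

from time   import perf_counter
from joblib import Parallel, delayed, parallel_backend

def cpu_bound( n = 2000000 ):                    # a pure-python loop: holds the GIL while computing
    return sum( i * i for i in range( n ) )

for backend in ( 'threading', 'loky' ):          # guard with if __name__ == '__main__': on Windows
    t0 = perf_counter()
    with parallel_backend( backend ):
        Parallel( n_jobs=4 )( delayed( cpu_bound )() for _ in range( 8 ) )
    print( backend, perf_counter() - t0 )        # 'threading' ~ pure-[SERIAL] time, 'loky' ~ time / n_jobs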
OPTION B : B)LOCK CPU instead of enhancing the performance
One may choose a better, less prohibitive backend - "multiprocessing", or in more recent joblib-releases also "loky" - where the GIL-lock no longer causes us trouble ( at a cost of instantiating n_jobs-many python-process replicas, each one having its own internal and unavoidable GIL-lock, now at least not competing against its own sibling threads for grabbing the GIL-lock before executing a time-slice amount of work ), yet the story does not end here.

This option is the typical case where multiple levels of n_jobs-aware processing appear inside the same pipeline - each of them fighting ( at the O/S-scheduler level ) to get a slice of CPU-core time inside a time-sharing run of more processes than there are CPU-cores.

The result? Processes get spawned, yet have to wait in a queue for their turn ( if there are more of them than the number of cores permitted for the user - check not only the number of cores, but also the permitted CPU-core-affinity settings, enforced by the O/S for the given user's/process's effective rights, which on a tightly managed system could be way fewer than the number of physical ( or virtualisation-emulated ) CPU-cores ), losing time, losing CPU-cache pre-fetched blocks of data ( so again and again paying the expensive RAM-fetches ( ~300-350 ns each such time ), instead of re-using the pre-fetched ( and already paid for ) data from the L1/L2/L3-cache at a cost of just about 0.5 ns (!) )
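A back-of-envelope sketch of this oversubscription ( the n_jobs values below mirror the composition shown under OPTION A; the names are illustrative ):

import os

outer_n_jobs = 13                                # RandomizedSearchCV( ..., n_jobs=13 )
inner_n_jobs = 13                                # SGDClassifier( n_jobs=13 ) inside the Pipeline
cores        = os.cpu_count()                    # what the box reports ( affinity may permit even fewer )

print( f"processes fighting: ~{outer_n_jobs * inner_n_jobs}, cores available: {cores}" )
# ~169 process-replicas time-sharing, say, 8 cores - each losing CPU-cache on every forced switch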
OPTION C : C)USTOM setup CPU-core mapping for maximum performance
A good engineering practice is to carefully map CPU-cores for processing.
Given the right backend is in place, one has to decide where the performance bottleneck is - here, most probably ( with a chance for an exception if having an option D) equipped with a huge cluster of strong & fat-RAM machines ), one will prefer to have each and every SGDClassifier.fit() run faster ( spending more n_jobs-specified process-instances on the most expensive sub-task - the training ), rather than having "more" RandomizedSearchCV()-initiated "toys" on the playground, suffocated by lack of RAM and CPU-cache inefficiencies.
Your code will always, even behind the curtain of not knowing all the details, have to "obey" to run not on all of the CPU-cores, but only on those that any such multiprocessing-requested sub-process is permitted to harness, the number of which is not higher than: len( os.sched_getaffinity( 0 ) ). If interested in details, read and use the code here.
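For a quick check of both numbers ( note that os.sched_getaffinity() is POSIX-only; on Windows one would have to fall back to psutil or similar ):

import os

print( os.cpu_count() )                          # cores the machine reports
print( len( os.sched_getaffinity( 0 ) ) )        # cores *this* process is actually permitted to use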
In general, good planning and a profiling practice will help attain the best reasonably-achievable configuration of the n_jobs-instantiated processes' mapping onto the available set of CPU-cores. No magic, but common sense, a process-monitor and benchmarking/timing of run-times will help us polish this competence.
OPTION D : D)ISTRIBUTE workloads across a cluster of Dask-nodes
Where possible, using Dask-module-enabled distributed-computing nodes, one may set:
with parallel_backend( 'dask' ):
grid2.fit( … )
which will harness all the Dask-cluster computing resources for getting the "heavy" task completed smarter than is possible with just the localhost CPU/RAM resources. Ultimately, this is the maximum level of concurrent processing possible inside the python-ecosystem in its current as-is state.
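A minimal connection sketch ( the scheduler address below is hypothetical; a dask.distributed cluster must already be running there, and creating the Client is what registers the 'dask' backend with joblib ):

from dask.distributed import Client
from joblib           import parallel_backend

client = Client( 'tcp://10.0.0.1:8786' )         # hypothetical address of a running dask-scheduler
with parallel_backend( 'dask' ):
    grid2.fit( X, y )                            # tasks now ship to the cluster's workers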
