Q : How are they handled by scikit-learn?
So, let's start with the documentation, as-is in 2019/Q4:
n_jobs : int or None, optional ( default = None )

    Number of jobs to run in parallel. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
WHAT ALTERNATIVES DO WE HAVE HERE & WHICH IS THE BEST ONE ?
A)VOID parallelism at all
B)LOCK CPU instead of enhancing the performance
C)USTOM setup CPU-core mapping for maximum performance
D)ISTRIBUTE workloads across a cluster of Dask-nodes
OPTION A : A)VOID parallelism at all
So, one can explicitly avoid parallelism when processing a hierarchical composition of a

RandomizedSearchCV( Pipeline( [ ( … ), ( …, SGDClassifier( n_jobs=13, … ) ), ] ), …, n_jobs=13, … )
for the CPU-intensive, yet MEM-bound processing of the tasks, by explicitly selecting the "threading"-backend inside a context-manager:

from joblib import parallel_backend              # also ref.'d via sklearn.utils.parallel_backend

with parallel_backend( 'threading' ):
     grid2.fit( … )
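For context, a minimal, self-contained construction of a grid2-like object might look as follows ( a sketch only - the estimator choice, the parameter grid and the synthetic data below are illustrative placeholders for the original's elided "…" details ):

from joblib                  import parallel_backend
from sklearn.datasets        import make_classification
from sklearn.linear_model    import SGDClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline        import Pipeline
from sklearn.preprocessing   import StandardScaler

X, y  = make_classification( n_samples=10000, n_features=20, n_classes=3, n_informative=5 )

grid2 = RandomizedSearchCV( Pipeline( [ ( 'scaler', StandardScaler() ),
                                        ( 'clf',    SGDClassifier( n_jobs=13 ) ),
                                        ] ),
                            { 'clf__alpha': [ 1e-4, 1e-3, 1e-2 ] },  # an illustrative grid
                            n_jobs=13
                            )

with parallel_backend( 'threading' ):            # all threads will but queue for the GIL-lock
     grid2.fit( X, y )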
Here, however many threads might have got instantiated, all of them wait for one central GIL-lock. All wait, but one executes. This is the known GIL-lock-introduced re-[SERIAL]-isation of any thread-based code-execution into an even more expensive, just interleaved, pure-[SERIAL] python code-execution. Except for those cases ( not this one ) where this is principally a latency-masking trick ( making a bit better use of the time an I/O-bound task spends in NOPs, waiting for an I/O-operation to finish and yield its result(s) ), this will not help in getting your python-based ML-pipeline any faster, but the very opposite.
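A quick way to observe this effect is a timing sketch like the one below ( absolute timings will vary per machine; the cpu_bound() toy-function is a hypothetical stand-in for any GIL-holding, pure-python work ):

from time   import perf_counter
from joblib import Parallel, delayed, parallel_backend

def cpu_bound( n = 2000000 ):                    # a pure-python loop: holds the GIL while computing
    return sum( i * i for i in range( n ) )

for backend in ( 'threading', 'loky' ):          # guard with if __name__ == '__main__': on Windows
    t0 = perf_counter()
    with parallel_backend( backend ):
        Parallel( n_jobs=4 )( delayed( cpu_bound )() for _ in range( 8 ) )
    print( backend, perf_counter() - t0 )        # 'threading' ~ pure-[SERIAL] time, 'loky' ~ time / n_jobs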
OPTION B : B)LOCK CPU instead of enhancing the performance
One may choose a better, less prohibitive backend - "multiprocessing", or in more recent joblib-releases also "loky" - where the GIL-lock no longer causes us trouble ( at a cost of instantiating n_jobs-many python-process replicas, each one having its own internal and unavoidable GIL-lock, now at least not competing against its own sibling threads for grabbing the GIL-lock before executing a time-slice amount of work ), yet the story does not end here.

This option is the typical case where multiple levels of n_jobs-aware processing appear inside the same pipeline - each of them fighting ( at the O/S-scheduler level ) to get a slice of CPU-core time inside a time-sharing run of more processes than there are CPU-cores.

The result? Processes get spawned, yet have to wait in a queue for their turn ( if there are more of them than the number of cores permitted for the user - check not only the number of cores, but also the permitted CPU-core-affinity settings, enforced by the O/S for the given user's/process's effective rights, which on a tightly managed system could be way fewer than the number of physical ( or virtualisation-emulated ) CPU-cores ), losing time, losing CPU-cache pre-fetched blocks of data ( so again and again paying the expensive RAM-fetches ( ~300-350 ns each such time ), instead of re-using the pre-fetched ( and already paid for ) data from the L1/L2/L3-cache at a cost of just about 0.5 ns (!) )
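A back-of-envelope sketch of this oversubscription ( the n_jobs values below mirror the composition shown under OPTION A; the names are illustrative ):

import os

outer_n_jobs = 13                                # RandomizedSearchCV( ..., n_jobs=13 )
inner_n_jobs = 13                                # SGDClassifier( n_jobs=13 ) inside the Pipeline
cores        = os.cpu_count()                    # what the box reports ( affinity may permit even fewer )

print( f"processes fighting: ~{outer_n_jobs * inner_n_jobs}, cores available: {cores}" )
# ~169 process-replicas time-sharing, say, 8 cores - each losing CPU-cache on every forced switch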
OPTION C : C)USTOM setup CPU-core mapping for maximum performance
A good engineering practice is to carefully map CPU-cores for processing.
Given the right backend is in place, one has to decide where the performance bottleneck is - here, most probably ( with a chance for an exception if having an option D) equipped with a huge cluster of strong & fat-RAM machines ), one will prefer to have each and every SGDClassifier.fit() run faster ( spending more n_jobs-specified process-instances on the most expensive sub-task - the training ), rather than having "more" RandomizedSearchCV()-initiated "toys" on the playground, suffocated by lack of RAM and CPU-cache inefficiencies.
Your code will always, even behind the curtain of not knowing all the details, have to "obey" to run not on all of the CPU-cores, but only on those that any such multiprocessing-requested sub-process is permitted to harness, the number of which is not higher than: len( os.sched_getaffinity( 0 ) ). If interested in details, read and use the code here.
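For a quick check of both numbers ( note that os.sched_getaffinity() is POSIX-only; on Windows one would have to fall back to psutil or similar ):

import os

print( os.cpu_count() )                          # cores the machine reports
print( len( os.sched_getaffinity( 0 ) ) )        # cores *this* process is actually permitted to use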
In general, good planning and a profiling practice will help attain the best reasonably-achievable configuration of the n_jobs-instantiated processes' mapping onto the available set of CPU-cores. No magic, but common sense, a process-monitor and benchmarking/timing of run-times will help us polish this competence.
OPTION D : D)ISTRIBUTE workloads across a cluster of Dask-nodes
Where possible, using Dask-module-enabled distributed-computing nodes, one may set:
with parallel_backend( 'dask' ):
grid2.fit( … )
which will harness all the Dask-cluster computing resources for getting the "heavy" task completed smarter than is possible with just the localhost CPU/RAM resources. Ultimately, this is the maximum level of concurrent processing possible inside the python-ecosystem in its current as-is state.
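A minimal connection sketch ( the scheduler address below is hypothetical; a dask.distributed cluster must already be running there, and creating the Client is what registers the 'dask' backend with joblib ):

from dask.distributed import Client
from joblib           import parallel_backend

client = Client( 'tcp://10.0.0.1:8786' )         # hypothetical address of a running dask-scheduler
with parallel_backend( 'dask' ):
    grid2.fit( X, y )                            # tasks now ship to the cluster's workers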
