
This is not a real issue, but I'd like to understand:

  • running sklearn from the Anaconda distribution on a Windows 7 machine with 4 cores and 8 GB of RAM
  • fitting a KMeans model on a table of 200,000 samples × 200 features
  • running with n_jobs=-1 (after adding the if __name__ == '__main__': guard to my script): I see the script start 4 processes with 10 threads each. Each process uses about 25% of the CPU (total: 100%). This seems to work as expected.
  • running with n_jobs=1: it stays in a single process (no surprise), with 20 threads, and also uses 100% of the CPU.

My question: what is the point of using n_jobs (and joblib) if the library uses all cores anyway? Am I missing something? Is this Windows-specific behaviour?
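
A minimal sketch of this kind of setup (data sizes and parameters are illustrative, and it assumes a scikit-learn release old enough that KMeans still accepts an n_jobs argument):

```python
# Illustrative sketch of the setup described above; sizes are arbitrary and
# the n_jobs argument assumes an older scikit-learn where KMeans accepts it.
import numpy as np
from sklearn.cluster import KMeans

if __name__ == '__main__':
    X = np.random.rand(200000, 200)  # ~200,000 samples x 200 features

    # n_jobs=-1: joblib distributes the n_init runs across worker
    # processes (one per core on a 4-core machine).
    KMeans(n_clusters=8, n_init=10, n_jobs=-1).fit(X)

    # n_jobs=1: everything stays in a single process; any multi-core
    # usage seen here would come from a multi-threaded BLAS (e.g. MKL
    # in the Anaconda build of NumPy), not from joblib.
    KMeans(n_clusters=8, n_init=10, n_jobs=1).fit(X)
```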

Bruno Hanzen
    with `n_jobs=1` it uses 100% of the cpu of *one of the cores*. Each process is run in a different core. In linux with 4 cores I can clearly see the cpu usage:`(100%,~5%, ~5%, ~5%)` when I run `n_jobs=1` and `(100%, 100%, 100%, 100%)` when running with `n_jobs=-1`. Each process takes the 100% usage of a given core, but if you have `n_jobs=1` only one core is used. – Imanol Luengo Sep 25 '15 at 13:26
  • Thanks for the reply. In the meantime, I have not been able to reproduce the phenomenon, so I guess it was somehow due to "something" in the state of the machine, or of the notebook. – Bruno Hanzen Oct 03 '15 at 13:57
  • Interestingly, I am seeing that H2O (GBM) runs as a single process and utilizes almost 700% CPU on my 8-core machine. – arun Jan 18 '17 at 01:38
  • @Luengo but it seems OMP_NUM_THREADS can also control the maximum cpu% when using sklearn.linear_model.LassoCV(n_jobs=-1) ... do you know why? (sklearn is not using OpenMP as I know) – kkkobelief24 Aug 23 '19 at 06:33
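
Regarding the OMP_NUM_THREADS observation in the comment above: the threads seen inside each process come from the native BLAS/OpenMP thread pools used by NumPy (and, in newer releases, by some of scikit-learn's compiled code), which are capped by environment variables rather than by n_jobs. A hedged sketch; which variables actually matter depends on how NumPy was built:

```python
# Sketch: environment variables that cap the native thread pools inside each
# process, independently of joblib's n_jobs. Set them before importing NumPy;
# which ones apply depends on the BLAS that NumPy is linked against.
import os
os.environ["OMP_NUM_THREADS"] = "1"       # OpenMP pool (MKL, some compiled sklearn code)
os.environ["MKL_NUM_THREADS"] = "1"       # MKL BLAS (default NumPy in Anaconda)
os.environ["OPENBLAS_NUM_THREADS"] = "1"  # OpenBLAS builds of NumPy

import numpy as np
from sklearn.linear_model import LassoCV

if __name__ == '__main__':
    X = np.random.rand(1000, 50)
    y = np.random.rand(1000)

    # joblib still spreads the CV folds over workers (n_jobs=-1), but each
    # worker inherits the environment and so uses a single native thread.
    LassoCV(n_jobs=-1, cv=5).fit(X, y)
```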

3 Answers

  • What is the point of using n_jobs (and joblib) if the library uses all cores anyway?

It does not. If you set n_jobs to -1, it will use all cores; if it is set to 1 or 2, it will use only one or two cores (tested with scikit-learn 0.20.3 under Linux).
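
A rough way to reproduce that kind of test is to time the same fit with different n_jobs values while watching per-core usage in top/htop (sizes are illustrative; this assumes a release such as 0.20.x where KMeans still accepts n_jobs):

```python
# Reproduction sketch: time KMeans with different n_jobs settings and watch
# per-core CPU usage while each fit runs.
import time
import numpy as np
from sklearn.cluster import KMeans

if __name__ == '__main__':
    X = np.random.rand(100000, 50)
    for n_jobs in (1, 2, -1):
        start = time.time()
        KMeans(n_clusters=8, n_init=10, n_jobs=n_jobs).fit(X)
        print("n_jobs=%s: %.1f s" % (n_jobs, time.time() - start))
```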

kenlukas
Sim

The documentation says:

This parameter is used to specify how many concurrent processes or threads should be used for routines that are parallelized with joblib.

n_jobs is an integer, specifying the maximum number of concurrently running workers. If 1 is given, no joblib parallelism is used at all, which is useful for debugging. If set to -1, all CPUs are used. For n_jobs below -1, (n_cpus + 1 + n_jobs) are used. For example with n_jobs=-2, all CPUs but one are used.

n_jobs is None by default, which means unset; it will generally be interpreted as n_jobs=1, unless the current joblib.Parallel backend context specifies otherwise.

For more details on the use of joblib and its interactions with scikit-learn, please refer to our parallelism notes.
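
A small sketch of the last point, i.e. an estimator left at n_jobs=None picking up the n_jobs of an enclosing joblib backend context (the estimator and data sizes here are only illustrative):

```python
# Sketch: n_jobs is left unset (None) on the estimator, so inside the
# parallel_backend context it is taken from the context (-1 = all CPUs).
import numpy as np
from joblib import parallel_backend
from sklearn.ensemble import RandomForestClassifier

if __name__ == '__main__':
    X = np.random.rand(1000, 20)
    y = np.random.randint(0, 2, size=1000)

    clf = RandomForestClassifier(n_estimators=200)  # n_jobs defaults to None

    with parallel_backend('loky', n_jobs=-1):
        clf.fit(X, y)  # the trees are fitted using all CPUs
```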

Xenon Kfr

You should use either n_jobs or joblib; don't use both simultaneously.

4b0
Monish