I often use `GridSearchCV` for hyperparameter tuning, for example for tuning the regularization parameter `C` in logistic regression. Whenever the estimator I am using has its own `n_jobs` parameter, I am confused about where to set it: in the estimator, in `GridSearchCV`, or in both? The same question applies to `cross_validate`.
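To make the question concrete, here is a minimal sketch (dataset, grid values, and `max_iter` are just placeholders) showing the two places where `n_jobs` can go:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=1000, random_state=0)

# n_jobs exists on the estimator itself...
model = LogisticRegression(n_jobs=-1, max_iter=1000)   # ...here?

# ...and on the search object.
search = GridSearchCV(model, {"C": [0.1, 1, 10]}, n_jobs=-1)  # ...or here?
search.fit(X, y)
```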
- @Anwarvic this is not what is asked; the question is about a limited number of models which indeed include an `n_jobs` argument, and how it is to be used in conjunction with `GridSearchCV`, which has its own, too. – desertnaut May 27 '20 at 11:43
- I understand. The question is about when the estimator does have this parameter, and so does `GridSearchCV`. Which one should I choose? – Aramakus May 27 '20 at 11:43
- The only time I did a [rough experiment](https://stackoverflow.com/questions/61687800/retrieving-specific-classifiers-and-data-from-gridsearchcv/61689871#61689871) with k-nn, it turned out that setting this for `GridSearchCV` only resulted in a *much* faster process. I guess it makes some sense, but I don't have the time to elaborate (and that's why I post this just as a comment). – desertnaut May 27 '20 at 11:49
1 Answer
This is a very interesting question. I don't have a definitive answer, but here are some elements worth mentioning to understand the issue; they don't fit in a comment.
Let's start with when you should, or should not, use multiprocessing:

- Multiprocessing is useful for independent tasks. This is the case in a grid search, where all the different variations of your model are trained independently of one another.
- Multiprocessing is not useful, and can make things slower, when:
  - Tasks are too small: creating a new process takes time, and if each task is very small, this overhead slows down the execution of the whole code (see the sketch right after this list).
  - Too many processes are spawned: your computer has a limited number of cores. If you have more processes than cores, a load-balancing mechanism will force the computer to regularly switch between the running processes. These switches take some time, resulting in slower execution.
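As a quick demonstration of the first point, here is a small sketch using `joblib` (the library scikit-learn itself uses for parallelism); the task is deliberately tiny, so the parallel version usually loses to the plain loop because of process start-up and communication overhead:

```python
from time import perf_counter
from joblib import Parallel, delayed

def tiny_task(x):
    # Far too little work to amortize the cost of dispatching to a worker
    return x * x

t0 = perf_counter()
sequential = [tiny_task(i) for i in range(50_000)]
print(f"sequential: {perf_counter() - t0:.2f}s")

t0 = perf_counter()
parallel = Parallel(n_jobs=-1)(delayed(tiny_task)(i) for i in range(50_000))
print(f"parallel:   {perf_counter() - t0:.2f}s")  # typically slower here
```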
The first takeaway is that you should not set `n_jobs` in both `GridSearchCV` and the model you're optimizing, because you will spawn a lot of processes and end up slowing down the execution.
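For instance (a minimal sketch; the dataset and grid are placeholders), with both levels parallelized, each of the search's worker processes would itself try to claim one process per core:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

# Avoid this: parallelism at both levels oversubscribes the CPU.
search = GridSearchCV(
    RandomForestClassifier(n_jobs=-1),   # one worker per core...
    {"n_estimators": [100, 300]},
    n_jobs=-1,                           # ...times one per core again
    cv=5,
)
search.fit(X, y)
```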
Now, a lot of sklearn models and functions are built on NumPy/SciPy, which in turn are usually implemented in C/Fortran and thus may already use parallelism internally (typically multithreaded linear-algebra routines). That means these should not be used with `n_jobs > 1` set in `GridSearchCV`.
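One way to keep the process-level parallelism of `GridSearchCV` while preventing those native libraries from starting their own thread pools is to cap their thread counts. This is a sketch of one such approach, not part of the original answer; it relies on the standard BLAS/OpenMP environment variables, which must be set before NumPy is first imported:

```python
import os

# Cap the native thread pools before NumPy/SciPy are imported; child
# processes spawned by GridSearchCV inherit these environment variables.
os.environ["OMP_NUM_THREADS"] = "1"
os.environ["OPENBLAS_NUM_THREADS"] = "1"
os.environ["MKL_NUM_THREADS"] = "1"

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.1, 1, 10]}, n_jobs=-1, cv=5)
search.fit(X, y)  # process-level parallelism only, no nested BLAS threads
```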
Assuming your model is not already parallelized, you can choose to set `n_jobs` at the model level or at the `GridSearchCV` level. A few models can be fully parallelized (`RandomForest` for instance), but most have at least some sequential part (boosting, for instance, where each new learner depends on the previous ones). On the other hand, `GridSearchCV` has no sequential component by design, so it makes sense to set `n_jobs` in `GridSearchCV` rather than in the model.
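One way to check this for your own model and data is simply to time both placements. A rough benchmarking sketch (the dataset, grid, and forest size are placeholders):

```python
from time import perf_counter
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)
grid = {"max_depth": [4, 8, None]}

# Parallelize at the model level only, then at the search level only
for model_jobs, search_jobs in [(-1, 1), (1, -1)]:
    search = GridSearchCV(
        RandomForestClassifier(n_estimators=200, n_jobs=model_jobs,
                               random_state=0),
        grid, n_jobs=search_jobs, cv=5,
    )
    t0 = perf_counter()
    search.fit(X, y)
    print(f"model n_jobs={model_jobs:>2}, search n_jobs={search_jobs:>2}: "
          f"{perf_counter() - t0:.1f}s")
```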
That being said, it depends on the implementation of the model, and you can't have a definitive answer without testing your particular case yourself. For example, if your pipeline consumes a lot of memory for some reason, setting `n_jobs` in `GridSearchCV` may cause memory issues, since each concurrent fit then holds its own copy of the data.
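If memory is the concern, `GridSearchCV` also exposes a `pre_dispatch` argument that caps how many jobs are dispatched at once, which the sklearn docs recommend for limiting memory consumption. A minimal sketch:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=2000, random_state=0)

# pre_dispatch="n_jobs" dispatches only as many jobs as there are workers,
# instead of the default "2*n_jobs", trading a little speed for memory.
search = GridSearchCV(LogisticRegression(max_iter=1000),
                      {"C": [0.01, 0.1, 1, 10]},
                      n_jobs=4, pre_dispatch="n_jobs", cv=5)
search.fit(X, y)
```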
As a complement, there is a very interesting note on parallelism in the sklearn documentation.

- But what if, without `n_jobs=-1` in RF, it takes a lot of time to run? In that case, would setting it in both places create a problem? – yogesh agrawal Nov 17 '22 at 17:27