Is there a quicker way of running GridSearchCV?

Question

I'm optimizing some parameters for an SVC in sklearn, and the biggest issue here is having to wait 30 minutes before I can try out any other parameter ranges. Worse is the fact that I'd like to try more values for C and gamma within the same range (so I can create a smoother surface plot), but I know that it will just take longer and longer... When I ran it today I changed the cache_size from 200 to 600 (without really knowing what it does) to see if it made a difference. The time decreased by about a minute.

Is this something I can speed up? Or am I just gonna have to deal with a very long run time?

clf = svm.SVC(kernel="rbf" , probability = True, cache_size = 600)

gamma_range = [1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1]
c_range = [1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5]
param_grid = dict(gamma = gamma_range, C = c_range)

grid = GridSearchCV(clf, param_grid, cv= 10, scoring="accuracy")
%time grid.fit(X_norm, y)

returns:

Wall time: 32min 59s

GridSearchCV(cv=10, error_score='raise',
   estimator=SVC(C=1.0, cache_size=600, class_weight=None, coef0=0.0, degree=3, gamma=0.0,
kernel='rbf', max_iter=-1, probability=True, random_state=None,
shrinking=True, tol=0.001, verbose=False),
   fit_params={}, iid=True, loss_func=None, n_jobs=1,
   param_grid={'C': [0.001, 0.01, 0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0, 100000.0], 'gamma': [1e-07, 1e-06, 1e-05, 0.0001, 0.001, 0.01, 0.1, 1.0, 10.0]},
   pre_dispatch='2*n_jobs', refit=True, score_func=None,
   scoring='accuracy', verbose=0)

score 31 · Accepted Answer · answered Feb 26 '16 at 16:04

A few things:

10-fold CV is overkill and causes you to fit 10 models for each parameter group. You can get an instant 2-3x speedup by switching to 5- or 3-fold CV (i.e., cv=3 in the GridSearchCV call) without any meaningful difference in performance estimation.
Try fewer parameter options at each round. With 9x9 combinations, you're trying 81 different combinations on each run. Typically, you'll find better performance at one end of the scale or the other, so maybe start with a coarse grid of 3-4 options, and then go finer as you start to identify the area that's more interesting for your data. 3x3 options means a 9x speedup vs. what you're doing now.
You can get an easy speedup by setting n_jobs to 2 or more in your GridSearchCV call so that several models are fitted at once. Depending on the size of your data, you may not be able to raise it very far (each job needs its own memory), and you won't see any improvement past the number of cores on your machine, but you can probably trim a bit of time that way.
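The three suggestions above can be combined into a coarse-then-fine search. The following is only a minimal sketch: the synthetic data stands in for X_norm and y, the helper n_fits is a hypothetical name introduced here for the arithmetic, and the choice of refining one decade either side of the coarse winner is an illustrative assumption, not something the answer prescribes.

```python
import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with your own X_norm, y.
X, y = datasets.make_classification(n_samples=300, random_state=0)


def n_fits(param_grid, cv):
    """Number of model fits a grid search performs, excluding the final refit."""
    return int(np.prod([len(v) for v in param_grid.values()])) * cv


# Round 1: coarse 3x3 grid, 3-fold CV, one job per core.
coarse_grid = {"C": [1e-3, 1e0, 1e3], "gamma": [1e-6, 1e-3, 1e0]}
coarse = GridSearchCV(svm.SVC(kernel="rbf"), coarse_grid, cv=3,
                      scoring="accuracy", n_jobs=-1)
coarse.fit(X, y)

# Round 2: finer 5x5 grid spanning one decade either side of the coarse winner.
best_C = coarse.best_params_["C"]
best_gamma = coarse.best_params_["gamma"]
fine_grid = {
    "C": list(best_C * np.logspace(-1, 1, 5)),
    "gamma": list(best_gamma * np.logspace(-1, 1, 5)),
}
fine = GridSearchCV(svm.SVC(kernel="rbf"), fine_grid, cv=3,
                    scoring="accuracy", n_jobs=-1)
fine.fit(X, y)

# Fit counts: the original 9x9 grid with 10 folds versus the two rounds above.
original_fits = n_fits({"C": range(9), "gamma": range(9)}, cv=10)  # 9 * 9 * 10 = 810
two_round_fits = n_fits(coarse_grid, 3) + n_fits(fine_grid, 3)     # 27 + 75 = 102
print(original_fits, two_round_fits, fine.best_params_)
```

The fit count is only a rough proxy for run time, since SVC fits with large C or on larger folds take longer than others, but it shows where most of the saving comes from.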

Setting n_jobs to -1 will create 1 job per core automatically. Depending on your model, memory might then become an issue, but usually not! - Ken Syme, Nov 01 '17 at 17:26

score 8 · Answer 2 · edited Dec 26 '21 at 05:24

You could also set probability=False in the SVC estimator to avoid the expensive Platt scaling, which SVC otherwise performs through an internal cross-validation on every fit. (If being able to call predict_proba is crucial, run GridSearchCV with refit=False and, after picking the best parameter set by cross-validated score, retrain that estimator with probability=True on the whole training set.)
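A minimal sketch of that two-step approach, assuming synthetic data and a small illustrative grid (both are placeholders, not the asker's values):

```python
from sklearn import datasets, svm
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption); replace with your own X_norm, y.
X, y = datasets.make_classification(n_samples=300, random_state=0)

# Small illustrative grid (assumption).
param_grid = {"C": [1e-1, 1e0, 1e1], "gamma": [1e-3, 1e-2, 1e-1]}

# Search without probability estimates, so no fit pays for calibration.
# refit=False skips the automatic refit of the best candidate.
search = GridSearchCV(svm.SVC(kernel="rbf", probability=False),
                      param_grid, cv=3, scoring="accuracy", refit=False)
search.fit(X, y)

# Pay the calibration cost once, for the winning parameters only.
final_clf = svm.SVC(kernel="rbf", probability=True, random_state=0,
                    **search.best_params_)
final_clf.fit(X, y)

proba = final_clf.predict_proba(X[:5])
print(search.best_params_, proba.shape)
```

With refit=False the search object has no best_estimator_, so the final model has to be built by hand from best_params_ as shown.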

Another step would be to use RandomizedSearchCV instead of GridSearchCV, which lets you reach better model quality in roughly the same time (the budget is controlled by the n_iter parameter).
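As a sketch of how n_iter fixes the budget, the example below samples C and gamma from continuous log-uniform distributions over the same ranges as the question's grid. The use of scipy.stats.loguniform, the value n_iter=20, and the synthetic data are assumptions made for illustration.

```python
from scipy.stats import loguniform
from sklearn import datasets, svm
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in data (assumption); replace with your own X_norm, y.
X, y = datasets.make_classification(n_samples=300, random_state=0)

# Continuous log-uniform distributions over the ranges used in the question.
param_distributions = {
    "C": loguniform(1e-3, 1e5),
    "gamma": loguniform(1e-7, 1e1),
}

# n_iter is the number of sampled candidates: 20 candidates x 3 folds = 60 fits,
# regardless of how wide the ranges are.
random_search = RandomizedSearchCV(
    svm.SVC(kernel="rbf"), param_distributions,
    n_iter=20, cv=3, scoring="accuracy", random_state=0)
random_search.fit(X, y)

print(len(random_search.cv_results_["params"]), random_search.best_params_)
```

Because the sampled points do not lie on a regular grid, this is less convenient for a surface plot than GridSearchCV; it is mainly useful for locating the interesting region quickly.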

And, as already mentioned, use n_jobs=-1

score 4 · Answer 3 · answered Feb 14 '21 at 21:39

Adding to the other answers (like not using 10-fold CV and using fewer parameter options each round), there are other ways you can speed up your model.

Parallelize your code

Randy mentioned that you can use n_jobs to parallelize your training (the useful limit is the number of cores on your computer). The only difference in the code below is that it uses n_jobs = -1, which creates 1 job per core automatically. So if you have 4 cores, it will try to utilize all 4 cores. The code below was run on an 8-core computer. It took 18.3 seconds with n_jobs = -1 on my computer, as opposed to 2 minutes 17 seconds without.

import numpy as np
from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
rng = np.random.RandomState(0)
X, y = datasets.make_classification(n_samples=1000, random_state=rng)


clf = svm.SVC(kernel="rbf" , probability = True, cache_size = 600)

gamma_range = [1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1]
c_range = [1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5]
param_grid = dict(gamma = gamma_range, C = c_range)

grid = GridSearchCV(clf, param_grid, cv= 10, scoring="accuracy", n_jobs = -1)
%time grid.fit(X, y)

Note that if you have access to a cluster, you can distribute your training with Dask or Ray.

Different Hyperparameter Optimization Techniques

Your code uses GridSearchCV, which is an exhaustive search over the specified parameter values for an estimator. Scikit-Learn also has RandomizedSearchCV, which samples a given number of candidates from a parameter space with a specified distribution. With the default n_iter=10 it evaluates only 10 of the 81 combinations, which accounts for most of the time saved. Using randomized search for the code example below took 3.35 seconds.

import numpy as np

from sklearn import svm
from sklearn import datasets
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X, y = datasets.make_classification(n_samples=1000, random_state=rng)


clf = svm.SVC(kernel="rbf" , probability = True, cache_size = 600)

gamma_range = [1e-7,1e-6,1e-5,1e-4,1e-3,1e-2,1e-1,1e0,1e1]
c_range = [1e-3,1e-2,1e-1,1e0,1e1,1e2,1e3,1e4,1e5]
param_grid = dict(gamma = gamma_range, C = c_range)

grid = RandomizedSearchCV(clf, param_grid, cv= 10, scoring="accuracy", n_jobs = -1)
%time grid.fit(X, y)

Image from documentation.

Recently (scikit-learn 0.24.1, January 2021), scikit-learn added the experimental hyperparameter search estimators halving grid search (HalvingGridSearchCV) and halving random search (HalvingRandomSearchCV). These techniques search the parameter space using successive halving. As the image from the documentation illustrates, all hyperparameter candidates are evaluated with a small amount of resources at the first iteration, and the more promising candidates are selected and given more resources during each successive iteration. You can use them by upgrading your scikit-learn (pip install --upgrade scikit-learn).
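A minimal sketch of halving grid search on the same 9x9 grid follows. Because the estimator is experimental, it has to be enabled with an explicit import from sklearn.experimental first. The values factor=3, cv=3 and min_resources=100 are illustrative assumptions; min_resources is set here only so that the first round does not train on a handful of samples.

```python
# The halving estimators are experimental and must be enabled explicitly.
from sklearn.experimental import enable_halving_search_cv  # noqa: F401
from sklearn import datasets, svm
from sklearn.model_selection import HalvingGridSearchCV

X, y = datasets.make_classification(n_samples=1000, random_state=0)

gamma_range = [1e-7, 1e-6, 1e-5, 1e-4, 1e-3, 1e-2, 1e-1, 1e0, 1e1]
c_range = [1e-3, 1e-2, 1e-1, 1e0, 1e1, 1e2, 1e3, 1e4, 1e5]
param_grid = dict(gamma=gamma_range, C=c_range)

# Each iteration keeps roughly the best 1/factor of the candidates and gives
# the survivors factor times more samples (the default resource is n_samples).
halving = HalvingGridSearchCV(
    svm.SVC(kernel="rbf"), param_grid,
    factor=3, min_resources=100, cv=3,
    scoring="accuracy", random_state=0)
halving.fit(X, y)

# Candidates and samples used per iteration, then the selected parameters.
print(halving.n_candidates_, halving.n_resources_, halving.best_params_)
```

Only the candidates that survive to the last iteration are scored on most of the data, so the result is not a full score surface over the grid.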
