
I'm running a grid search to optimize my model, but it's taking far too long to execute. My total dataset is only about 15,000 observations with about 30-40 variables. I was able to run a random forest through the grid search in about an hour and a half, but now that I've switched to SVC it has already run for over 9 hours and is still not complete. Below is a sample of my code for the cross-validation:

from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

SVM_Classifier = SVC(random_state=7)

param_grid = {'C': [0.1, 1, 10, 100],
              'gamma': [1,0.1,0.01,0.001],
              'kernel': ['linear', 'poly', 'rbf', 'sigmoid'],
              'degree' : [0, 1, 2, 3, 4, 5, 6]}

grid_obj = GridSearchCV(SVM_Classifier,
                        param_grid=param_grid,
                        scoring='roc_auc',
                        cv=3,
                        return_train_score=True,
                        n_jobs=-1)

grid_fit = grid_obj.fit(X_train, y_train)
SVMC_opt = grid_fit.best_estimator_

print('='*20)
print("best estimator: " + str(grid_obj.best_estimator_))
print("best params: " + str(grid_obj.best_params_))
print('best score:', grid_obj.best_score_)
print('='*20)

I have already reduced the cross-validation from 10 folds to 3, and I'm using n_jobs=-1 so all of my cores are engaged. Is there anything else I can do to speed up the process?

Benjamin Diaz
  • In my experience, grid search will sadly always take a long time, but there are ways to speed it up. – Flow May 03 '22 at 14:54
  • https://stackoverflow.com/questions/35655701/is-there-a-quicker-way-of-running-gridsearchcv – Flow May 03 '22 at 14:56
  • Thanks, but I've already referred to this post, which is why I set n_jobs to -1 and reduced cv from 10 to 3. I'm not sure there's more I can do short of setting up an EC2 instance and running it in the cloud. I'm trying to keep it local. – Benjamin Diaz May 03 '22 at 15:02
  • Ok, I understand. – Flow May 03 '22 at 15:27

2 Answers

3

Unfortunately, SVC's fit time scales at least quadratically with the number of samples, so it is indeed extremely slow. Even the documentation recommends considering LinearSVC for large datasets, and with about 15,000 samples you are approaching that range.
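A minimal sketch of the LinearSVC route, keeping the asker's roc_auc scoring and cv=3. The synthetic data stands in for X_train and y_train, and the StandardScaler step, the C values, dual=False and max_iter are assumptions of mine rather than anything from the question:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# Synthetic stand-in for the asker's data (assumption: binary target).
X_train, y_train = make_classification(
    n_samples=2000, n_features=30, random_state=7
)

# Scaling usually helps the linear solver converge;
# dual=False is the usual choice when n_samples > n_features.
linear_pipe = make_pipeline(
    StandardScaler(),
    LinearSVC(dual=False, random_state=7, max_iter=5000),
)

# make_pipeline names the step "linearsvc", hence the "linearsvc__C" key.
linear_grid = GridSearchCV(
    linear_pipe,
    param_grid={"linearsvc__C": [0.1, 1, 10, 100]},
    scoring="roc_auc",  # works because LinearSVC exposes decision_function
    cv=3,
)
linear_grid.fit(X_train, y_train)

print("best params:", linear_grid.best_params_)
print("best score:", linear_grid.best_score_)
```

Note that this only covers the linear kernel; it says nothing about how poly, rbf or sigmoid would score on your data.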

You could also try increasing the kernel cache_size (given in MB; the default is 200). I would suggest timing a single SVC fit with different cache sizes to see whether you gain anything.
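The timing experiment could look roughly like the sketch below. The synthetic data, the chosen kernel parameters and the list of cache sizes are placeholders I picked for illustration; substitute your own training data and a parameter combination from your grid:

```python
import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

# Synthetic stand-in; replace with your own X_train, y_train.
X, y = make_classification(n_samples=2000, n_features=30, random_state=7)

timings = {}
for cache_mb in (200, 1000, 2000):  # 200 MB is the scikit-learn default
    clf = SVC(kernel="rbf", C=1, gamma=0.01,
              cache_size=cache_mb, random_state=7)
    start = time.perf_counter()
    clf.fit(X, y)
    timings[cache_mb] = time.perf_counter() - start

for cache_mb, seconds in timings.items():
    print(f"cache_size={cache_mb} MB: {seconds:.2f} s")
```

On a dataset this small the differences may well be noise, so run it on the full 15,000 rows before drawing conclusions.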

EDIT: By the way, you are needlessly computing a lot of SVC fits with different degree values, because degree is ignored by every kernel except poly. I suggest splitting the search into one grid for poly and another for the remaining kernels; you will save a lot of time.
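GridSearchCV accepts a list of dictionaries as param_grid, so the split can live in a single search. The sketch below reuses the values from the question and counts the candidates with ParameterGrid. Leaving gamma out of the linear grid is my own addition (the linear kernel does not use gamma), and the variable names are illustrative:

```python
from sklearn.model_selection import GridSearchCV, ParameterGrid
from sklearn.svm import SVC

C_values = [0.1, 1, 10, 100]
gamma_values = [1, 0.1, 0.01, 0.001]
degree_values = [0, 1, 2, 3, 4, 5, 6]  # copied from the question

# Original grid: every kernel is crossed with every degree and gamma.
full_grid = {
    "C": C_values,
    "gamma": gamma_values,
    "kernel": ["linear", "poly", "rbf", "sigmoid"],
    "degree": degree_values,
}

# Split grid: each kernel only sees the parameters it actually uses.
split_grid = [
    {"kernel": ["linear"], "C": C_values},
    {"kernel": ["rbf", "sigmoid"], "C": C_values, "gamma": gamma_values},
    {"kernel": ["poly"], "C": C_values, "gamma": gamma_values,
     "degree": degree_values},
]

print("candidates in full grid: ", len(ParameterGrid(full_grid)))
print("candidates in split grid:", len(ParameterGrid(split_grid)))

# The list of dicts goes straight into GridSearchCV (not fitted here).
grid_obj = GridSearchCV(
    SVC(random_state=7),
    param_grid=split_grid,
    scoring="roc_auc",
    cv=3,
    n_jobs=-1,
)
```

With these values the full grid has 4 x 4 x 4 x 7 = 448 candidates and the split grid has 4 + 32 + 112 = 148, so with cv=3 the number of fits drops from 1,344 to 444.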

user2246849
  • Would RandomizedSearchCV be a better method for this since the complexity of the algorithm is exponential? – Benjamin Diaz May 03 '22 at 15:30
  • That's also an option, yes. But in any case, make sure you are not optimizing hyperparameters that some kernels in the param_grid ignore. If you are, split the search into multiple grid searches. – user2246849 May 03 '22 at 15:37
  • And also, I would time a single fit just to have an idea of how long it takes and to try different kernel caches – user2246849 May 03 '22 at 15:38
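For the RandomizedSearchCV option raised in the comments above, a minimal sketch might look like this. The synthetic data, the restriction to the rbf kernel, the log-uniform ranges (chosen to span the values in the question) and n_iter=8 are all assumptions for illustration:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the asker's data.
X_train, y_train = make_classification(
    n_samples=1000, n_features=30, random_state=7
)

# Sample C and gamma on a log scale instead of enumerating a grid.
param_distributions = {
    "kernel": ["rbf"],
    "C": loguniform(1e-1, 1e2),
    "gamma": loguniform(1e-3, 1e0),
}

random_search = RandomizedSearchCV(
    SVC(random_state=7),
    param_distributions=param_distributions,
    n_iter=8,  # total candidates tried, regardless of the search space size
    scoring="roc_auc",
    cv=3,
    random_state=7,
)
random_search.fit(X_train, y_train)

print("best params:", random_search.best_params_)
print("best score:", random_search.best_score_)
```

The cost is fixed by n_iter times cv (24 fits here), and n_jobs=-1 can be passed exactly as in the question's GridSearchCV call.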
2

While exploring LinearSVC might be a good choice (and you should clean up the parameter combinations as noted in the other answer), you could also use the GPU-accelerated SVC estimator in RAPIDS cuML on a GPU-enabled cloud instance of your choice (or locally if you have an NVIDIA GPU). This estimator can be dropped directly into your GridSearchCV call if you use the default n_jobs=1. (Disclaimer: I work on this project.)

For example, I ran the following on my local machine [0]:

import sklearn.datasets
import cuml
from sklearn.svm import SVC

X, y = sklearn.datasets.make_classification(n_samples=15000, n_features=30)
%timeit _ = SVC().fit(X, y).predict(X)
%timeit _ = cuml.svm.SVC().fit(X, y).predict(X)
8.68 s ± 64.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
366 ms ± 1.26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[0] System

  • CPU: Intel(R) Xeon(R) Gold 6128 CPU @ 3.40GHz, CPU(s): 12
  • GPU: Quadro RTX 8000
Nick Becker