
I am trying to evaluate an SVM on a huge dataset of ~0.3 million (300,000) records. This is a multiclass problem with 23 features. Currently GridSearchCV takes ages to iterate over the parameters. Is there any strategy to speed this up? I would think 0.3 million records is a reasonable size, and I am perplexed that CPU usage doesn't go beyond 30% and RAM usage stays at about 50%. I have n_jobs set to -1 and pre_dispatch=1 as suggested in the documentation, but nothing changes. With my inputs I am expecting a total of 48 parameter combinations (4 values of C × 3 kernels × 4 degrees), each fit once per cross-validation fold; a quick way to verify this is shown after the code. Here is my sample code:

from sklearn.multiclass import OneVsRestClassifier
from sklearn.grid_search import GridSearchCV
from sklearn import svm

model_to_set = OneVsRestClassifier(svm.SVC())

parameters = {
    "estimator__C": [1, 2, 4, 8],
    "estimator__kernel": ["poly", "rbf", "linear"],
    "estimator__degree": [1, 2, 3, 4],
}

# n_jobs=-1 should use all cores; pre_dispatch controls how many jobs are
# queued ahead of the workers. Note that for a multiclass target, 'f1' may
# need an averaged variant such as 'f1_macro'.
model_tuning = GridSearchCV(model_to_set, param_grid=parameters,
                            n_jobs=-1, pre_dispatch=1,
                            scoring='f1')

# mat is assumed to hold the labels in row 0 and the feature rows below it
model_tuning.fit(mat[1:23], mat[0])
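
As a sanity check on the grid size, a minimal sketch using sklearn's ParameterGrid (in `sklearn.grid_search` in this version; moved to `sklearn.model_selection` in later releases), reusing the `parameters` dict from the snippet above:

from sklearn.grid_search import ParameterGrid

# 4 C values * 3 kernels * 4 degrees = 48 candidate settings;
# GridSearchCV fits each one once per cross-validation fold.
print(len(ParameterGrid(parameters)))  # -> 48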

Appreciate any help.

Fremzy
  • *"the CPU usage doesn't go beyond 30%"* - is your Python process using multiple cores? – ali_m Mar 09 '16 at 20:21
  • Yes, wouldn't n_jobs=-1 take care of that? Or is there anything else I need to do for this? – Fremzy Mar 10 '16 at 22:47
  • Have you actually watched your CPU utilisation while the process is running, e.g. using `htop`? It's possible for certain modules to interfere with CPU affinity (e.g. http://stackoverflow.com/a/15641148/1461210); a quick way to check this is sketched after this thread. – ali_m Mar 10 '16 at 22:53
  • When I run the same code with a random forest classifier, my utilization goes to ~100% as I can see in Task Manager, so the place I am looking at shouldn't be the issue. – Fremzy Mar 11 '16 at 03:15
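
A minimal cross-platform sketch for checking whether something has narrowed the process's CPU affinity, as raised in the comment above; it assumes the third-party psutil package (the linked answer uses taskset instead):

import psutil

p = psutil.Process()
print(p.cpu_affinity())                          # cores this process may run on
p.cpu_affinity(list(range(psutil.cpu_count())))  # re-enable all cores if narrowed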

0 Answers