How to parallelise .predict() method of a scikit-learn SVM (SVC) Classifier?

Question

I have a scikit-learn SVC classifier instance that has already been trained with .fit(), and I need to call .predict() on a large number of instances.

Is there a way to parallelise only this .predict() call using any of scikit-learn's built-in tools?

from sklearn import svm

data_train = [[0,2,3],[1,2,3],[4,2,3]]
targets_train = [0,1,0]

clf = svm.SVC(kernel='rbf', degree=3, C=10, gamma=0.3, probability=True)
clf.fit(data_train, targets_train)

# this can be very large (~ a million records)
to_be_predicted = [[1,3,4]]
clf.predict(to_be_predicted)

If somebody does know a solution, I will be more than happy if you could share it.

score 6 · Answer 1 · answered Jun 09 '20 at 13:42

A working example, adapted from the other answer:

import numpy as np
from joblib import Parallel, delayed
from sklearn import svm

data_train = [[0,2,3],[1,2,3],[4,2,3]]
targets_train = [0,1,0]

clf = svm.SVC(kernel='rbf', degree=3, C=10, gamma=0.3, probability=True)
clf.fit(data_train, targets_train)

to_be_predicted = np.array([[1,3,4], [1,3,4], [1,3,5]])
clf.predict(to_be_predicted)  # serial baseline

n_cores = 3

# one job per row: row i is reshaped to (1, 3) and predicted on its own,
# so only the first n_cores rows are covered
parallel = Parallel(n_jobs=n_cores)
results = parallel(delayed(clf.predict)(to_be_predicted[i].reshape(-1,3))
    for i in range(n_cores))

# stack the per-job outputs and flatten back to a 1-D array of labels
np.vstack(results).flatten()

array([1, 1, 0])

Jonathan

Thanks, I found this example very helpful. I think it doesn't quite answer the question though as to_be_predicted[i] would only calculate three entries, fine for your example, but the OP will have a list millions long and as written this solution only calculates the first three entries... – ClimateUnboxed Dec 16 '20 at 21:30
...or did I misunderstand? – ClimateUnboxed Dec 16 '20 at 21:36
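
One way to address the point raised in the comments is to split the whole array into chunks rather than indexing single rows. The following is only a minimal sketch under some assumptions: the synthetic `to_be_predicted` array is a hypothetical stand-in for the real data, `probability=True` is left out because only `.predict()` is called, and `np.array_split` is swapped in for the manual indexing so that every row is covered even when the row count does not divide evenly by `n_cores`.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn import svm

# Toy training data from the question
data_train = [[0, 2, 3], [1, 2, 3], [4, 2, 3]]
targets_train = [0, 1, 0]

clf = svm.SVC(kernel='rbf', C=10, gamma=0.3)
clf.fit(data_train, targets_train)

# Hypothetical stand-in for a large prediction set
rng = np.random.default_rng(0)
to_be_predicted = rng.integers(0, 5, size=(1000, 3))

n_cores = 3

# array_split accepts row counts that do not divide evenly by n_cores
chunks = np.array_split(to_be_predicted, n_cores)

# one predict call per chunk, each in its own job
results = Parallel(n_jobs=n_cores)(
    delayed(clf.predict)(chunk) for chunk in chunks
)

# predict returns 1-D arrays; concatenating them restores the row order
predictions = np.concatenate(results)
```

Whether this is actually faster than a single `.predict()` call depends on the data size and on the cost of shipping the model and the chunks to the worker processes, so it is worth timing both on real data.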

score 5 · Answer 2 · answered Jul 16 '15 at 18:20

This may be buggy, but something like this should do the trick. Basically, break your data into blocks and run your model on each block separately in a joblib.Parallel loop.

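import numpy as np
from joblib import Parallel, delayed  # sklearn.externals.joblib has been removed from scikit-learn

n_cores = 2
n_samples = to_be_predicted.shape[0]

# (start, stop) row indices of each block; // keeps them integers
slices = [
    (n_samples * i // n_cores, n_samples * (i + 1) // n_cores)
    for i in range(n_cores)
]

# predict each block in its own job, then join the 1-D outputs in order
results = np.concatenate(Parallel(n_jobs=n_cores)(
    delayed(clf.predict)(to_be_predicted[start:stop])
    for start, stop in slices
))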

The `results` line doesn't run on my machine. – Ladenkov Vladislav Aug 31 '17 at 16:04
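
The block boundaries in this answer have to be integers to work as slice indices. In Python 3, `/` always returns a float, and a float slice index raises a `TypeError`. That may be one reason the line fails, though this is an assumption because the commenter gave no error message. A small sketch of computing the boundaries with integer division, using a hypothetical helper named `block_bounds`:

```python
def block_bounds(n_samples, n_blocks):
    """Return contiguous (start, stop) index pairs covering range(n_samples).

    Hypothetical helper for illustration; not part of scikit-learn or joblib.
    """
    return [
        (n_samples * i // n_blocks, n_samples * (i + 1) // n_blocks)
        for i in range(n_blocks)
    ]


bounds = block_bounds(10, 3)
print(bounds)  # [(0, 3), (3, 6), (6, 10)]
```

Because each stop equals the next start, the blocks cover every row exactly once, and the last block absorbs the remainder when the count does not divide evenly.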
