51

I tried to use an SVM classifier to train a dataset with about 100k samples, but I found it to be extremely slow; even after two hours there was no response. When the dataset has around 1k samples, I get the result immediately. I also tried SGDClassifier and naive Bayes, which are quite fast and gave me results within a couple of minutes. Could you explain this phenomenon?

– C. Gary
  • See also: [SVM using scikit learn runs endlessly and never completes execution](https://datascience.stackexchange.com/q/989/8820) – Martin Thoma Nov 24 '18 at 08:01

2 Answers

72

General remarks about SVM-learning

SVM training with nonlinear kernels, which is the default in sklearn's SVC, is complexity-wise approximately O(n_samples^2 * n_features) (an approximation given by one of sklearn's devs in answer to a related question). This applies to the SMO algorithm used within libsvm, which is the core solver in sklearn for this type of problem.

This changes a lot when no kernel is used, i.e. with sklearn.svm.LinearSVC (based on liblinear) or sklearn.linear_model.SGDClassifier, both of which scale roughly linearly with the number of samples.

So we can do some math to approximate the time-difference between 1k and 100k samples:

1k:   1,000^2   = 1,000,000 steps = Time X
100k: 100,000^2 = 10,000,000,000 steps = Time X * 10,000 !!!

This is only an approximation, and the real behavior can be better or worse (e.g. by setting the cache size, trading memory for speed gains)!
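
To see this scaling for yourself, here is a minimal sketch (synthetic data via make_classification; the sample sizes are illustrative assumptions, and absolute times depend on your machine) comparing the kernelized SVC against the linear-time solvers:

import time

from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.svm import SVC, LinearSVC

# fit each classifier on growing sample sizes; SVC's fit time should grow
# roughly quadratically, the linear solvers roughly linearly
for n_samples in (1_000, 5_000, 10_000):
    X, y = make_classification(n_samples=n_samples, n_features=20, random_state=0)
    for clf in (SVC(), LinearSVC(), SGDClassifier()):
        start = time.perf_counter()
        clf.fit(X, y)
        name = clf.__class__.__name__
        print(f"{name:>13}  n={n_samples:>6}: {time.perf_counter() - start:.2f}s")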

Scikit-learn specific remarks

The situation could also be much more complex because of all the nice stuff scikit-learn is doing for us behind the scenes. The above is valid for the classic 2-class SVM. If you are by any chance trying to learn from multi-class data, scikit-learn will automatically wrap the binary SVM in a multi-class scheme (one-vs-one for SVC, one-vs-rest for LinearSVC), as the core SVM algorithm does not support multi-class problems directly. Read up on scikit-learn's docs to understand this part.
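
As a hedged illustration (hypothetical 4-class toy data; make_classification and the class count are assumptions made for the demo), you can count the extra binary problems scikit-learn creates behind the scenes:

from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

X, y = make_classification(n_samples=1_000, n_features=20, n_informative=10,
                           n_classes=4, random_state=0)

# SVC (libsvm) is one-vs-one: n_classes * (n_classes - 1) / 2 binary SVMs
svc = SVC(decision_function_shape="ovo").fit(X, y)
print(svc.decision_function(X[:1]).shape)  # (1, 6): six pairwise classifiers

# LinearSVC (liblinear) is one-vs-rest: n_classes binary problems
lin = LinearSVC().fit(X, y)
print(lin.coef_.shape)  # (4, 20): one weight vector per class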

The same warning applies to generating probabilities: SVMs do not naturally produce probabilities for final predictions. So to get these (activated by the probability=True parameter), scikit-learn uses an expensive internal cross-validation procedure for Platt scaling, which will take a lot of time too!
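
If you want to measure that cost in isolation, a minimal sketch (synthetic data; the sample size is an illustrative assumption) fits the same model with and without probability estimates:

import time

from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)

# probability=True adds an internal cross-validated Platt-scaling step,
# so the second fit should be noticeably slower on identical data
for probability in (False, True):
    start = time.perf_counter()
    SVC(probability=probability).fit(X, y)
    print(f"probability={probability}: {time.perf_counter() - start:.2f}s")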

Scikit-learn documentation

Because sklearn has some of the best docs around, there is often a good section within these docs explaining something like this (link):

[Screenshot of the sklearn SVC documentation, noting that the implementation is based on libsvm and that the fit time scales at least quadratically with the number of samples, which makes it hard to scale beyond tens of thousands of samples.]

– sascha
  • So, for users who have a bunch of data, scikit-learn is not the best choice. I ran into this issue too: 800K examples cost me 2 hours. – GoingMyWay Dec 10 '17 at 06:44
  • @GoingMyWay, so does there exist a faster alternative? – Riley Jul 03 '19 at 06:47
  • @GoingMyWay I think that's a misunderstanding of the answer. The time complexity of the SVM algorithm with kernels is a general fact, independent of which package you use. It's inherent in using an SVM model, whether that's from sklearn or something in R. Unless you know of an algorithm for optimizing SVM parameters that magically improves upon this and that hasn't been implemented in sklearn yet, you won't gain anything by using another package. Regarding SVC, again, "one-vs-rest" or the alternatives are inherently what you need to do to use an SVM with multiple classes. – Marses Nov 19 '19 at 21:46
  • @GoingMyWay It sounds like your issue is perhaps that you think using SVMs with a kernel is too slow, but that's not a problem with sklearn; sklearn just implements the algorithm, and if the algorithm performs poorly in your case, it's because you chose the wrong algorithm. I'd be interested in finding out if you managed to find something without the drawbacks mentioned in the answer in the time since you made that comment. – Marses Nov 19 '19 at 21:49
  • The number one takeaway: rbf is the default kernel. For a first pass (and maybe even a final solution, depending upon your problem), linear is probably what you want to go with, which will yield large time savings. I personally would prefer that the user be made to specify the kernel parameter instead of it having a default, but there are arguments against that, and I hold no sway in the development of that library. – demongolem Mar 11 '20 at 15:18
3

If you are using an Intel CPU, then Intel has provided a solution for this. The Intel Extension for Scikit-learn offers a way to accelerate existing scikit-learn code. The acceleration is achieved through patching: the stock scikit-learn algorithms are replaced with the optimized versions provided by the extension. Follow these steps:

First, install the Intel extension package for scikit-learn:

pip install scikit-learn-intelex

Now add the following two lines at the top of the program:

from sklearnex import patch_sklearn
patch_sklearn()

Now run the program; it will be much faster than before.
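
One caveat worth showing in a minimal sketch (the toy data below is an assumption for illustration): patch_sklearn() must run before the scikit-learn estimators are imported, otherwise the already-imported stock classes are used:

from sklearnex import patch_sklearn
patch_sklearn()  # logs that the acceleration is enabled

# import sklearn estimators only *after* patching
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=20_000, n_features=20, random_state=0)
clf = SVC().fit(X, y)  # now backed by the Intel-optimized solver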

You can read more about it from the following link: https://intel.github.io/scikit-learn-intelex/

– AsadMajeed