
I am executing the scikit-learn SVM classifier (SVC) from Python 2.7.10 and it has been running for over 2 hours now. I have read the data using pandas.read_csv, preprocessed it, and then run:

from sklearn.svm import SVC

clf = SVC(C=0.001, kernel='linear', cache_size=7000, verbose=True)
clf.fit(X_train, y_train)

I have experience running classifiers (Random Forests and Deep Neural Networks) in H2O using R, and they never take this long! The machine I am running on has 16 GB RAM and an i7 with 3.6 GHz on each core. The task manager tells me that 8.6 GB of RAM are being used by Python, but only 13% of the CPU. I don't quite understand why it is so slow and not even using all of my resources.

The data has 12,000,000 rows and 22 columns, and the only verbose output sklearn is giving me is one line:

[LibSVM]

Is that normal behavior or should I see a lot more? Could anyone post the verbose output of an SVC run that finished? Also, can I do anything to speed things up besides lowering the C parameter? Using fewer rows is not really an option, since I want to benchmark algorithms and they wouldn't be comparable if different training data were used. Finally, can anyone explain why so little of my resources is being used?

  • My guess would be that libsvm is using just one core, and since you have an i7 you are just seeing a tiny part of the usage. Try checking the individual thread/core usage to validate this theory. – Pedrom Jul 21 '15 at 22:44
  • Even then the PC has 4 cores, so it could at least use 25% (i.e. one core to the max) instead of below 13%. Anyway, looking into multicore LibSVM now. Also, I lost patience and killed the process to try out whether the problem persists with all algorithms. k-NN ran through the training in under 20 minutes. Then again, k-NN is a far simpler and less complex algorithm. – Sebastian Hätälä Jul 21 '15 at 23:03
  • "Even then the PC has 4 cores" The i7 has 4 physical cores but they handle the work of 8 independent threads, so a single-core application gets just 1/8 ≈ 12.5% of the CPU, which actually coincides with your scenario. – Pedrom Jul 21 '15 at 23:15
  • Also keep in mind that SVM involves solving a quadratic programming optimization problem whose complexity is often O(n^3). Also, to avoid numerical problems that could lead to performance hits, it is recommended that the data be normalized to the [-1, 1] (or [0, 1]) range. – Pedrom Jul 21 '15 at 23:23
  • You are right, thank you! So to work faster I need to find a way to run scikit-learn in multiple threads. Anyone got any ideas about that? – Sebastian Hätälä Jul 21 '15 at 23:25
  • It seems like some people use joblib for that. http://stackoverflow.com/questions/24406937/scikit-learn-joblib-bug-multiprocessing-pool-self-value-out-of-range-for-i-fo – Pedrom Jul 22 '15 at 00:06
  • Things are not looking good: a complexity of `O(n^3)` would mean the training never terminates. The actual complexity might be lower, but still `O(n^2)` (see http://stackoverflow.com/questions/16585465/training-complexity-of-linear-svm). I would use a very low C and run small-scale experiments, increasing the number of rows to see the evolution of the training time before launching it on the full dataset (see the timing sketch after this comment thread). Let us know how the experiment turned out for you! – ldirer Jul 22 '15 at 14:06
  • You should use a different algorithm for that amount of data. Try for example LinearSVC. – elyase Jul 22 '15 at 19:52
  • All in all, k-NN and SVM (even linear SVM) are not applicable to such a large data set using just a single core. I experimented with getting sklearn/numpy to work with multiple cores but had no luck. Therefore, I am now using algorithms from the kit that have a lower complexity. Instead of SVM I use SGD with hinge loss (which fits a linear SVM) and instead of k-NN I am using NearestCentroid (kind of similar, since it also uses similarities); see the second sketch below. – Sebastian Hätälä Jul 28 '15 at 14:23
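
Below is a minimal sketch of the small-scale timing experiment ldirer suggests, assuming X_train and y_train are NumPy arrays holding the preprocessed data from the question; the subset sizes are only illustrative.

import time
import numpy as np
from sklearn.svm import SVC

# Time SVC on growing random subsets to see how the fit time scales
# before committing to the full 12,000,000 rows.
for n in [10000, 50000, 100000, 500000]:
    idx = np.random.choice(len(X_train), size=n, replace=False)
    clf = SVC(C=0.001, kernel='linear', cache_size=7000)
    start = time.time()
    clf.fit(X_train[idx], y_train[idx])
    print("%d rows: %.1f s" % (n, time.time() - start))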
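
And a rough sketch of the linear alternatives mentioned in the comments above (LinearSVC, and SGDClassifier with hinge loss), combined with the [0, 1] scaling Pedrom recommends; again, X_train and y_train come from the question.

from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import LinearSVC
from sklearn.linear_model import SGDClassifier

# Scale features to [0, 1] to avoid the numerical issues mentioned above.
X_scaled = MinMaxScaler().fit_transform(X_train)

# Option 1: liblinear-based linear SVM; no kernel matrix, far cheaper than SVC.
linear_clf = LinearSVC(C=0.001)
linear_clf.fit(X_scaled, y_train)

# Option 2: SGD with hinge loss, which also fits a linear SVM and
# scales to millions of rows.
sgd_clf = SGDClassifier(loss='hinge')
sgd_clf.fit(X_scaled, y_train)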

1 Answer


You can try using accelerated implementations of the algorithms, such as scikit-learn-intelex: https://github.com/intel/scikit-learn-intelex

For SVM you would certainly be able to get higher compute efficiency; however, for such a large dataset the training time would still be noticeable.

First, install the package:

pip install scikit-learn-intelex

Then add the following at the top of your Python script:

from sklearnex import patch_sklearn
patch_sklearn()
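
For example, reusing the snippet from the question (note that the patch must be applied before the scikit-learn estimators are imported, so that the accelerated implementations are picked up):

from sklearnex import patch_sklearn
patch_sklearn()

# Import scikit-learn only after patching so SVC resolves to the
# accelerated implementation.
from sklearn.svm import SVC

clf = SVC(C=0.001, kernel='linear', cache_size=7000, verbose=True)
clf.fit(X_train, y_train)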