
I am using SVC from scikit-learn on a large dataset of 10000x1000 (10000 objects with 1000 features). I have read in other sources that LIBSVM doesn't scale well beyond ~10000 objects, and I do indeed observe this:

training time for 10000 objects: 18.9s
training time for 12000 objects: 44.2s
training time for 14000 objects: 92.7s

You can imagine what happens when I try to go to 80000. However, what I found very surprising is that the SVM's predict() takes even more time than the training fit():

prediction time for 10000 objects (model was also trained on those objects): 49.0s
prediction time for 12000 objects (model was also trained on those objects): 91.5s
prediction time for 14000 objects (model was also trained on those objects): 141.84s

Getting prediction to run in time linear in the number of samples is trivial (and it might be close to linear here), and prediction is usually much faster than training. So what is going on here?

Bitwise
  • Are you doing predictions in batches? The `SVC.predict` method, unfortunately, incurs a lot of overhead because it has to reconstruct a LibSVM data structure similar to the one that the training algorithm produced, shallow-copy in the support vectors, and convert the test samples to a LibSVM format that may be different from the NumPy/SciPy formats. Therefore, prediction on a single sample is bound to be slow. – Fred Foo Mar 30 '13 at 14:40
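
To illustrate the batching point in the comment above, here is a minimal sketch that times one batched `predict` call against one call per sample. The synthetic data, its size, and the variable names are assumptions made for illustration, not the asker's setup, and the absolute timings will depend on the machine and the scikit-learn version.

```python
import time

import numpy as np
from sklearn.svm import SVC

# Hypothetical synthetic data, much smaller than the 10000x1000 set in the question.
rng = np.random.RandomState(0)
X = rng.randn(300, 20)
y = (X[:, 0] + 0.5 * rng.randn(300) > 0).astype(int)

clf = SVC(kernel="rbf").fit(X, y)

# One call on the whole array: the per-call conversion overhead is paid once.
t0 = time.perf_counter()
batch_pred = clf.predict(X)
batch_time = time.perf_counter() - t0

# One call per sample: the same overhead is paid for every row.
t0 = time.perf_counter()
single_pred = np.array([clf.predict(row.reshape(1, -1))[0] for row in X])
single_time = time.perf_counter() - t0

print("batched predict:    %.4fs" % batch_time)
print("per-sample predict: %.4fs" % single_time)
```

Both approaches should return the same labels; only the per-call overhead differs.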

1 Answer


Are you sure you do not include the training time in your measure of the prediction time? Do you have a code snippet for your timings?

ogrisel
  • Completely sure, I am timing only a single command: either fit() or predict(). I am experienced with these sorts of tools and it works fine for other models. – Bitwise Mar 29 '13 at 15:06
  • OK, then it's just that the number of support vectors is growing linearly with the training set (for instance, if almost all samples are selected as SVs), and the number of samples to predict is growing as well, hence the quadratic growth in prediction time. – ogrisel Mar 29 '13 at 16:52
  • Good point, that could indeed be the reason for the non-linear growth. You would expect linear growth only if you used the *same* model on a growing number of samples. However, I am still surprised that prediction is slower than training for a problem of this size. – Bitwise Mar 29 '13 at 17:50
  • Yes, those prediction times seem very slow. There might be another issue. How many different classes do you have in your dataset? What are the shapes of `svc.support_vectors_` and `svc.dual_coef_`? – ogrisel Mar 30 '13 at 21:41
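
The diagnostics asked for in the comments above could be collected with something like the following sketch. It times `fit()` and `predict()` separately and reports the number of classes and the shapes of `support_vectors_` and `dual_coef_`. The random data and labels are an assumption chosen so that many samples become support vectors; they are not the asker's dataset.

```python
import time

import numpy as np
from sklearn.svm import SVC

# Hypothetical data: random labels make the problem hard, so most samples
# tend to end up as support vectors.
rng = np.random.RandomState(0)
X = rng.randn(400, 50)
y = rng.randint(0, 2, size=400)

clf = SVC(kernel="rbf")

# Time fit() and predict() separately so neither measurement includes the other.
t0 = time.perf_counter()
clf.fit(X, y)
fit_time = time.perf_counter() - t0

t0 = time.perf_counter()
pred = clf.predict(X)
predict_time = time.perf_counter() - t0

n_sv = clf.support_vectors_.shape[0]
sv_fraction = n_sv / X.shape[0]
# Prediction evaluates the kernel between every test sample and every SV.
kernel_evals = X.shape[0] * n_sv

print("fit: %.4fs, predict: %.4fs" % (fit_time, predict_time))
print("classes:", clf.classes_)
print("support_vectors_ shape:", clf.support_vectors_.shape)
print("dual_coef_ shape:", clf.dual_coef_.shape)
print("fraction of samples that are SVs: %.2f" % sv_fraction)
print("kernel evaluations at predict time:", kernel_evals)
```

If the fraction of support vectors stays close to 1 as the training set grows, then predicting on the training set costs roughly n_samples × n_SV kernel evaluations, which matches the quadratic growth suggested above.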