
From my research, I found three conflicting results:

  1. SVC(kernel="linear") is better
  2. LinearSVC is better
  3. Doesn't matter

Can someone explain when to use LinearSVC vs. SVC(kernel="linear")?

It seems like LinearSVC is marginally better than SVC and is usually more finicky. But if scikit-learn went to the trouble of implementing a dedicated class for linear classification, why wouldn't LinearSVC outperform SVC?

desertnaut
THIS USER NEEDS HELP
  • It is not that scikit-learn developed a dedicated algorithm for linear SVM. Rather, they implemented interfaces on top of two popular existing implementations. The underlying C implementation for `LinearSVC` is liblinear, and the solver for `SVC` is libsvm. A third implementation is `SGDClassifier(loss="hinge")`. – David Maust Jan 29 '16 at 05:41
  • Possible duplicate of [Under what parameters are SVC and LinearSVC in scikit-learn equivalent?](http://stackoverflow.com/questions/33843981/under-what-parameters-are-svc-and-linearsvc-in-scikit-learn-equivalent) – lejlot Jan 31 '16 at 20:47

2 Answers


Mathematically, training an SVM is a convex optimization problem, usually with a unique minimizer. In other words, for a given formulation there is typically only one solution, so two solvers that optimize exactly the same objective should agree.

The differences in results come from several sources. SVC and LinearSVC are supposed to optimize the same problem, but in fact all liblinear estimators penalize the intercept, whereas the libsvm ones don't (IIRC). This leads to a different mathematical optimization problem and thus to different results. There may also be other subtle differences, such as scaling and the default loss function (edit: make sure you set `loss='hinge'` in LinearSVC). Finally, in multiclass classification, liblinear does one-vs-rest by default, whereas libsvm does one-vs-one.
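To make the points above concrete, here is a minimal sketch of bringing the two estimators closer together on a binary problem. The synthetic dataset, the seed, and the value `intercept_scaling=10.0` are arbitrary choices of mine, not recommendations; a larger `intercept_scaling` only weakens the intercept penalty, it does not remove it, so the fitted coefficients should be close but need not be identical.

```python
from sklearn.datasets import make_classification
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic binary problem; sizes and seed are arbitrary illustration choices.
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X = StandardScaler().fit_transform(X)

# libsvm-backed estimator: hinge loss, intercept not penalized.
svc = SVC(kernel="linear", C=1.0).fit(X, y)

# liblinear-backed estimator, nudged toward the same objective:
#  - loss="hinge" instead of the default squared hinge (requires dual=True)
#  - intercept_scaling > 1 to reduce the effect of the intercept penalty
lin = LinearSVC(
    loss="hinge",
    C=1.0,
    dual=True,
    intercept_scaling=10.0,
    tol=1e-6,
    max_iter=200_000,
    random_state=0,
).fit(X, y)

# Fraction of training points on which the two models predict the same label.
agreement = (svc.predict(X) == lin.predict(X)).mean()

print("SVC       coef:", svc.coef_.ravel(), "intercept:", svc.intercept_)
print("LinearSVC coef:", lin.coef_.ravel(), "intercept:", lin.intercept_)
print("prediction agreement:", agreement)
```

For more than two classes the remaining difference is the multiclass strategy (one-vs-rest versus one-vs-one), which this binary sketch deliberately avoids.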

`SGDClassifier(loss='hinge')` differs from the other two in that it uses stochastic gradient descent rather than an exact batch solver, so it may not converge to the same solution. However, the solution it finds may generalize better.
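As a rough illustration of what "stochastic" means here, the following is a simplified sketch of stochastic subgradient descent on an L2-regularized hinge loss, written in plain Python. It is not scikit-learn's implementation: the function name `sgd_hinge`, the constant step size `lr`, and the toy data are my own assumptions, and the real `SGDClassifier` uses a decaying learning-rate schedule by default.

```python
import random


def sgd_hinge(X, y, alpha=0.01, lr=0.1, epochs=50, seed=0):
    """Stochastic subgradient descent on alpha/2 * ||w||^2 + hinge loss.

    Labels must be in {-1, +1}. The intercept b is not regularized here.
    """
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)  # visit samples in a new random order each epoch
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            margin = y[i] * score
            # Step on the L2 penalty: shrink the weights slightly.
            w = [wj * (1.0 - lr * alpha) for wj in w]
            # Step on the hinge term only when the margin is violated.
            if margin < 1.0:
                w = [wj + lr * y[i] * xj for wj, xj in zip(w, X[i])]
                b += lr * y[i]
    return w, b


# Tiny linearly separable toy set (hypothetical data).
X = [[2.0, 1.0], [1.5, 2.0], [3.0, 3.0], [-2.0, -1.0], [-1.0, -2.5], [-3.0, -2.0]]
y = [1, 1, 1, -1, -1, -1]

w, b = sgd_hinge(X, y)
predictions = [
    1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1 for x in X
]
print("w:", w, "b:", b)
print("predictions:", predictions)
```

Because each update looks at a single sample, the final `w` depends on the visiting order and the step size, which is why the result need not coincide with what a batch solver returns.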

Between SVC and LinearSVC, one important decision criterion is that LinearSVC tends to converge faster as the number of samples grows. This is because the linear kernel is a special case that liblinear is optimized for, but libsvm is not.

eickenberg
  • In the [official documentation of scikit-learn](https://scikit-learn.org/stable/modules/svm.html#mathematical-formulation), the math formulation doesn't seem to indicate that the intercept is penalized. Or do I misunderstand? – John Smith Oct 08 '20 at 07:47

The actual problem lies in scikit-learn's approach, where they call something an SVM that is not an SVM. By default, LinearSVC minimizes the squared hinge loss instead of the plain hinge loss; furthermore, it penalizes the size of the bias (which is not SVM). For more details, refer to the other question: [Under what parameters are SVC and LinearSVC in scikit-learn equivalent?](http://stackoverflow.com/questions/33843981/under-what-parameters-are-svc-and-linearsvc-in-scikit-learn-equivalent)
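Written out, the two objectives look roughly as follows. This is my own notation, not taken from the scikit-learn docs: $w$ is the weight vector, $b$ the bias, $C$ the regularization parameter, and $s$ stands for `intercept_scaling`, under the assumption that liblinear handles the bias by appending a constant feature of value $s$ to every sample.

```latex
% SVC(kernel="linear"): hinge loss, bias b not penalized
\min_{w,\,b}\; \frac{1}{2}\lVert w\rVert^2
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)

% LinearSVC with default settings: squared hinge loss, bias penalized via (b/s)^2
\min_{w,\,b}\; \frac{1}{2}\Bigl(\lVert w\rVert^2 + \bigl(\tfrac{b}{s}\bigr)^2\Bigr)
  + C \sum_{i=1}^{n} \max\bigl(0,\; 1 - y_i\,(w^\top x_i + b)\bigr)^2
```

The two formulas differ in exactly the two places named above: the exponent on the loss term and the extra $(b/s)^2$ term in the penalty, which shrinks as $s$ grows.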

So which one to use? It is purely problem-specific. By the no free lunch theorem, it is impossible to say "this loss function is best, period". Sometimes the squared hinge loss will work better, sometimes the plain hinge loss.
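For reference, the two losses being compared can be sketched as functions of the margin `y * f(x)`. The function names are mine; the sample margins are arbitrary.

```python
def hinge(margin):
    """Plain hinge loss: grows linearly once the margin drops below 1."""
    return max(0.0, 1.0 - margin)


def squared_hinge(margin):
    """Squared hinge loss: grows quadratically once the margin drops below 1."""
    return max(0.0, 1.0 - margin) ** 2


# Compare both losses on a few margins, from badly wrong to safely right.
for m in (-1.0, 0.0, 0.5, 1.0, 2.0):
    print(f"margin={m:+.1f}  hinge={hinge(m):.2f}  squared_hinge={squared_hinge(m):.2f}")
```

The squared version punishes badly misclassified points more heavily and mild violations (margins between 0 and 1) more lightly, which is one reason neither loss dominates on every dataset.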

lejlot
  • you are right about the squared hinge loss (I mention the losses are different also). But setting it to hinge still doesn't make them yield the same answer as the SVC with linear kernel. – eickenberg Jan 31 '16 at 20:07
  • as I said - this is also about penalizing bias, see my other answer – lejlot Jan 31 '16 at 20:17
  • Indeed, so this question is pretty much a duplicate of what you have already answered. But it is important to note that `LinearSVC` is not useless - it should scale better than the generic kernel methods. – eickenberg Jan 31 '16 at 20:37
  • sure, "no free lunch theorem", every classifier has its niche – lejlot Jan 31 '16 at 20:47