
I'm trying to understand the difference between RidgeClassifier and LogisticRegression in sklearn.linear_model. I couldn't find it in the documentation.

I think I understand quite well what LogisticRegression does. It computes the coefficients and intercept that minimise half the sum of squares of the coefficients plus C times the binary cross-entropy loss, where C is the regularisation parameter. I checked this against a naive implementation from scratch, and the results coincide.
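
For concreteness, here is roughly the kind of from-scratch check I mean, just as a sketch; the dataset, the C value and the optimiser are arbitrary choices for illustration:

```python
import numpy as np
from scipy.optimize import minimize
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
y_pm = 2 * y - 1  # labels as +1 / -1
C = 1.0

def objective(params):
    # 0.5 * ||w||^2 + C * sum_i log(1 + exp(-y_i * (x_i.w + b))), intercept not penalised
    w, b = params[:-1], params[-1]
    z = X @ w + b
    return 0.5 * w @ w + C * np.logaddexp(0, -y_pm * z).sum()

res = minimize(objective, np.zeros(X.shape[1] + 1), method="BFGS")
clf = LogisticRegression(C=C, penalty="l2", solver="lbfgs").fit(X, y)

print(res.x[:-1], clf.coef_.ravel())   # agree to several decimal places
print(res.x[-1], clf.intercept_[0])
```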

The results of RidgeClassifier differ, and I couldn't figure out how the coefficients and intercept are computed there. I looked at the GitHub code, but I'm not experienced enough to untangle it.

The reason why I'm asking is that I like the RidgeClassifier results -- it generalises a bit better to my problem. But before I use it, I would like to at least have an idea of where it comes from.

Thanks for any help.

Peter Franek
    Have you read about regularization in Machine Learning? – Sociopath Dec 24 '18 at 09:59
  • Maybe this can help: https://scikit-learn.org/stable/modules/linear_model.html#ridge-regression – Vivek Kumar Dec 24 '18 at 09:59
  • 2
    @Sociopath Yes. A basic l2-regularization is present in the LogisticRegression() already, as outlined in the text. – Peter Franek Dec 24 '18 at 10:05
  • 2
    @VivekKumar Have been there, thanks -- but unfortunately, it only added to my confusion because everything in the documentation looks like it should be the same thing as LogisticRegression. (From the docs it would seem that the alpha should coincide with 1/C from LogisticRegression) – Peter Franek Dec 24 '18 at 10:06

1 Answer


RidgeClassifier() works differently from LogisticRegression() with an l2 penalty. The loss function for RidgeClassifier() is not cross-entropy.

RidgeClassifier() uses the Ridge() regression model in the following way to create a classifier:

Let us consider binary classification for simplicity.

  1. Convert the target variable to +1 or -1 according to the class it belongs to.

  2. Build a Ridge() model (which is a regression model) to predict the target variable. The loss function is MSE + the l2 penalty.

  3. If the Ridge() regression's prediction (the value returned by decision_function()) is greater than 0, predict the positive class; otherwise predict the negative class. (See the sketch after this list.)
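
A rough sketch of what these three steps amount to; this is not the actual scikit-learn source, and the dataset and alpha value below are just illustrative:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import Ridge, RidgeClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Step 1: encode the two classes as -1 / +1
y_pm = np.where(y == 1, 1.0, -1.0)

# Step 2: ordinary ridge regression (squared error + l2 penalty) on those targets
reg = Ridge(alpha=1.0).fit(X, y_pm)

# Step 3: threshold the regression output at 0
pred_manual = (reg.predict(X) > 0).astype(int)

clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(np.allclose(reg.coef_, clf.coef_.ravel()))    # coefficients should match (up to tolerance)
print(np.array_equal(pred_manual, clf.predict(X)))  # predictions should match
```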

For multi-class classification:

  1. Use LabelBinarizer() to create a multi-output regression scenario, and then train independent Ridge() regression models, one for each class (One-Vs-Rest modelling).

  2. Get the prediction from each class's Ridge() regression model (a real number per class) and take the argmax to predict the class (sketched below).
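
A similar sketch for the multi-class case, again with illustrative data and alpha rather than the actual implementation:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import Ridge, RidgeClassifier
from sklearn.preprocessing import LabelBinarizer

X, y = load_iris(return_X_y=True)

# One-vs-rest targets in {-1, +1}, one column per class
Y = LabelBinarizer(neg_label=-1, pos_label=1).fit_transform(y)

# A multi-output Ridge fit is equivalent to one independent Ridge per column
reg = Ridge(alpha=1.0).fit(X, Y)

# Predict the class whose regression output is largest
pred_manual = np.argmax(reg.predict(X), axis=1)

clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(np.array_equal(pred_manual, clf.predict(X)))  # should be True
```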

Venkatachalam
    Thanks, yes, but the results of RidgeClassifier with default parameters (alpha = 1) coincide neither with the l2-regularized case nor with the unregularized case (C = infinity in LogReg). I'm just trying to find out what it does. Unfortunately, my question is more focused on a particular implementation than on the regression math in general :-( – Peter Franek Dec 24 '18 at 10:16
  • 1
    I really appreciate your time and effort to help me. Hate to say it, but it still doesn't answer my question. (Let's omit the discussion on the "half" that you removed.) The point is that the problem of solving l2-regularized LogReg is so simple (and, moreover, convex) that essentially *any* method converges to the same solution, and very fast. As outlined in the text, I compared the LogisticRegression method with my own naive implementation (basic gradient descent, any reasonable number of steps...) and the results coincide up to 5 decimal places. But Ridge returns something completely different... – Peter Franek Dec 24 '18 at 10:40
  • 1
    I did not remove the half in your loss function, I just replaced the C value. – Venkatachalam Dec 24 '18 at 10:47
  • 2
    Found the reason, updating my answer. In one line: they use the Ridge regression model to build the RidgeClassifier. – Venkatachalam Dec 24 '18 at 10:54
  • I'm also trying to read that code -- it seems like they first convert the values to +1's and -1's, and then treat it as a "continuous" regression problem (forgetting that it's a classification problem for a while). Is that right? The core may be the "super" on line 853 that calls the _BaseRidge fit method... https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/linear_model/ridge.py – Peter Franek Dec 24 '18 at 10:58
  • 1
    Yes, exactly. Then use the decision function to find the class. – Venkatachalam Dec 24 '18 at 10:59