9

This might seem like a stupid question, but I just can't come up with a reasonable answer.

It is said that regularization can help us obtain simple models over complex ones to avoid over-fitting. But for a linear classification problem:

f(x) = Wx

The complexity of the model is already fixed: it's linear, not quadratic or something more complex. So why do we still need regularization on the parameters? Why do we prefer smaller weights in such cases?

Amir
Demonedge
  • Is your question: Why does shrinking the parameters W to zero reduce the model complexity? Anyway - should probably be migrated to stats. – cel Jan 14 '16 at 14:01
  • Nope, I am asking why we need R(w) in f(x)=wx+R(w). Because I think in linear classification the complexity of the model is the same for any w we choose. So why do we prefer the smaller ones? – Demonedge Jan 14 '16 at 14:16
  • 1
    Well, if you don't want to know the answer to my question, I can easily answer yours: because we want to reduce model complexity. A smaller `w` vector leads to a less complex model, and less complex models are often preferred. See https://en.wikipedia.org/wiki/Occam%27s_razor for a philosophical point of view, or https://en.wikipedia.org/wiki/Regularization_(mathematics) for a more mathematical point of view. – cel Jan 14 '16 at 14:22
  • Although, imo the Wikipedia article is not that good, because it fails to give an intuition of HOW regularization helps to fight overfitting. There's an excellent section about that in "Pattern Recognition and Machine Learning" by Christopher Bishop, but it does not seem like there's a free preview for that chapter. – cel Jan 14 '16 at 14:27
  • What I don't understand is why a different w changes the complexity of the model. We measure the complexity of a model by its number of parameters, or by its choice of hypothesis (linear, quadratic, cubic or something else). But in linear classification, all of these are the same for any choice of w. So why do different w lead to different model complexities? – Demonedge Jan 14 '16 at 14:33
  • Because all regularization techniques "shrink" `w` towards zero. Then you only have to understand why shrinking the parameters `w` to zero reduces model complexity and you have an intuitive understanding. – cel Jan 14 '16 at 14:36
  • I would recommend reading Section 3 in [The Elements of Statistical Learning](https://books.google.ru/books?id=VRzITwgNV2UC&redir_esc=y). Best-subset selection drops variables in a discrete manner, whereas shrinkage drops variables in a continuous manner. Why do we need to drop some variables? Just garbage collection. – serge_k Jan 15 '16 at 10:17

3 Answers

4

The need to regularize a model tends to decrease as you increase the number of samples you train the model with, or as you reduce the model's complexity. However, the number of examples needed to train a model without regularization (or with a very small regularization effect) grows [super]exponentially with the number of parameters, and possibly with other factors inherent in the model.

Since in most machine learning problems we do not have the required number of training samples, or the model complexity is large, we have to use regularization in order to avoid, or at least lessen, the possibility of over-fitting. Intuitively, regularization works by adding a penalty term to the objective argmin_W ∑ L(desired, predictionFunction(Wx)), where L is a loss function that measures how much the model's predictions deviate from the desired targets. The new objective becomes argmin_W ∑ L(desired, predictionFunction(Wx)) + lambda * reg(W), where reg is a type of regularization (e.g. the squared L2 norm) and lambda is a coefficient that controls the strength of the regularization. While minimizing this cost function, the weight vector is then pushed to have a small squared length (e.g. small squared L2 norm) and shrinks towards zero, because a larger squared length of the weight vector now means a higher loss. The weight vector therefore has to balance fitting the data against keeping its own norm small while the optimization is running.
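
To make this concrete, here is a minimal sketch of the penalized objective in NumPy (the function and variable names are mine, and the squared-error loss is just one possible choice of L):

```python
import numpy as np

def penalized_loss(W, X, y, lam, loss):
    """Regularized objective: sum_i L(y_i, W·x_i) + lam * ||W||_2^2."""
    predictions = X @ W                       # linear model f(x) = Wx
    data_term = np.sum(loss(y, predictions))  # how far predictions deviate from the targets
    reg_term = lam * np.sum(W ** 2)           # squared-L2 penalty on the weights
    return data_term + reg_term

# Example with a squared-error loss; any suitable loss L works the same way.
squared_error = lambda y_true, y_pred: (y_true - y_pred) ** 2

rng = np.random.default_rng(0)
X, y, W = rng.normal(size=(10, 3)), rng.normal(size=10), rng.normal(size=3)
print(penalized_loss(W, X, y, lam=0.0, loss=squared_error))   # data term only
print(penalized_loss(W, X, y, lam=10.0, loss=squared_error))  # the same W now costs more
```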

Now imagine you remove the regularization term (lambda = 0). The model parameters are then free to take any values, so the squared length of the weight vector can grow without bound, whether the model is linear or non-linear. This adds another dimension to the complexity of the model (in addition to the number of parameters), and the optimization procedure may find weight vectors that exactly match the training data points. However, when exposed to unseen (validation or test) data, the model will not generalize well, because it has over-fitted to the training data.
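
As a rough illustration of that last point (a sketch on made-up synthetic data, assuming NumPy and using the closed-form ridge solution instead of an iterative optimizer): with more parameters than training samples and lambda = 0, the fitted weights match the training data almost exactly but typically generalize worse than a regularized fit with smaller weights.

```python
import numpy as np

rng = np.random.default_rng(1)
n_train, n_test, p = 15, 200, 40          # more parameters than training samples
true_W = np.zeros(p)
true_W[:3] = [1.0, -2.0, 0.5]             # only three weights actually matter
X_train = rng.normal(size=(n_train, p))
X_test = rng.normal(size=(n_test, p))
y_train = X_train @ true_W + rng.normal(scale=0.5, size=n_train)
y_test = X_test @ true_W + rng.normal(scale=0.5, size=n_test)

def fit_linear(X, y, lam):
    # Minimizes ||y - XW||^2 + lam * ||W||^2; lam = 0 is the unregularized fit.
    return np.linalg.pinv(X.T @ X + lam * np.eye(X.shape[1])) @ X.T @ y

for lam in (0.0, 10.0):
    W = fit_linear(X_train, y_train, lam)
    print(f"lambda={lam:4.1f}  ||W||={np.linalg.norm(W):5.2f}  "
          f"train MSE={np.mean((X_train @ W - y_train) ** 2):.3f}  "
          f"test MSE={np.mean((X_test @ W - y_test) ** 2):.3f}")
```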

Amir
0

Regularization is used in machine learning models to cope with the problem of overfitting, i.e. when the gap between the training error and the test error is too large. With linear models like logistic regression, the model might perform very well on your training data because it tries to predict each training point with great precision. This leads to overfitting, since the model may end up fitting the outliers as well, which can seriously hurt its performance on unseen data.

[Image: the logistic regression objective with an L2 regularization term]

The equation above shows the logistic regression objective with an L2 regularizer, whose lambda parameter controls how much weight the penalty gets relative to the loss part. The value of lambda should not be too high, however, because that leads to underfitting and your model ends up learning almost nothing from the data.
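
If you want to see that trade-off directly, here is a rough sketch using scikit-learn (the dataset and the values of C are only illustrative; note that in LogisticRegression, C is the inverse of lambda, so a small C means strong regularization):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=200, n_features=50, n_informative=5,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for C in (1e4, 1.0, 1e-4):  # weak -> moderate -> very strong regularization
    clf = LogisticRegression(penalty="l2", C=C, max_iter=5000).fit(X_tr, y_tr)
    print(f"C={C:10.4f}  ||w||={np.linalg.norm(clf.coef_):6.2f}  "
          f"train acc={clf.score(X_tr, y_tr):.2f}  test acc={clf.score(X_te, y_te):.2f}")
```

With a very small C (i.e. a very large lambda), both training and test accuracy should drop, which is exactly the underfitting case described above.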

Mayur Karmur
Aditya
0

The major reason for using regularization is to overcome the issue of overfitting. When your model fits the data too well, i.e. captures the noise as well, regularization penalizes the weights. You can read more, and get the mathematical intuition along with implementation details, in the Reference.
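
For a quick sense of what "penalizes the weights" means in practice, here is a minimal sketch (assuming NumPy, a squared loss and plain gradient descent; the names are only illustrative):

```python
import numpy as np

def gradient_step(W, X, y, lr=0.1, lam=0.01):
    """One gradient-descent step on a squared loss with an L2 penalty.
    The extra 2 * lam * W term is what penalizes the weights: each step
    it pulls W a little further back towards zero (weight decay)."""
    grad_data = 2 * X.T @ (X @ W - y)   # gradient of the data-fit term ||XW - y||^2
    grad_reg = 2 * lam * W              # gradient of the penalty lam * ||W||^2
    return W - lr * (grad_data + grad_reg)
```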