
I am currently learning ML on Coursera through Andrew Ng's Machine Learning course. I am doing the assignments in Python because I am more used to it than Matlab. I have recently run into a problem with my understanding of regularization. My understanding is that regularization lets you keep less important features in the model while shrinking their influence on the prediction. But while implementing it, I don't understand why the first element of theta (the parameters), i.e. theta[0], is skipped when calculating the regularized cost. I have looked at other solutions, but they do the same skipping without explanation.

Here is the code:

```python
import numpy as np

# h(theta, X) is the hypothesis sigmoid(X @ theta); sigmoid, X, y, theta, m and lambda_
# are defined elsewhere in the assignment.
term1 = np.dot(-np.array(y).T, np.log(h(theta, X)))
term2 = np.dot((1 - np.array(y)).T, np.log(1 - h(theta, X)))
regterm = (lambda_ / 2) * np.sum(np.dot(theta[1:].T, theta[1:]))  # Skip theta0. Explain this line
J = float((1 / m) * (np.sum(term1 - term2) + regterm))
grad = np.dot((sigmoid(np.dot(X, theta)) - y), X) / m
grad_reg = grad + ((lambda_ / m) * theta)
grad_reg[0] = grad[0]  # the bias gradient is left unregularized
```

And here is the formula:

J(theta) = (1/m) * sum_{i=1..m} [ -y_i * log(h(x_i)) - (1 - y_i) * log(1 - h(x_i)) ] + (lambda/(2m)) * sum_{j=1..n} theta_j^2

Here J(theta) is the cost function, h(x) is the sigmoid function (the hypothesis), and lambda is the regularization parameter.
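
For reference, here is a minimal self-contained sketch of that cost (assuming X already has a leading column of ones and y is a 0/1 vector; the name compute_cost is just illustrative). Note that the regularization sum starts at theta[1]:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def compute_cost(theta, X, y, lambda_):
    """Regularized logistic-regression cost; the bias theta[0] is not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)                                  # hypothesis h(x) for every example
    unreg = (-y @ np.log(h) - (1 - y) @ np.log(1 - h)) / m  # cross-entropy term
    reg = (lambda_ / (2 * m)) * np.sum(theta[1:] ** 2)      # penalty: sum starts at j = 1
    return unreg + reg
```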

Aditya
  • I believe the proper place for this question is [stats.stackexchange.com](http://stats.stackexchange.com). There you might have better luck with a more formal explanation (e.g. they have MathML support). Please link the question here if you decide to move it there. Anyway, I'd advise to look at the same problem in Ridge Regression, ESLII has some note about this (section 3.4.1) which I honestly didn't find quite satisfactory... – filippo Jan 03 '19 at 08:37
  • @filippo I don't mind moving it. Tell me how to and I'll do it. – Aditya Jan 03 '19 at 12:35
  • I guess only mods can migrate questions, but you could just ask a new one there – filippo Jan 03 '19 at 14:06

1 Answer


Theta0 refers to the bias. Bias comes into the picture when we want our decision boundaries to be positioned properly. Consider an example:

Y1 = w1 * X and Y2 = w2 * X

When the values of X come close to zero, it can become very hard to separate the two; this is where bias comes into play:

Y1 = w1 * X + b1 and Y2 = w2 * X + b2

Now, through learning, the decision boundaries will stay distinct.
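
To see what the bias buys you, here is a tiny illustrative sketch (the numbers are made up): without a bias, a logistic unit's decision boundary sigmoid(w * x) = 0.5 is pinned at x = 0 no matter what w is learned, while a bias lets learning move it.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x = np.array([-0.1, 0.1, 0.2, 0.4])   # 1-D feature values near zero
y = np.array([0, 0, 1, 1])            # desired labels: the boundary should sit near x = 0.15

# Without a bias, sigmoid(w * x) = 0.5 exactly at x = 0 for any w,
# so the boundary can never move to 0.15 and x = 0.1 is misclassified.
w = 10.0
print((sigmoid(w * x) >= 0.5).astype(int))      # -> [0 1 1 1]

# With a bias b, the boundary sits at x = -b/w, which learning can place anywhere.
b = -1.5                                        # boundary at -b/w = 0.15
print((sigmoid(w * x + b) >= 0.5).astype(int))  # -> [0 0 1 1]
```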

Now let's consider why we use regularization.

We regularize so that we don't over-fit, i.e. to smooth the curve. As you can see from the equations, it is the slopes w1 and w2 that need smoothing; the biases are just the intercepts of the separating lines. So there is no point in including them in regularization.

Although we can include them, in the case of neural networks it won't make much difference. But we might shrink the bias values so much that the decision boundary ends up misplacing data points. Thus, it's better not to include the bias in regularization.
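
To make this concrete, here is a minimal sketch (not the course's reference solution; sigmoid and the variable names mirror the question's snippet) of the regularized gradient with the bias excluded from the penalty:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_gradient(theta, X, y, lambda_):
    """Gradient of the regularized cost; the bias theta[0] gets no penalty."""
    m = len(y)
    grad = np.dot(X.T, sigmoid(np.dot(X, theta)) - y) / m  # data term, for every parameter
    penalty = (lambda_ / m) * theta                         # shrinks parameters toward zero
    penalty[0] = 0.0                                        # but never the bias/intercept
    return grad + penalty
```

During gradient descent this lets the intercept move freely while the slope coefficients are pulled toward zero, which is exactly what `grad_reg[0] = grad[0]` achieves in your code.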

Hope this answers your question. Originally published at: https://medium.com/@shrutijadon10104776/why-we-dont-use-bias-in-regularization-5a86905dfcd6

Chinmay Das
  • But when calculating the gradient we use the bias as well, why? And how do I know when to include the bias and when not? And how do I know how many parameters I should consider for theta? – Aditya Jan 03 '19 at 08:44
  • **How can I know how many parameters I should consider for theta?** Keep adding parameters as long as the difference between the test-set error and the training-set error is not too high. For the first question, refer to this [link](https://stackoverflow.com/questions/2480650/role-of-bias-in-neural-networks) – Chinmay Das Jan 03 '19 at 10:33