75

What is the benefit of using gradient descent in the linear regression space? It looks like we can solve the problem (finding the theta0-n that minimize the cost function) with an analytical method, so why would we still want to use gradient descent to do the same thing? Thanks

John
  • This is a great question. It's very common for lecturers to go directly into gradient descent to find the solution, which is confusing when a student remembers that the ordinary least squares solution doesn't require an optimization algorithm; confusion which could quickly be dispensed with by acknowledging what @jabaldonedo has provided here. – Merlin Nov 10 '17 at 18:55

4 Answers

110

When you use the normal equations to solve the cost function analytically, you have to compute:

$\theta = (X^\top X)^{-1} X^\top y$

where X is your matrix of input observations and y is your output vector. The problem with this operation is the time complexity of calculating the inverse of an n×n matrix, which is O(n^3); as n increases it can take a very long time to finish.

When n is low (n < 1000 or n < 10000) you can think of the normal equations as the better option for calculating theta; however, for greater values gradient descent is much faster, so the only reason is the time :)
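As an aside (a minimal NumPy sketch of the normal equations above, not part of the original answer), note that in practice you would solve the linear system rather than explicitly inverting $X^\top X$:

```python
import numpy as np

# Toy data: m observations, n features (the "n" discussed above).
rng = np.random.default_rng(0)
m, n = 1000, 10
X = rng.standard_normal((m, n))
y = X @ rng.standard_normal(n) + 0.1 * rng.standard_normal(m)

# Normal equations: theta = (X^T X)^{-1} X^T y.
# np.linalg.solve is cheaper and more stable than forming the inverse.
theta = np.linalg.solve(X.T @ X, X.T @ y)
```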

jabaldonedo
  • Is n the number of samples or features? – Mike Vella Nov 30 '13 at 03:41
  • n is the number of features. – gavinmh Mar 31 '14 at 02:11
  • This is not necessarily the bottleneck. To even use the normal equations, we typically make a non-singular assumption so that $n > p$ (here I'm using the notation that $n$ is the number of data points and $p$ is the number of features). This means the bottleneck is $O(np^2)$ to form $X^\top X$, not the $O(p^3)$ inversion. – Dustin Tran Mar 22 '16 at 22:50
  • Just a nitty-gritty... matrix inversion can be performed below O(n^3): https://en.wikipedia.org/wiki/Computational_complexity_of_mathematical_operations#Matrix_algebra – fr_andres Jul 19 '17 at 00:01
  • I was excited for a second after reading @fr_andres's comment, but the best matrix inversion is O(n^2.4). – Merlin Nov 10 '17 at 19:01
  • To be consistent with the nitty-grittiness, it is actually less than 2.4 :^) – fr_andres Nov 11 '17 at 22:17
  • As a side note - for very large matrices X, the way the inverse X^-1 is computed is using gradient descent (well, a more complex version of it - the preconditioned conjugate gradient). I am not just talking in this context, I mean any inverse, which is kind of cool. For example, MATLAB has this implementation: https://www.mathworks.com/help/matlab/ref/pcg.html – Jesper - jtk.eth Apr 18 '19 at 00:39
13

You should provide more details about your problem - what exactly are you asking about? Are we talking about linear regression in one or many dimensions? Simple or generalized?

In general, why do people use GD?

  • it is easy to implement (see the sketch after this list)
  • it is a very generic optimization technique - even if you change your model to a more general one, you can still use it
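For concreteness, here is a minimal sketch of batch gradient descent for least-squares linear regression (my own illustration, assuming NumPy; the learning rate and iteration count are arbitrary placeholders):

```python
import numpy as np

def gradient_descent(X, y, lr=0.01, n_iters=1000):
    """Batch gradient descent for least-squares linear regression."""
    m, n = X.shape
    theta = np.zeros(n)
    for _ in range(n_iters):
        # Gradient of the mean squared error cost: (1/m) * X^T (X theta - y)
        grad = X.T @ (X @ theta - y) / m
        theta -= lr * grad
    return theta
```

Changing the model only means changing the gradient line, which is exactly the genericity mentioned above.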

So what about analytical solutions? Well, we do use them; your claim is simply false here (if we are talking in general). For example, the OLS method is a closed-form, analytical solution which is widely used. If you can use the analytical solution and it is computationally affordable (GD is sometimes simply cheaper or faster), then you can, and even should, use it.

Nevertheless, this is always a matter of pros and cons - analytical solutions are strongly tied to the model, so implementing them can be inefficient if you plan to generalize/change your models in the future. They are sometimes less efficient than their numerical approximations, and sometimes they are simply harder to implement. If none of the above is true, you should use the analytical solution, and people really do.

To sum up, you would rather use GD over an analytical solution if:

  • you are considering changes in the model, generalizations, adding some more complex terms/regularization/modifications
  • you need a generic method because you do not know much about the future of the code and the model (you are only one of the developers)
  • analytical solution is more expensive computationally, and you need efficiency
  • analytical solution requires more memory, which you do not have
  • analytical solution is hard to implement and you need easy, simple code
lejlot
7

I saw a very good answer at https://stats.stackexchange.com/questions/23128/solving-for-regression-parameters-in-closed-form-vs-gradient-descent

Basically, the reasons are:

1. For most nonlinear regression problems there is no closed-form solution.

2. Even in linear regression (one of the few cases where a closed-form solution is available), it may be impractical to use the formula; the example in the linked answer shows one way in which this can happen.
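The example itself is in the linked answer; as a stand-in sketch (my own construction, assuming NumPy), nearly collinear features make $X^\top X$ ill-conditioned, so the closed-form inverse amplifies rounding error:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100)
X = np.column_stack([x, x + 1e-9])   # two almost identical columns
y = 3.0 * x

print(np.linalg.cond(X.T @ X))       # huge condition number

# A QR/SVD-based solver (or an iterative method) is far better behaved
# here than explicitly inverting X^T X.
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
```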

enaJ
0

Another reason is that gradient descent is immediately useful when you generalize linear regression, especially if the problem doesn't have a closed-form solution, as for example in Lasso (which adds a regularization term consisting of the sum of absolute values of the weight vector).
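As a sketch of that point (my own illustration, assuming NumPy; `lam`, `lr`, and `n_iters` are arbitrary placeholders), Lasso is commonly fit with proximal gradient descent (ISTA), which is plain gradient descent on the squared loss plus a soft-thresholding step; no analogous closed-form update exists:

```python
import numpy as np

def lasso_ista(X, y, lam=0.1, lr=0.01, n_iters=1000):
    """Proximal gradient descent (ISTA) for the Lasso objective
    (1/2m)||Xw - y||^2 + lam * ||w||_1."""
    m, n = X.shape
    w = np.zeros(n)
    for _ in range(n_iters):
        w = w - lr * (X.T @ (X @ w - y) / m)   # gradient step on squared loss
        # Soft-thresholding: the proximal operator of lam * ||w||_1
        w = np.sign(w) * np.maximum(np.abs(w) - lr * lam, 0.0)
    return w
```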

Jakub Bartczuk