6

I'm working on machine learning problem and want to use linear regression as learning algorithm. I have implemented 2 different methods to find parameters theta of linear regression model: Gradient (steepest) descent and Normal equation. On the same data they should both give approximately equal theta vector. However they do not.

Both theta vectors are very similar on all elements but the first one. That is the one used to multiply vector of all 1 added to the data.

Here is how the thetas look like (fist column is output of Gradient descent, second output of Normal equation):

Grad desc Norm eq
-237.7752 -4.6736
-5.8471   -5.8467
9.9174    9.9178
2.1135    2.1134
-1.5001   -1.5003
-37.8558  -37.8505
-1.1024   -1.1116
-19.2969  -19.2956
66.6423   66.6447
297.3666  296.7604
-741.9281 -744.1541
296.4649  296.3494
146.0304  144.4158
-2.9978   -2.9976
-0.8190   -0.8189

What can cause the difference in theta(1, 1) returned by gradient descent compared to theta(1, 1) returned by normal equation? Do I have bug in my code?

Here is my implementation of normal equation in Matlab:

function theta = normalEque(X, y)
    [m, n] = size(X);
    X = [ones(m, 1), X];
    theta = pinv(X'*X)*X'*y;
end

Here is code for gradient descent:

function theta = gradientDesc(X, y)
    options = optimset('GradObj', 'on', 'MaxIter',  9999);
    [theta, ~, ~] = fminunc(@(t)(cost(t, X, y)),...
                    zeros(size(X, 2), 1), options);
end

function [J, grad] = cost(theta, X, y)
    m = size(X, 1);
    X = [ones(m, 1), X];
    J = sum((X * theta - y) .^ 2) ./ (2*m);
    for i = 1:size(theta, 1)
        grad(i, 1) = sum((X * theta - y) .* X(:, i)) ./ m;
    end
end

I pass exactly the same data X and y to both functions (I do not normalize X).

Edit 1:

Based on answers and comments I checked few my code and run some tests.

First I want to check if the problem can be cause by X beeing near singular as suggested by @user1489497's answer. So I replaced pinv by inv - and when run it I really got warning Matrix is close to singular or badly scaled.. To be sure that that is not the problem I obtained much larger dataset and run tests with this new dataset. This time inv(X) did not display the warning and using pinv and inv gave same results. So I hope that X is not close to singular any more.

Then I changed normalEque code as suggested by woodchips so now it looks like:

function theta = normalEque(X, y)
    X = [ones(size(X, 1), 1), X];
    theta = pinv(X)*y;
end

However the problem is still there. New normalEque function on new data that are not close to singular gives different theta as gradientDesc.

To find out which algorithm is buggy I have run linear regression algorithm of data mining software Weka on the same data. Weka computed theta very similar to output of normalEque but different to the output of gradientDesc. So I guess that normalEque is correct and there is a bug in gradientDesc.

Here is comparison of thetas computed by Weka, normalEque and GradientDesc:

Weka(correct) normalEque    gradientDesc
779.8229      779.8163      302.7994
  1.6571        1.6571        1.7064
  1.8430        1.8431        2.3809
 -1.5945       -1.5945       -1.5964
  3.8190        3.8195        5.7486
 -4.8265       -4.8284      -11.1071
 -6.9000       -6.9006      -11.8924
-15.6956      -15.6958      -13.5411
 43.5561       43.5571       31.5036
-44.5380      -44.5386      -26.5137
  0.9935        0.9926        1.2153
 -3.1556       -3.1576       -1.8517
 -0.1927       -0.1919       -0.6583
  2.9207        2.9227        1.5632
  1.1713        1.1710        1.1622
  0.1091        0.1093        0.0084
  1.5768        1.5762        1.6318
 -1.3968       -1.3958       -2.1131
  0.6966        0.6963        0.5630
  0.1990        0.1990       -0.2521
  0.4624        0.4624        0.2921
-12.6013      -12.6014      -12.2014
 -0.1328       -0.1328       -0.1359

I also computed errors as suggested by Justin Peel's answer. Output of normalEque gives slightly lesser squared error but the difference is marginal. What is more when I compute gradient of cost of theta using function cost (the same as the one used by gradientDesc) I got gradient near zero. Same done on output of gradientDesc does not give gradient near zero. Here is what I mean:

>> [J_gd, grad_gd] = cost(theta_gd, X, y, size(X, 1));
>> [J_ne, grad_ne] = cost(theta_ne, X, y, size(X, 1));
>> disp([J_gd, J_ne])
  120.9932  119.1469
>> disp([grad_gd, grad_ne])
  -0.005172856743846  -0.000000000908598
  -0.026126463200876  -0.000000135414602
  -0.008365136595272  -0.000000140327001
  -0.094516503056041  -0.000000169627717
  -0.028805977931093  -0.000000045136985
  -0.004761477661464  -0.000000005065103
  -0.007389474786628  -0.000000005010731
   0.065544198835505  -0.000000046847073
   0.044205371015018  -0.000000046169012
   0.089237705611538  -0.000000046081288
  -0.042549228192766  -0.000000051458654
   0.016339232547159  -0.000000037654965
  -0.043200042729041  -0.000000051748545
   0.013669010209370  -0.000000037399261
  -0.036586854750176  -0.000000027931617
  -0.004761447097231  -0.000000027168798
   0.017311225027280  -0.000000039099380
   0.005650124339593  -0.000000037005759
   0.016225097484138  -0.000000039060168
  -0.009176443862037  -0.000000012831350
   0.055653840638386  -0.000000020855391
  -0.002834810081935  -0.000000006540702
   0.002794661393905  -0.000000032878097

This would suggest that gradient descent simply did not converge to global minimum... But that is hardly the case as I run it for thousands of iterations. So where is the bug?

Community
  • 1
  • 1
Rasto
  • 17,204
  • 47
  • 154
  • 245
  • 1
    Use of pinv the way you have is a terrible thing to do! theta=pinv(X)*y is all you need. –  Jun 30 '12 at 11:50
  • @woodchips Thanks, that code is from older version with regularisation when it looked like this: `theta = pinv(Xones'*Xones + lambda.*eyen)*Xones'*y;`. I forgot to changed when I removed regularisation. – Rasto Jun 30 '12 at 12:47
  • @woodchips please check my edit tu my original question. There is also response to your comment. – Rasto Jun 30 '12 at 15:19

4 Answers4

7

I finally had time to get back to this. There is no "bug".

If the matrix is singular, then there are infinitely many solutions. You can choose any solution from that set, and get equally as good an answer. The pinv(X)*y solution is a good one that many like because it is the minimum norm solution.

There is NEVER a good reason to use inv(X)*y. Even worse, is to use inverse on the normal equations, thus inv(X'*X)*X'*y is simply numerical crap. I don't care who told you to use it, they are guiding you to the wrong place. (Yes, it will work acceptably for problems that are well-conditioned, but most of the time you don't know when it is about to give you crap. So why use it?)

The normal equations are in general a bad thing to do, EVEN if you are solving a regularized problem. There are ways to do that that avoid squaring the condition number of the system, although I won't explain them unless asked as this answer has gotten long enough.

X\y will also yield a result that is reasonable.

There is ABSOLUTELY no good reason to throw an unconstrained optimizer at the problem, as this will yield results that are unstable, completely dependent on your starting values.

As an example, I'll start with a singular problem.

X = repmat([1 2],5,1);
y = rand(5,1);

>> X\y
Warning: Rank deficient, rank = 1, tol =  2.220446e-15. 
ans =
                         0
         0.258777984694222

>> pinv(X)*y
ans =
         0.103511193877689
         0.207022387755377

pinv and backslash return slightly different solutions. As it turns out, there is a basic solution, to which we can add ANY amount of the nullspace vector for the row space of X.

null(X)
ans =
         0.894427190999916
        -0.447213595499958

pinv generates the minimum norm solution. Of all of the solutions that might have resulted, this one has minimum 2-norm.

In contrast, backslash generates a solution that will have one or more variables set to zero.

But if you use an unconstrained optimizer, it will generate a solution that is completely dependent on your starting values. Again, ANLY amount of that null vector can be added to your solution, and you still have an entirely valid solution, with the same value of the sum of squares of errors.

Note that even though no singularity waring is returned, this need not mean your matrix is not close to singular. You have changed little about the problem, so it is STILL close, just not enough to trigger the warning.

  • Thank you for your answer. It is helpful. But I have good reasons to use unconstrained optimizer. First, its results should be stable because cost function of linear regression (sum of squares of errors) always have global minimum . Therefore result is independent on starting location (minimizer should always find same global min). I the fact I implemented normal equation just as a test check if my gradient descent works correctly (does not seem to :( ). The reason why I cannot use normal equation method is that when I'm sure that current version of gradient descent is implemented correctly – Rasto Jun 30 '12 at 16:13
  • ... I'll change it to something more complicated that cannot be done using normal equation (or is extremely hard to do so). So I really need GD algorihm to be applied to the problem. I'll run some more tests with much larger data set - like 3 times larger then the one I used last time (the one that did not trigger warning). That should ensure my data are not close to singular. If understood your answer correctly I should then get same results from both GD and NE. – Rasto Jun 30 '12 at 16:16
  • 1
    NO. The sums of squares of global errors does NOT always have a unique solution. What do you think I have been saying? In fact, it is completely trivial to provide problems where the solution lies anywhere along a valley that has constant depth and extends to infinity in either direction. –  Jun 30 '12 at 16:52
  • Sorry, I should have written that it is convex... That is not quite the same as if it had global minimum just because those valleys of constant depth can exists... But well if I find any point at the bottom of the valley I cannot do any better so it is the same for me as reaching global optima. So you think this is the case and therefore I got different result from GD and NE? But the valley should not have constant depth if X is not absolutely singular, right? So after enough iterations GD should still find global minimum shouldn't it? – Rasto Jun 30 '12 at 18:09
  • I should also mention that normal equation `theta = pinv(X)*y;` gives `theta` very similar to one yield by `theta = X\y;`. Relative difference is within 1e-6. That may suggest that there really is local minimum right? – Rasto Jun 30 '12 at 18:12
  • So I've done some more tests. **It looks like you were right.** I have tested both GD and NE on larger dataset that is less likely to be close to singular. As the result I finally got very similar `theta` vectors from both algorithms (as there is clear global min when data are far from singular). Thank you for help. – Rasto Jun 30 '12 at 21:10
2

As others mentioned, an ill-conditioned hessian matrix is likely the cause of your problem.

The number of steps that standard gradient descent algorithms take to reach a local optimum is a function of the largest eigenvalue of the hessian divided by the smallest (this is known as the condition number of the Hessian). So, if your matrix is ill-conditioned, then it could take an extremely large number of iterations for gradient descent to converge to an optimum. (For the singular case, it could converge to many points, of course.)

I would suggest trying three different things to verify that an unconstrained optimization algorithm works for your problem (which it should): 1) Generate some synthetic data by computing the result of a known linear function for random inputs and adding a small amount of gaussian noise. Make sure that you have many more data points than dimensions. This should produce a non-singular hessian. 2) Add a regularization terms to your error function to increase the condition number of the hessian. 3) Use a second order method like conjugate gradient or L-BFGS rather than gradient descent to reduce the number of steps needed for the algorithm to converge. (You will probably need to do this in conjunction with #2).

user1149913
  • 4,463
  • 1
  • 23
  • 28
  • Thanks for your answer. I considered my question to be answered by @woodchips but your answer was very helpful for my understanding of math behind it. I was already using regularisation before but I did not provide code with regularisation here (I simplified my code before pasting it here) so that my question was as clear and understandable as possible. Can you please suggest some L-BFGS implementation for matlab? – Rasto Jul 12 '12 at 23:01
  • It looks like fminunc uses a second order method (bfgs). This should be enough to check if the convergence behavior is better than gradient descent. – user1149913 Jul 17 '12 at 13:04
1

Could you post a little more about what you X looks like? You're using pinv() which is Moore-Penrose pseudo inverse. If the matrix is ill-conditioned this could cause problems with obtaining the inverse. I would bet that the gradient-descent method is closer to the mark.

user1489497
  • 127
  • 9
  • 1
    HUH???? If the problem is near singular, then use of fminunc on the sums of squares will never be the choice you want. –  Jun 30 '12 at 11:50
  • @woodchips anyway it looks like the problem is near singular because when I replace `pinv` by `inv` in NE code and run it I got this warning: `Matrix is close to singular or badly scaled.`. Also see update on my question in few minutes. – Rasto Jun 30 '12 at 11:57
  • Please check Edit 1 to my original question. There is also response to your answer. – Rasto Jun 30 '12 at 15:20
0

You should see which method is actually giving you the smallest error. That will indicate which method is struggling. I suspect that the normal equation method is the troubled solution because if X is ill-conditioned then you can have some problems there.

You should probably replace your normal equation solution with theta = X\y which will use a QR-decomposition method to solve it.

Justin Peel
  • 46,722
  • 6
  • 58
  • 80