
For an NxP matrix x and an Nx1 vector y with N > P, the two expressions

x \ y                                       -- (1)

and

(x' * x) \ (x' * y)                         -- (2)

both compute the solution b to the matrix equation

x * b = y

in the least squares sense, i.e. so that the quantity

norm(y - x * b)

is minimized. Expression (2) does it using the classic algorithm for the solution of an ordinary least squares regression, where the left-hand argument to the \ operator is square. It is equivalent to writing

inv(x' * x) * (x' * y)                      -- (3)

but it uses an algorithm which is more numerically stable. It turns out that (3) is moderately faster than (2) even though (2) doesn't have to produce the inverse matrix as a byproduct, but I can accept that given the additional numerical stability.

However, some simple timings (with N=100,000 and P=30) show that expression (2) is more than 5 times faster than expression (1), even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of x, and in the case N > P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.

What is happening in expression (1) that is causing it to take so much longer?


Edit: Here are my timings

x = randn(1e5, 30);
y = randn(1e5,1);
tic, for i = 1:100; x\y; end; t1=toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2=toc;
assert( abs(norm(x\y) - norm((x'*x)\(x'*y))) < 1e-10 );
fprintf('Speedup: %.2f\n', t1/t2)

Speedup: 5.23

Daniel
Chris Taylor
  • Huh, on Matlab 2013a I get a speedup factor of 10. – rubenvb Apr 03 '14 at 09:09
  • Just tried it on R2013a and got 9.2. – Dan Apr 03 '14 at 09:16
  • http://stackoverflow.com/questions/18553210/ and http://scicomp.stackexchange.com/questions/1001/ may be of interest, in particular the `spparms('spumoni',1)` trick mentioned in the latter by rcompton. – nkjt Apr 03 '14 at 09:18

2 Answers


Are you aware of the fact that in your test

size(x) == [1e5  30]    but   size(x'*x) == [30  30]
size(y) == [1e5   1]    but   size(x'*y) == [30   1]

That means the matrices entering the mldivide function differ in size by more than three orders of magnitude! This makes any overhead of determining which algorithm to use rather large and significant (and perhaps you are also comparing the same algorithm run on two problems of very different size).

In other words, you have a biased test. To make a fair test, use something like

x = randn(1e3);
y = randn(1e3,1);
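
Then time both forms on this square system; a minimal harness along the lines of the one in the question (the 100-iteration count is my assumption) might be:

% Sketch: same timing loop as in the question, but both expressions now
% operate on matrices of the same (1000-by-1000) size.
tic, for i = 1:100; x\y; end; t1 = toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2 = toc;
fprintf('Speedup: %.2f\n', t1/t2)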

I find (worst of 5 runs):

Speedup: 1.06  %// R2010a
Speedup: 1.16  %// R2010b
Speedup: 0.97  %// R2013a

...the difference has all but evaporated.

But, this does show very well that if you indeed have a regression problem with low dimensionality compared to the number of observations, it really pays off to do the multiplication first :)

mldivide is a catch-all, and it is really great at that. But often, knowledge about the problem makes more specific solutions, such as pre-multiplication, pre-conditioning, lu, qr, or linsolve, orders of magnitude faster.
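
For example (a sketch assuming a dense, full-rank, tall x, with illustrative variable names), the tall least-squares problem can be attacked directly with an economy-size QR factorization, or the normal equations can be handed to linsolve with their structure declared up front:

% Sketch: two "more specific" alternatives to plain backslash for the tall
% least-squares problem from the question.
x = randn(1e5, 30);
y = randn(1e5, 1);

% (a) economy-size QR of x: solve R*b = Q'*y by back-substitution
[Q, R] = qr(x, 0);
b_qr = R \ (Q' * y);

% (b) normal equations via linsolve, declaring the matrix symmetric positive
%     definite so no dispatch logic is needed at solve time
opts.SYM = true;
opts.POSDEF = true;
b_ls = linsolve(x' * x, x' * y, opts);

fprintf('max difference between the two solutions: %g\n', max(abs(b_qr - b_ls)));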

Rody Oldenhuis
  • Rody, that is exactly my point! When you enter `mldivide(x,y)` with more observations than features, it is *mathematically equivalent* to `mldivide(x'*x,x'*y)` so I am wondering why MATLAB doesn't detect that case and do the pre-multiplication itself (since the second one is 5-10x faster even when you take the time required for the multiplications into account). – Chris Taylor Apr 03 '14 at 12:23
  • @ChrisTaylor: it is not *quite* the same... Try both versions on this: `x=[1 1; 0 3*eps; 3*eps 0]; y=[2; 3*eps; 3*eps];` – Rody Oldenhuis Apr 03 '14 at 12:38
  • @ChrisTaylor: also, "mathematically equivalent" has often little to do with how computers work :) Try this classic: `1.1-0.2 == 0.9` – Rody Oldenhuis Apr 03 '14 at 12:43
  • Ok, I take your point - you can lose precision if you do the pre-multiplication, resulting in `x'*x` being singular. Calling `x\y` directly presumably uses a more stable but slower algorithm, so it gets the right answer. I think that's the answer I'm looking for (though I'm still unsure why you can't check whether a pre-multiplication would be valid, and perform it if so). – Chris Taylor Apr 03 '14 at 13:29
  • @ChrisTaylor: well, that's the thing with closed source; you'll never be *entirely* sure of implementation details, design decisions, etc. I'm pretty sure they have good reasons not to do what you suggest. The multiplication you suggest quite likely involves multiplications of numbers of very different magnitudes. This is very prone to [loss of significance](http://en.wikipedia.org/wiki/Loss_of_significance); reason enough for me to decide *against* using it in the catch-all. I agree, probably some optimization could be made there, but then again, *simplicity* is also a virtue... – Rody Oldenhuis Apr 03 '14 at 17:06
  • Rody, I feel like your second comment answers the question perfectly - any chance you want to edit to include it? – Chris Taylor Apr 05 '14 at 14:37

even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of x, and in the case N > P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.

This is not the case. It could take a lot of overhead to choose which algorithm to use, particularly when compared to the computation on relatively small inputs such as these. In this case, because MATLAB can see that you have x'*x, it knows that one of the arguments must be both square and symmetric (yes - that knowledge of linear algebra is built into MATLAB even at a parser level), and can straight away call one of the appropriate code paths within \.
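
Roughly speaking (a sketch of my reading of mldivide's documented dense dispatch, not of its actual internals), the path it can jump to for a symmetric positive definite left-hand side is a Cholesky factorization followed by two triangular solves:

% Sketch: solving the normal equations "by hand" via Cholesky, using x and y
% as defined in the question's timing code (assumes x'*x is positive definite,
% i.e. x has full column rank).
A = x' * x;           % 30x30, symmetric positive definite
c = x' * y;           % 30x1
R = chol(A);          % A = R'*R, with R upper triangular
b = R \ (R' \ c);     % two cheap triangular solves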

I can't say whether this fully explains the timing differences you're seeing. I would want to investigate further, at least by:

  1. Making sure to put the code within a function, and warming the function up to ensure that the JIT is engaged - and then trying the same thing with feature('accel', 'off') to remove the effect of the JIT
  2. Trying this on a much bigger range of input sizes to check what contribution an 'algorithm choice overhead' made compared to computation time (a sketch of such an experiment follows this list).
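
A sketch of such an experiment (the function wrapper, the size range, and the repetition counts are my assumptions, not anything stated above):

function compare_mldivide_timings()
% Sketch: time x\y against (x'*x)\(x'*y) over a range of row counts, from
% inside a function so the JIT can do its work. Optionally rerun after
% feature('accel', 'off') to gauge the contribution of the JIT/acceleration.
for N = [1e3 1e4 1e5]
    x = randn(N, 30);
    y = randn(N, 1);
    tic, for i = 1:20; x\y; end; t1 = toc;
    tic, for i = 1:20; (x'*x)\(x'*y); end; t2 = toc;
    fprintf('N = %6d: x\\y %.3fs, normal equations %.3fs, ratio %.2f\n', ...
            N, t1, t2, t1/t2);
end
end
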
Sam Roberts
  • Interesting that the knowledge that `x'*x` is square and symmetric is built into the parser! I'll do some more experiments later using more robust timings and using larger arguments. – Chris Taylor Apr 03 '14 at 10:41
  • @ChrisTaylor I should qualify my answer a bit. Perhaps I should have said "...is built into MATLAB at something like the parser level, although I'm not sure quite where, maybe it's the JIT rather than the parser, but certainly higher than \". For that reason, I would include in your experiments the idea of turning the JIT off and on. Hope that helps. – Sam Roberts Apr 03 '14 at 10:47