
For an NxP matrix x and an Nx1 vector y with N > P, the two expressions

x \ y                                       -- (1)

and

(x' * x) \ (x' * y)                         -- (2)

both compute the solution b to the matrix equation

x * b = y

in the least squares sense, i.e. so that the quantity

norm(y - x * b)

is minimized. Expression (2) does it using the classic algorithm for the solution of an ordinary least squares regression, where the left-hand argument to the \ operator is square. It is equivalent to writing

inv(x' * x) * (x' * y)                      -- (3)

but it uses an algorithm which is more numerically stable. It turns out that (3) is moderately faster than (2) even though (2) doesn't have to produce the inverse matrix as a byproduct, but I can accept that given the additional numerical stability.

However, some simple timings (with N=100,000 and P=30) show that expression (2) is more than 5 times faster than expression (1), even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of x, and in the case N > P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.

What is happening in expression (1) that is causing it to take so much longer?


Edit: Here are my timings

x = randn(1e5, 30);
y = randn(1e5,1);
tic, for i = 1:100; x\y; end; t1=toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2=toc;
assert( abs(norm(x\y) - norm((x'*x)\(x'*y))) < 1e-10 );
fprintf('Speedup: %.2f\n', t1/t2)

Speedup: 5.23

Daniel
Chris Taylor
  • Huh, on Matlab 2013a I get a speedup factor of 10. – rubenvb Apr 03 '14 at 09:09
  • Just tried it on R2013a and got 9.2. – Dan Apr 03 '14 at 09:16
  • http://stackoverflow.com/questions/18553210/ and http://scicomp.stackexchange.com/questions/1001/ may be of interest, in particular the `spparms('spumoni',1)` trick mentioned in the latter by rcompton. – nkjt Apr 03 '14 at 09:18

2 Answers


Are you aware of the fact that in your test

size(x) == [1e5  30]    but   size(x'*x) == [30  30]
size(y) == [1e5   1]    but   size(x'*y) == [30   1]

That means the matrices entering the mldivide function differ in size by more than three orders of magnitude! This makes any overhead of determining which algorithm to use rather large and significant (and perhaps you are also comparing the same algorithm run on two problems of very different size).

In other words, you have a biased test. To make a fair test, use something like

x = randn(1e3);
y = randn(1e3,1);
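
Then time both forms on this square system; a minimal harness along the lines of the one in the question (the 100-iteration count is my assumption) might be:

% Sketch: same timing loop as in the question, but both expressions now
% operate on matrices of the same (1000-by-1000) size.
tic, for i = 1:100; x\y; end; t1 = toc;
tic, for i = 1:100; (x'*x)\(x'*y); end; t2 = toc;
fprintf('Speedup: %.2f\n', t1/t2)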

I find (worst of 5 runs):

Speedup: 1.06  %// R2010a
Speedup: 1.16  %// R2010b
Speedup: 0.97  %// R2013a

...the difference has all but evaporated.

But, this does show very well that if you indeed have a regression problem with low dimensionality compared to the number of observations, it really pays off to do the multiplication first :)

mldivide is a catch-all, and it is really great at that. But often, knowledge about the problem makes more specific solutions, such as pre-multiplication, pre-conditioning, lu, qr, or linsolve, orders of magnitude faster.
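
For example (a sketch assuming a dense, full-rank, tall x, with illustrative variable names), the tall least-squares problem can be attacked directly with an economy-size QR factorization, or the normal equations can be handed to linsolve with their structure declared up front:

% Sketch: two "more specific" alternatives to plain backslash for the tall
% least-squares problem from the question.
x = randn(1e5, 30);
y = randn(1e5, 1);

% (a) economy-size QR of x: solve R*b = Q'*y by back-substitution
[Q, R] = qr(x, 0);
b_qr = R \ (Q' * y);

% (b) normal equations via linsolve, declaring the matrix symmetric positive
%     definite so no dispatch logic is needed at solve time
opts.SYM = true;
opts.POSDEF = true;
b_ls = linsolve(x' * x, x' * y, opts);

fprintf('max difference between the two solutions: %g\n', max(abs(b_qr - b_ls)));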

Rody Oldenhuis
  • Rody, that is exactly my point! When you enter `mldivide(x,y)` with more observations than features, it is *mathematically equivalent* to `mldivide(x'*x,x'*y)` so I am wondering why MATLAB doesn't detect that case and do the pre-multiplication itself (since the second one is 5-10x faster even when you take the time required for the multiplications into account). – Chris Taylor Apr 03 '14 at 12:23
  • @ChrisTaylor: it is not *quite* the same... Try both versions on this: `x=[1 1; 0 3*eps; 3*eps 0]; y=[2; 3*eps; 3*eps];` – Rody Oldenhuis Apr 03 '14 at 12:38
  • @ChrisTaylor: also, "mathematically equivalent" has often little to do with how computers work :) Try this classic: `1.1-0.2 == 0.9` – Rody Oldenhuis Apr 03 '14 at 12:43
  • Ok, I take your point - you can lose precision if you do the pre-multiplication, resulting in `x'*x` being singular. Calling `x\y` directly presumably uses a more stable but slower algorithm, so it gets the right answer. I think that's the answer I'm looking for (though I'm still unsure why you can't check whether a pre-multiplication would be valid, and perform it if so). – Chris Taylor Apr 03 '14 at 13:29
  • @ChrisTaylor: well, that's the thing with closed source; you'll never be *entirely* sure of implementation details, design decisions, etc. I'm pretty sure they have good reasons not to do what you suggest. The multiplication you suggest quite likely involves multiplications of numbers of very different magnitudes. This is very prone to [loss of significance](http://en.wikipedia.org/wiki/Loss_of_significance); reason enough for me to decide *against* using it in the catch-all. I agree, probably some optimization could be made there, but then again, *simplicity* is also a virtue... – Rody Oldenhuis Apr 03 '14 at 17:06
  • Rody, I feel like your second comment answers the question perfectly - any chance you want to edit to include it? – Chris Taylor Apr 05 '14 at 14:37

even though (1) has greater flexibility to choose the algorithm used! For example, any call to (1) could just dispatch on the size of x, and in the case N > P it could reduce to (2), which would add a tiny amount of overhead, but certainly wouldn't take 5 times longer.

This is not the case. It could take a lot of overhead to choose which algorithm to use, particularly when compared to the computation on relatively small inputs such as these. In this case, because MATLAB can see that you have x'*x, it knows that one of the arguments must be both square and symmetric (yes - that knowledge of linear algebra is built into MATLAB even at a parser level), and can straight away call one of the appropriate code paths within \.
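
Roughly speaking (a sketch of my reading of mldivide's documented dense dispatch, not of its actual internals), the path it can jump to for a symmetric positive definite left-hand side is a Cholesky factorization followed by two triangular solves:

% Sketch: solving the normal equations "by hand" via Cholesky, using x and y
% as defined in the question's timing code (assumes x'*x is positive definite,
% i.e. x has full column rank).
A = x' * x;           % 30x30, symmetric positive definite
c = x' * y;           % 30x1
R = chol(A);          % A = R'*R, with R upper triangular
b = R \ (R' \ c);     % two cheap triangular solves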

I can't say whether this fully explains the timing differences you're seeing. I would want to investigate further, at least by:

  1. Making sure to put the code within a function, and warming the function up to ensure that the JIT is engaged - and then trying the same thing with feature('accel', 'off') to remove the effect of the JIT
  2. Trying this on a much bigger range of input sizes to check what contribution an 'algorithm choice overhead' made compared to computation time (a sketch of such an experiment follows this list).
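
A sketch of such an experiment (the function wrapper, the size range, and the repetition counts are my assumptions, not anything stated above):

function compare_mldivide_timings()
% Sketch: time x\y against (x'*x)\(x'*y) over a range of row counts, from
% inside a function so the JIT can do its work. Optionally rerun after
% feature('accel', 'off') to gauge the contribution of the JIT/acceleration.
for N = [1e3 1e4 1e5]
    x = randn(N, 30);
    y = randn(N, 1);
    tic, for i = 1:20; x\y; end; t1 = toc;
    tic, for i = 1:20; (x'*x)\(x'*y); end; t2 = toc;
    fprintf('N = %6d: x\\y %.3fs, normal equations %.3fs, ratio %.2f\n', ...
            N, t1, t2, t1/t2);
end
end
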
Sam Roberts
  • Interesting that the knowledge that `x'*x` is square and symmetric is built into the parser! I'll do some more experiments later using more robust timings and using larger arguments. – Chris Taylor Apr 03 '14 at 10:41
  • @ChrisTaylor I should qualify my answer a bit. Perhaps I should have said "...is built into MATLAB at something like the parser level, although I'm not sure quite where, maybe it's the JIT rather than the parser, but certainly higher than \". For that reason, I would include in your experiments the idea of turning the JIT off and on. Hope that helps. – Sam Roberts Apr 03 '14 at 10:47