
I have a key algorithm whose runtime is dominated by calculating a dense matrix product:

A*A'*Y, where: A is an m-by-n matrix, 
               A' is its conjugate transpose,
               Y is an m-by-k matrix

Typical characteristics:
    - k is much smaller than both m and n (k is typically < 10)
    - m in the range [500, 2000]
    - n in the range [100, 1000]

Based on these dimensions, and following the lessons of the matrix chain multiplication problem, it's clear that it's optimal in a number-of-operations sense to structure the computation as A*(A'*Y). My current implementation does that, and the performance boost from simply forcing that associativity on the expression is noticeable.
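
For reference, this is roughly what that looks like (a sketch only; the single-precision complex scalar type is an assumption standing in for my actual element type):

    #include <Eigen/Dense>

    Eigen::MatrixXcf applyAAt(const Eigen::MatrixXcf& A,   // m-by-n
                              const Eigen::MatrixXcf& Y)   // m-by-k
    {
        // The parentheses force evaluation of the n-by-k partial product
        // first, so the cost is O(m*n*k) rather than the O(m*m*n) of (A*A')*Y.
        return A * (A.adjoint() * Y);
    }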

My application is written in C++ for the x86_64 platform. I'm using the Eigen linear algebra library, with Intel's Math Kernel Library as a backend. Eigen is able to use MKL's BLAS interface to perform the multiplication, and the boost from moving from Eigen's native SSE2 implementation to Intel's optimized, AVX-based implementation on my Sandy Bridge machine is also significant.

However, the expression A * (A.adjoint() * Y) (written in Eigen parlance) gets decomposed into two general matrix-matrix products (calls to the xGEMM BLAS routine), with a temporary matrix created in between. I'm wondering if, by going to a specialized implementation for evaluating the entire expression at once, I can arrive at an implementation that is faster than the generic one I have now. A couple of observations that lead me to believe this are:

  • Using the typical dimensions described above, the input matrix A usually won't fit in cache. Therefore, the specific memory access pattern used to calculate the three-matrix product would be key. Obviously, avoiding the creation of a temporary matrix for the partial product would also be advantageous.

  • A and its conjugate transpose obviously have a very related structure that could possibly be exploited to improve the memory access pattern for the overall expression.

Are there any standard techniques for implementing this sort of expression in a cache-friendly way? Most optimization techniques that I've found for matrix multiplication are for the standard A*B case, not larger expressions. I'm comfortable with the micro-optimization aspects of the problem, such as translating into the appropriate SIMD instruction sets, but I'm looking for any references out there for breaking this structure down in the most memory-friendly manner possible.

Edit: Based on the responses that have come in thus far, I think I was a bit unclear above. The fact that I'm using C++/Eigen is really just an implementation detail from my perspective on this problem. Eigen does a great job of implementing expression templates, but evaluating this type of problem as a simple expression just isn't supported (only products of 2 general dense matrices are).

At a higher level than how the expressions would be evaluated by a compiler, I'm looking for a more efficient mathematical breakdown of the composite multiplication operation, with an eye toward eliminating redundant memory accesses enabled by the common structure of A and its conjugate transpose. The result would likely be difficult to implement efficiently in pure Eigen, so I would probably implement it in a specialized routine with SIMD intrinsics.
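
For concreteness, the general shape I'm imagining is a panel-blocked evaluation that reads each block of A once for both partial products. This is only a sketch; the panel width B and the complex-float scalar type are placeholder assumptions:

    #include <algorithm>
    #include <Eigen/Dense>

    Eigen::MatrixXcf fusedAAtY(const Eigen::MatrixXcf& A,  // m-by-n
                               const Eigen::MatrixXcf& Y)  // m-by-k
    {
        const Eigen::Index B = 64;  // panel width, to be tuned
        Eigen::MatrixXcf R = Eigen::MatrixXcf::Zero(A.rows(), Y.cols());
        for (Eigen::Index j = 0; j < A.cols(); j += B) {
            const Eigen::Index w = std::min(B, A.cols() - j);
            // Splitting A = [A_1 ... A_P] into column panels gives
            // A*(A'*Y) = sum_p A_p * (A_p' * Y), so each panel is touched
            // for both small products while it may still be in cache.
            auto Ap = A.middleCols(j, w);                   // m-by-w panel
            R.noalias() += Ap * (Ap.adjoint() * Y).eval();  // w-by-k temporary
        }
        return R;
    }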

Jason R
  • The temporary matrix you mention - could that have to do with this? http://eigen.tuxfamily.org/dox-devel/group__TutorialMatrixArithmetic.html#TutorialArithmeticMatrixMul. It's interesting that you have A A' Y by the way, instead of the more common A Y A'. In addition, I don't know how you construct A, but maybe you can construct A A' as you go along (not sure if there is an efficient way). – CompuChip Feb 27 '14 at 15:47
  • @CompuChip: No, I think it has more to do with the fact that when evaluating the matrix product, it can only calculate the product of two matrices at a time. Thus, you need a temporary in order to multiply by the third matrix. Note that due to the dimensions of `A`, constructing `A*A'` and then multiplying by it requires many more operations (hence the link to the matrix chain multiplication article), so it's much better to insert the parentheses as shown. – Jason R Feb 27 '14 at 15:49
  • I presume you've looked at the number of threads used by Eigen / disabling hyper-threading, as in the accepted answer here : http://stackoverflow.com/questions/14783219/how-to-speed-up-eigen-librarys-matrix-product ? – Graham Griffiths Feb 27 '14 at 15:52
  • @GrahamGriffiths: Thanks for the idea. I'm currently running in single-threaded mode for reasons outside the scope of this particular question; I'm really looking just at algorithmic optimization. – Jason R Feb 27 '14 at 15:53
  • The standard solution is expression templates, but Eigen apparently [already implements](http://eigen.tuxfamily.org/dox/TopicLazyEvaluation.html) them. Be sure you're doing whatever is required to take advantage of that. – bames53 Feb 27 '14 at 16:16
  • so I think Y fits in the cache, and also the partial product (A.adjoint() * Y) fits in the cache. Is there a matrix mult optimization where just one of your matrices fits wholly in the cache? – Graham Griffiths Feb 27 '14 at 16:46
  • @GrahamGriffiths There are [cache oblivious](http://en.wikipedia.org/wiki/Cache-oblivious_algorithm) algorithms for matrix multiplication. – bames53 Feb 27 '14 at 17:10
  • How often do matrices A and Y change? If A changes seldom, could you do the product A*A' outside the loop? – Mike Dunlavey Feb 27 '14 at 17:18
  • @MikeDunlavey: `A` changes more often than `Y` does. However, see my comment below on ggael's answer as to why forming the product `A*A'` is a bad idea. – Jason R Feb 27 '14 at 18:59
  • is your matrix A dense or sparse? – sebas Feb 27 '14 at 19:21
  • @sebas: Both `A` and `Y` are dense. I updated the OP to reflect this. – Jason R Feb 27 '14 at 19:31
  • How do you know it's not optimized? Have you calculated the FLOPS and compared to the peak FLOPS? The number of flops should be `2.0*m*n*m + 2.0*m*m*k` I think. Note that it's going to be a huge investment of time to try to optimize this better than MKL, so it's only worth doing for educational purposes. – Z boson Feb 28 '14 at 08:51
  • Note also that for large dense matrix multiplication, if the algorithm is optimized (as it is in MKL), the cache does not matter: it's computation bound not memory bound. – Z boson Feb 28 '14 at 09:02
  • Ah, I just read the matrix chain link. The flops for (A*A')*Y is 2.0*m*n*m + 2.0*m*m*k, but for A*(A'*Y) it's 2.0*(m*n*k) + 2.0*(n*m*k). No wonder it's faster. – Z boson Feb 28 '14 at 09:08
  • Migrate to Scientific Computation. This question involves Intel MKL and matrix-matrix multiplication; the experts are not here (programmers will suggest writing your own thing, which is generally the wrong answer for numerical computation). – Phil H Feb 28 '14 at 09:36

4 Answers


This is not a full answer (yet - and I'm not sure it will become one).

Let's think about the math a little first. Since matrix multiplication is associative, we can do either (A*A')*Y or A*(A'*Y).

Floating point operations for (A*A')*Y

2*m*n*m + 2*m*m*k  // the twos come from addition and multiplication

Floating point operations for A*(A'*Y)

2*m*n*k + 2*m*n*k = 4*m*n*k

Since k is much smaller than m and n it's clear why the second case is much faster.

But by symmetry we could in principle reduce the number of calculations for A*A' by a factor of two (though this might not be easy to do with SIMD), so we could reduce the number of floating point operations of (A*A')*Y to

m*n*m + 2*m*m*k.

We know that both m and n are larger than k. Let's replace m and n with a single variable z and find out where case one and case two are equal:

z*z*z + 2*z*z*k = 4*z*z*k  // now simplify
z = 2*k.

So as long as m and n are both more than twice k, the second case will have fewer floating point operations. In your case m and n are both at least 100 and k is less than 10, so case two uses far fewer floating point operations (for m=1000, n=100, k=4, case one costs about 2*10^8 flops while case two costs only 1.6*10^6, more than a hundred times fewer).

In terms of efficient code: if the code is optimized for efficient use of the cache (as MKL and Eigen are), then large dense matrix multiplication is computation bound, not memory bound, so you don't have to worry about the cache. MKL is faster than Eigen since MKL uses AVX (and maybe FMA3 now?).

I don't think you will be able to do this more efficiently than you're already doing using the second case and MKL (through Eigen). Enable OpenMP to get maximum FLOPS.
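
A minimal sketch of pinning the thread count when Eigen dispatches to MKL's BLAS (the value 4 is just an example; you can equivalently set OMP_NUM_THREADS in the environment):

    #include <mkl.h>

    int main()
    {
        mkl_set_num_threads(4);  // MKL's threading layer controls the xGEMM calls
        // ... build the matrices and run the benchmark here ...
        return 0;
    }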

You should calculate the efficiency by comparing your measured FLOPS to the peak FLOPS of your processor. Assuming you have a Sandy Bridge/Ivy Bridge processor, the peak single-precision (SP) FLOPS is

frequency * number of physical cores * 8 (8-wide AVX SP) * 2 (addition + multiplication)

For double precision, divide by two. If you have Haswell and MKL uses FMA, then double the peak FLOPS. To get the frequency right, you have to use the turbo boost value for all cores (it's lower than for a single core). You can look this up if you have not overclocked your system, or use CPU-Z on Windows or Powertop on Linux if you have.
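
A sketch of that efficiency check as code (the clock speed and core count are machine-specific inputs you have to supply yourself; the flop count assumes real single-precision arithmetic):

    #include <cstdio>

    double efficiency(double m, double n, double k, double seconds,
                      double allCoreTurboGHz, int physicalCores)
    {
        double flops = 4.0 * m * n * k;                 // A*(A'*Y), case two
        double peak  = allCoreTurboGHz * 1e9 * physicalCores
                       * 8.0                            // 8-wide AVX SP
                       * 2.0;                           // addition + multiplication
        double rate  = flops / seconds;
        std::printf("%.2f GFLOPS = %.1f%% of peak\n",
                    rate / 1e9, 100.0 * rate / peak);
        return rate / peak;
    }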

Z boson
  • As far as I can tell MKL supports FMA3 now for Haswell http://software.intel.com/en-us/articles/whats-new-in-intel-mkl. MKL GEMM is probably a lot faster on Haswell then. – Z boson Feb 28 '14 at 10:43
  • Thanks for the observations. Shame on me for prematurely optimizing without doing an isolated benchmark and comparison against the theoretical numbers. For `m=1000, n=100, k=4`, I'm seeing around 48-50 GFLOPS on my Sandy Bridge machine @ 3.4 GHz, so that's pretty close to the theoretical maximum. Using 4 threads, I can get about a 2.7x scaling as well. I think you're right; I don't know that there's anything to be gained here unless there were a way to significantly reduce the number of overall ops. – Jason R Feb 28 '14 at 18:36
  • @JasonR, what formula did you use to calculate the flops? For case two FLOPS = 4*m*n*k/time. So for m=1000, n=100, k=4 that's 1.6*10^6/time -> time is 32µs. – Z boson Mar 03 '14 at 12:00
  • I used that formula. I had to tweak it a bit because I'm using complex arithmetic (1 complex multiplication = 4 real multiplications and 2 real additions, and 1 complex addition = 2 real additions). – Jason R Mar 03 '14 at 12:06
  • Technically it's not exactly a factor of two. The dot product is n multiplications and n-1 additions. For a square matrix it's then n*n*(n + n-1) flops = 2*n*n*n - n*n. Most people throw out the second term for large matrices. – Z boson Mar 03 '14 at 12:13
  • But for BLAS GEMM, I think you would consider each dot product to be `n` multiplications and `n` additions, since you're forming the result `C + A*B`. In my case `C` would be initialized with zeros, but those additions still would be executed. – Jason R Mar 03 '14 at 12:36
  • Oh, good point! A*B has 2*n*n*n - n*n flops but A*B + C is 2*n*n*n flops. I just learned something. Thanks! – Z boson Mar 03 '14 at 12:41

Use a temporary matrix to compute `A'*Y`, but make sure you tell Eigen that there's no aliasing going on: `temp.noalias() = A.adjoint()*Y`. Then compute your result, once again telling Eigen that objects aren't aliased: `result.noalias() = A*temp`.
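
A compilable sketch of that suggestion (the complex-float scalar type is an assumption):

    #include <Eigen/Dense>

    Eigen::MatrixXcf product(const Eigen::MatrixXcf& A,  // m-by-n
                             const Eigen::MatrixXcf& Y)  // m-by-k
    {
        Eigen::MatrixXcf temp(A.cols(), Y.cols());
        temp.noalias() = A.adjoint() * Y;  // first xGEMM, no aliasing check
        Eigen::MatrixXcf result(A.rows(), Y.cols());
        result.noalias() = A * temp;       // second xGEMM
        return result;
    }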

David Hammen
  • Thanks for the suggestion. I don't believe the construction of the temporary itself is the critical path. Adjusting the expression to use explicit temporaries that are specified to be non-aliased yields a speedup of a couple percent or so. I'm instead after a better scheme of evaluating the overall expression that would have more favorable memory access patterns on the matrices involved. – Jason R Feb 27 '14 at 16:05

There would be redundant computation only if you performed (A*A')*Y, since in this case (A*A') is symmetric and only half of the computations are required. However, as you noticed, it is still much faster to perform A*(A'*Y), in which case there are no redundant computations. I confirm that the cost of the temporary creation is completely negligible here.
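
For illustration, Eigen can exploit that symmetry through a self-adjoint rank update (xSYRK/xHERK underneath). A sketch only; the complex-float scalar type is an assumption, and at these dimensions this parenthesization remains the slower one:

    #include <Eigen/Dense>

    Eigen::MatrixXcf viaRankUpdate(const Eigen::MatrixXcf& A,
                                   const Eigen::MatrixXcf& Y)
    {
        Eigen::MatrixXcf AAt = Eigen::MatrixXcf::Zero(A.rows(), A.rows());
        AAt.selfadjointView<Eigen::Lower>().rankUpdate(A);  // AAt += A*A', half the flops
        return AAt.selfadjointView<Eigen::Lower>() * Y;     // self-adjoint product
    }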

ggael
  • I guess I am not understanding this but how will `(A'*Y)` be faster? – John Odom Feb 27 '14 at 18:08
  • @JohnOdom: See [this page](http://www.personal.kent.edu/~rmuhamma/Algorithms/MyAlgorithms/Dynamic/chainMatrixMult.htm) for a description of the differences in total number of operations that can occur just by changing the associativity of a matrix multiplication expression. In this case, since `k` (the number of columns of `Y`) is much smaller than either dimension of `A`, it behooves you to do that multiplication first, as the intermediate result will also only have `k` columns. If you do `A*A'` first, forming that product requires many more operations. – Jason R Feb 27 '14 at 18:33

I guess that performing the following

result = A * (A.adjoint() * Y);

will be the same as doing this:

temp = A.adjoint() * Y;
result = A * temp;

If your matrix Y fits in the cache, you can probably take advantage of doing it like this:

result = A * (Y.adjoint() * A).adjoint();

or, if the previous notation is not allowed, like this:

temp = Y.adjoint() * A;
result = A * temp.adjoint();

Then you don't need to take the adjoint of matrix A or store a temporary adjoint matrix for A, which would be much more expensive than the one for Y.

If your matrix Y fits in the cache, it should be much faster to run a loop over the columns of A for the first multiplication and then over the rows of A for the second multiplication (keeping Y.adjoint() in the cache for the first multiplication and temp.adjoint() for the second), but I guess that internally Eigen is already taking care of those things.
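
A compilable sketch of that variant (the complex-float scalar type and the noalias() calls are assumptions):

    #include <Eigen/Dense>

    Eigen::MatrixXcf viaSmallTemp(const Eigen::MatrixXcf& A,  // m-by-n
                                  const Eigen::MatrixXcf& Y)  // m-by-k
    {
        Eigen::MatrixXcf temp(Y.cols(), A.cols());  // k-by-n, small enough to cache
        temp.noalias() = Y.adjoint() * A;
        Eigen::MatrixXcf result(A.rows(), Y.cols());
        result.noalias() = A * temp.adjoint();      // m-by-k
        return result;
    }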

sebas
  • Thanks for the suggestion. I tried that variation (computing `A*(Y'*A)'` instead) and saw no difference in performance. I do know that Eigen is smart enough to not explicitly calculate the adjoint of `A`. The return value of `A.adjoint()` is an expression object that just wraps `A` with the storage order reversed (from column-major to row-major). So, it doesn't actually make a copy of `A` or anything. – Jason R Feb 27 '14 at 19:10