3

Given any two matrices `a` and `b` (which don't have special properties), do we have a better way of computing the multiplication than this?

for(i=0; i<r1; ++i)
for(j=0; j<c2; ++j)
for(k=0; k<c1; ++k)
{
    mult[i][j]+=a[i][k]*b[k][j];
}
krish
  • 33
  • 6

5 Answers

7

If you are asking whether such algorithms exist in theory, then yes. For example, the Strassen algorithm (see https://en.wikipedia.org/wiki/Strassen_algorithm). And it is not even the fastest we know. As far as I know, the best for now is the Coppersmith–Winograd algorithm (see https://en.wikipedia.org/wiki/Coppersmith%E2%80%93Winograd_algorithm), which runs in something like O(n^{2.37}) (Strassen's time complexity is something like O(n^{2.8})).

But in practice they are much harder to implement than the one you wrote, and they also have a pretty large constant factor hidden inside the O(), so the O(n^3) algorithm you wrote is actually better for small values of n and much easier to implement.

Also there is Strassen's conjecture, which claims that for every eps > 0 there is an algorithm which multiplies two matrices with time complexity O(n^{2 + eps}). But as you might have noticed, it is just a conjecture for now.
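For reference (this is not spelled out above, it is just the standard formulation), Strassen's trick on a 2x2 block partition of A and B is to compute only seven block products instead of eight:

    M1 = (A11 + A22) * (B11 + B22)
    M2 = (A21 + A22) * B11
    M3 = A11 * (B12 - B22)
    M4 = A22 * (B21 - B11)
    M5 = (A11 + A12) * B22
    M6 = (A21 - A11) * (B11 + B12)
    M7 = (A12 - A22) * (B21 + B22)

    C11 = M1 + M4 - M5 + M7
    C12 = M3 + M5
    C21 = M2 + M4
    C22 = M1 - M2 + M3 + M6

Applied recursively this gives the recurrence T(n) = 7*T(n/2) + O(n^2), i.e. T(n) = O(n^{log2 7}) ≈ O(n^{2.81}).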

PepeHands
  • 1,368
  • 5
  • 20
  • 36
  • 1
    Note that Strassen's algorithm uses matrix addition as a subroutine, which is relatively easy to parallelize even on one processor using SIMD. Strassen's algorithm, if implemented well, can perform better than the naive method for `n=16` or `n=32`, which I consider a relatively small size. – Codor Apr 07 '16 at 20:27
  • 1
    @Codor, do you have a source for that? I have seen some hybrid methods which use Strassen for large `n>1000`, but `n=32` or smaller is very small and I doubt Strassen helps there. – Z boson Apr 08 '16 at 06:55
  • Perhaps you are right; [this paper](http://www.ijcst.com/vol31/4/juby.pdf) seems to conclude that Strassen's algorithm is competitive for `n=512`, see _Table 1_. However, apparently neither SIMD for the additions nor improved memory locality via the so-called [Morton layout](https://en.wikipedia.org/wiki/Z-order_curve) is used. – Codor Apr 08 '16 at 09:11
  • Please check these two papers if you are interested in high-performance implementations of Strassen's algorithm (SIMD for addition and a Morton layout for multi-level Strassen are used): 1. "[Strassen's Algorithm Reloaded](http://jianyuhuang.com/papers/sc16.pdf)", in SC16. 2. "[Generating Families of Practical Fast Matrix Multiplication Algorithms](http://jianyuhuang.com/papers/ipdps17.pdf)", in IPDPS17. – Jianyu Huang Mar 10 '19 at 23:12
3

As a very easy improvement you can transpose the second matrix before the multiplication, so your code gets far fewer processor cache misses. The complexity stays the same, but it may improve the constant factor a bit.
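A minimal sketch of that idea (the function signature and use of std::vector are my own assumptions; the dimension names r1, c1, c2 follow the question):

    #include <cstddef>
    #include <vector>

    // Multiply a (r1 x c1) by b (c1 x c2) after transposing b,
    // so that both inner accesses scan memory sequentially.
    std::vector<std::vector<double>> multiply_transposed(
        const std::vector<std::vector<double>>& a,
        const std::vector<std::vector<double>>& b)
    {
        const std::size_t r1 = a.size(), c1 = b.size(), c2 = b[0].size();

        // Transpose b once: bt[j][k] == b[k][j].
        std::vector<std::vector<double>> bt(c2, std::vector<double>(c1));
        for (std::size_t k = 0; k < c1; ++k)
            for (std::size_t j = 0; j < c2; ++j)
                bt[j][k] = b[k][j];

        std::vector<std::vector<double>> mult(r1, std::vector<double>(c2, 0.0));
        for (std::size_t i = 0; i < r1; ++i)
            for (std::size_t j = 0; j < c2; ++j) {
                double dot = 0.0;
                for (std::size_t k = 0; k < c1; ++k)
                    dot += a[i][k] * bt[j][k];   // two sequential row scans
                mult[i][j] = dot;
            }
        return mult;
    }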

Victor
  • 418
  • 2
  • 10
2

These are problems that many bright souls in this world have solved before you. Do not torture yourself: use BLAS ?GEMM (the ? stands for the precision, e.g. SGEMM for single and DGEMM for double precision).

http://www.netlib.org/blas/#_level_3
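For example, a DGEMM call through the CBLAS interface looks roughly like this (assuming a CBLAS implementation such as OpenBLAS, ATLAS or MKL is installed and linked, e.g. with -lopenblas):

    #include <cblas.h>
    #include <iostream>

    int main()
    {
        // Computes C = alpha * A * B + beta * C, all matrices row-major.
        // A is 2x3, B is 3x2, C is 2x2.
        double A[2 * 3] = {1, 2, 3,
                           4, 5, 6};
        double B[3 * 2] = { 7,  8,
                            9, 10,
                           11, 12};
        double C[2 * 2] = {0};

        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    2, 2, 3,      // M, N, K
                    1.0, A, 3,    // alpha, A, lda
                    B, 2,         // B, ldb
                    0.0, C, 2);   // beta, C, ldc

        std::cout << C[0] << ' ' << C[1] << '\n'
                  << C[2] << ' ' << C[3] << '\n';   // expect 58 64 / 139 154
        return 0;
    }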

Chiel
  • 6,006
  • 2
  • 32
  • 57
1

This is a good question that deserves a more complete answer than "use a library".

Of course, if you want to do a good job, you probably should not try to write it yourself. But if this question is about learning how to do matrix multiplication faster, here is a complete answer.

  1. As a practical matter, the code you show writes to memory too much. If the inner loop accumulates the dot product in a scalar variable and only writes the result at the end, the code will be faster. Most compilers are not smart enough to do this for you:

    double dot = 0;
    for(k=0; k<c1; ++k) dot += a[i][k]*b[k][j];
    mult[i][j] = dot;

This also improves multi-core performance, since when you use multiple cores they have to share memory bandwidth. If you are using an array of rows, switch your representation to a single block of memory (a sketch combining these two changes appears at the end of this answer).

  2. As mentioned in another answer, you can do a transpose so that both matrix traversals run in sequential order. Memory is designed to be read sequentially with high efficiency, but your b[k][j] jumps around, so the transposed version is typically about 3x faster as the size gets big (on the order of 1000x1000, where the cost of the initial transpose is negligible).

  3. When the matrix gets large enough, Strassen and Coppersmith-Winograd are faster ways of multiplying that fundamentally change the rules: they cleverly rearrange terms to compute the same theoretical result with a lower complexity bound. In practice they change the answer, because the roundoff error is different, and for large matrices the answers produced by these algorithms are likely to be far worse than those of the brute-force multiplication.

  4. If you have a truly parallel computer, you can copy the matrix to multiple CPUs and have them work in parallel on the answer.

  5. You can put the code onto your video card and use its far more numerous parallel processors, which also have far more memory bandwidth. That's probably the most effective way to get a real speedup on your computer (assuming you have a graphics card). See CUDA or Vulkan.

The fundamental problem is that multiple cores don't help much for matrix multiplication because you are limited by memory bandwidth. That's why doing it on a video card is so good, because bandwidth there is far higher.
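A sketch of point 1 combined with the single-block representation (the flat row-major layout and the function name are my own assumptions, not code from the question):

    #include <vector>

    // Row-major matrices stored in one contiguous block:
    // element (i, j) of an r x c matrix lives at data[i * c + j].
    void multiply_flat(const std::vector<double>& a,   // r1 x c1
                       const std::vector<double>& b,   // c1 x c2
                       std::vector<double>& mult,      // r1 x c2, already sized
                       int r1, int c1, int c2)
    {
        for (int i = 0; i < r1; ++i)
            for (int j = 0; j < c2; ++j) {
                double dot = 0.0;                      // accumulate in a register
                for (int k = 0; k < c1; ++k)
                    dot += a[i * c1 + k] * b[k * c2 + j];
                mult[i * c2 + j] = dot;                // single write per element
            }
    }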

Dov
  • 8,000
  • 8
  • 46
  • 75
0

You could use multiple threads by dividing the multiplication among them. Divide the rows of the first matrix (or the columns of the second) into a number of tasks equal to the number of cores in your processor. If they do not divide evenly, some cores will have to do an extra cycle. Either way, the idea is to give the multiplication to more cores: e.g. divide the first matrix into 4 parts (I have 4 cores), do the multiplication with 4 tasks, and reassemble (the reassembly isn't strictly necessary, since the cores can work on the same output data).
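A rough sketch of that idea with std::thread, splitting the rows of the first matrix across hardware threads (the flat row-major layout and the function name are my own assumptions):

    #include <algorithm>
    #include <thread>
    #include <vector>

    // Multiply row-major a (r1 x c1) by b (c1 x c2) into mult (r1 x c2),
    // giving each thread its own contiguous band of rows of the result.
    void multiply_threaded(const std::vector<double>& a,
                           const std::vector<double>& b,
                           std::vector<double>& mult,
                           int r1, int c1, int c2)
    {
        int n_threads = static_cast<int>(std::max(1u, std::thread::hardware_concurrency()));
        int rows_per_thread = (r1 + n_threads - 1) / n_threads;  // ceiling division

        auto worker = [&](int row_begin, int row_end) {
            for (int i = row_begin; i < row_end; ++i)
                for (int j = 0; j < c2; ++j) {
                    double dot = 0.0;
                    for (int k = 0; k < c1; ++k)
                        dot += a[i * c1 + k] * b[k * c2 + j];
                    mult[i * c2 + j] = dot;
                }
        };

        std::vector<std::thread> workers;
        for (int t = 0; t < n_threads; ++t) {
            int begin = t * rows_per_thread;
            int end = std::min(r1, begin + rows_per_thread);
            if (begin >= end) break;
            workers.emplace_back(worker, begin, end);
        }
        for (auto& w : workers)
            w.join();  // no reassembly needed: each thread writes disjoint rows of mult
    }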

George Kourtis
  • 2,381
  • 3
  • 18
  • 28