Matrix multiplication: Strassen vs. Standard

Question

I tried to implement the Strassen algorithm for matrix multiplication with C++, but the result isn't that, what I expected. As you can see strassen always takes more time then standard implementation and only with a dimension from a power of 2 is as fast as standard implementation. What went wrong? alt text

matrix mult_strassen(matrix a, matrix b) {
if (a.dim() <= cut)
    return mult_std(a, b);

matrix a11 = get_part(0, 0, a);
matrix a12 = get_part(0, 1, a);
matrix a21 = get_part(1, 0, a);
matrix a22 = get_part(1, 1, a);

matrix b11 = get_part(0, 0, b);
matrix b12 = get_part(0, 1, b);
matrix b21 = get_part(1, 0, b);
matrix b22 = get_part(1, 1, b);

matrix m1 = mult_strassen(a11 + a22, b11 + b22); 
matrix m2 = mult_strassen(a21 + a22, b11);
matrix m3 = mult_strassen(a11, b12 - b22);
matrix m4 = mult_strassen(a22, b21 - b11);
matrix m5 = mult_strassen(a11 + a12, b22);
matrix m6 = mult_strassen(a21 - a11, b11 + b12);
matrix m7 = mult_strassen(a12 - a22, b21 + b22);

matrix c(a.dim(), false, true);
set_part(0, 0, &c, m1 + m4 - m5 + m7);
set_part(0, 1, &c, m3 + m5);
set_part(1, 0, &c, m2 + m4);
set_part(1, 1, &c, m1 - m2 + m3 + m6);

return c; 
}

PROGRAM
matrix.h http://pastebin.com/TYFYCTY7
matrix.cpp http://pastebin.com/wYADLJ8Y
main.cpp http://pastebin.com/48BSqGJr

g++ main.cpp matrix.cpp -o matrix -O3.

You should provide the current result and the expected one (and maybe the multiply function) if you want someone to help you. Putting the whole code is too much. An other thing, if the question is for homework, please add the homework flag — Patrice Bernassola, Nov 29 '10 at 14:24
This is not homework. I'm just interested in implementing the strassen algorithm, because it should be faster. Standard has complexity of O(n^3) and strassen O(n^2.8), because it needed one multiplication less. — multiholle, Nov 29 '10 at 14:40

score 8 · Accepted Answer · answered Nov 29 '10 at 14:57

Some thoughts:

Have you optimized it to consider that a non-power of two sized matrix is filled in with zeroes? I think the algorithm assumes you don't bother multiplying these terms. This is why you get the flat areas where the running time is constant between 2^n and 2^(n+1)-1. By not multiplying terms you know are zero you should be able to improve these areas. Or perhaps Strassen is only meant to work with 2^n sized matrices.
Consider that a "large" matrix is arbitrary and the algorithm is only slightly better than the naive case, O(N^3) vs O(N^2.8). You may not see measurable gains until bigger matrices are tried. For example, I was did some Finite Element modeling where 10,000x10,000 matrices were considered "small". Its hard to tell from your graph but it looks like the 511 case may be faster in the Stassen case.
Try testing with various optimization levels including no optimizations at all.
This algorithm seems to assume that multiplications are much more expensive than additions. This was certainly true 40 years ago when it was first developed but I believe in more modern processors the difference between add and multiply has gotten smaller. This may reduce the effectiveness of the algorithm which seems to reduce multiplications but increase additions.
Have you looked at some of the other Strassen implementations out there for ideas? Try benchmarking a known good implementation to see exactly how much faster you can get.

Modifying the storage order of the matrix when using Strassen's algorithm (using Z-order) may also help speed things up, by making the memory accesses more cache friendly. — void-pointer, Nov 27 '11 at 22:03

score 2 · Answer 2 · answered Nov 29 '10 at 15:01

Ok I am no expert in this field, but there might be other issues at work here than processing speed. First the strassen method uses way more stack and has more function calls, which add memory movement. You have a certain penalty the larger your stack gets, since it needs to request larger frames from the OS. Plus you use dynamic allocation, this is also an issue.

Try to use a fixed size (with template parameter) matrix class? This will at least resolve the allocation issue.

Note: I am not sure that it event works properly with your code. Your matrix class uses pointers but has no copy constructor or assignment operator. You are also leaking memory at the end, since you don't have a destructor...

score 2 · Answer 3 · answered Nov 29 '10 at 15:09

The big O of Strassen is O(N ^ log 7) compared to O(N ^ 3) regular, i.e. log 7 base 2 which is slightly less than 3.

That is the number of multiplications you need to make.

It assumes there is no cost to anything else you have, and also should be "faster" only as N gets large enough which yours probably does not.

Much of your implementation is creating lots of sub-matrices and my guess is the way you are storing them you are having to allocate memory and copy every time you do this. Having some kind of "slice" matrix and logical-transpose matrix if you can would help you optimise what is probably the slowest part of your process.

wcochran · Answer 4 · 2022-12-11T04:52:11.253

1

I am actually shocked at how much faster my Stassen multiplcation implementation is:

https://github.com/wcochran/strassen_multiplier/blob/master/mm.c

I get an almost 16x speedup on my machine when n=1024. The only way I can explain this much of a speed-up is that my algorithm is more cache-friendly -- i.e., it focuses on small pieces of the matrices and thus the data is more localized.

The overhead in your C++ implementation is probably too high -- the compiler generates more temporaries than what is really necessary. My implementation tries to minimizing this by reusing memory whenever possible.

edited Dec 11 '22 at 04:52

answered Oct 19 '11 at 20:20

wcochran

10,089
6
61
69

For 1024 x 1024 matrices, you need a blocking algorithm - the naive implementation will be awfully slow because you get one average one cache miss per multiplication. And you really need to investigate how the time changes if you take n = 1023 or 1025, for example and possibly change the layout of your matrices accordingly. – gnasher729 Apr 03 '14 at 17:29
Link to code is broken. – CubicleSoft Dec 09 '22 at 20:51

score 0 · Answer 5 · answered Nov 29 '10 at 14:26

0

Long shot, but have you considered that the standard multiplication may be optimised by the compiler? Could you switch off optimisations?

answered Nov 29 '10 at 14:26

Eamorr

9,872
34
125
209

Switching off the optimisations increase the time but gives the same graph. – multiholle Nov 29 '10 at 14:37
Mmm. I don't think I can help you any further. It seems there are a lot of function calls accessing main memory in your Strassen implementation. Perhaps this is the bottle-neck? The naive multiplication would utilise the registers quite a bit. I'd recommend using a BLAS library for matrix multiplication. – Eamorr Nov 29 '10 at 14:46
2

Switching off optimisations kind of kills the point of any benchmark which is expected to give results useful in real life. – Kos Nov 29 '10 at 15:17
1

Wow. What's the neg rep for? I didn't see you make any suggestion. – Eamorr Nov 29 '10 at 17:03

Matrix multiplication: Strassen vs. Standard

5 Answers5

Linked