
What is the optimal crossover point below which Strassen's algorithm should stop recursing and fall back to conventional multiplication, in terms of efficiency?

I know this is closely tied to the specific implementation and hardware, but there should be some kind of guideline, or experimental results from someone, for the general case.

From searching online a bit and asking around, people tend to think it's

n = 64; 

or

 n = 32;

Can anyone verify / reject these results?
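To make the question concrete, here is the kind of hybrid I mean, as a minimal sketch (my own code, not taken from any particular library; it assumes square matrices with n a power of two, and `CROSSOVER` is exactly the knob I am asking about):

```c
#include <stdlib.h>
#include <string.h>

#define CROSSOVER 64  /* the tuning knob in question; 32-128 are commonly reported */

/* Naive O(n^3) multiply: C = A * B, all n x n, row-major. */
static void naive_mul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

static void add(int n, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < n*n; i++) Z[i] = X[i] + Y[i];
}
static void sub(int n, const double *X, const double *Y, double *Z) {
    for (int i = 0; i < n*n; i++) Z[i] = X[i] - Y[i];
}

/* Copy quadrant (r,c) of the n x n matrix S into the h x h matrix D, h = n/2. */
static void get_quad(int n, const double *S, int r, int c, double *D) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(D + i*h, S + (size_t)(r*h + i)*n + c*h, h * sizeof(double));
}
static void put_quad(int n, double *S, int r, int c, const double *D) {
    int h = n / 2;
    for (int i = 0; i < h; i++)
        memcpy(S + (size_t)(r*h + i)*n + c*h, D + i*h, h * sizeof(double));
}

/* Strassen multiply for n a power of two; falls back to naive below CROSSOVER. */
void strassen(int n, const double *A, const double *B, double *C) {
    if (n <= CROSSOVER) { naive_mul(n, A, B, C); return; }
    int h = n / 2;
    size_t sz = (size_t)h * h * sizeof(double);
    double *a11 = malloc(sz), *a12 = malloc(sz), *a21 = malloc(sz), *a22 = malloc(sz);
    double *b11 = malloc(sz), *b12 = malloc(sz), *b21 = malloc(sz), *b22 = malloc(sz);
    double *m[7], *t1 = malloc(sz), *t2 = malloc(sz);
    for (int i = 0; i < 7; i++) m[i] = malloc(sz);
    get_quad(n, A, 0, 0, a11); get_quad(n, A, 0, 1, a12);
    get_quad(n, A, 1, 0, a21); get_quad(n, A, 1, 1, a22);
    get_quad(n, B, 0, 0, b11); get_quad(n, B, 0, 1, b12);
    get_quad(n, B, 1, 0, b21); get_quad(n, B, 1, 1, b22);

    add(h, a11, a22, t1); add(h, b11, b22, t2); strassen(h, t1, t2, m[0]);  /* M1 */
    add(h, a21, a22, t1);                       strassen(h, t1, b11, m[1]); /* M2 */
    sub(h, b12, b22, t2);                       strassen(h, a11, t2, m[2]); /* M3 */
    sub(h, b21, b11, t2);                       strassen(h, a22, t2, m[3]); /* M4 */
    add(h, a11, a12, t1);                       strassen(h, t1, b22, m[4]); /* M5 */
    sub(h, a21, a11, t1); add(h, b11, b12, t2); strassen(h, t1, t2, m[5]);  /* M6 */
    sub(h, a12, a22, t1); add(h, b21, b22, t2); strassen(h, t1, t2, m[6]);  /* M7 */

    /* C11 = M1 + M4 - M5 + M7 */
    add(h, m[0], m[3], t1); sub(h, t1, m[4], t1); add(h, t1, m[6], t1);
    put_quad(n, C, 0, 0, t1);
    /* C12 = M3 + M5 */
    add(h, m[2], m[4], t1); put_quad(n, C, 0, 1, t1);
    /* C21 = M2 + M4 */
    add(h, m[1], m[3], t1); put_quad(n, C, 1, 0, t1);
    /* C22 = M1 - M2 + M3 + M6 */
    sub(h, m[0], m[1], t1); add(h, t1, m[2], t1); add(h, t1, m[5], t1);
    put_quad(n, C, 1, 1, t1);

    free(a11); free(a12); free(a21); free(a22);
    free(b11); free(b12); free(b21); free(b22);
    free(t1); free(t2);
    for (int i = 0; i < 7; i++) free(m[i]);
}
```

The question is what value of `CROSSOVER` minimizes the total running time in practice.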

peterh
kxk

3 Answers


This should be tuned on a per-machine basis (a bit like ATLAS does). This kind of optimization pays off only for quite large matrices: if you code it yourself and compare it to, e.g., a vendor BLAS implementation, you will find a quite large n.

Memory requirements for the Strassen algorithm are also something to weigh.

Alexandre C.

On my dual-core 2.66 GHz Mac Pro, using [my implementation][1], the cross-over is as small as n = 16. In fact, my implementation is far faster than the conventional algorithm for large matrices, and I am not sure why -- I suspect it is more cache-friendly, since it works on smaller, localized blocks of data. I am about to post a question about this.

  [1]: http://ezekiel.vancouver.wsu.edu/~cs330/lectures/linear_algebra/mm/mm.c

wcochran
  • I tested your code on my PC and found that the cross-over is still n = 64. However, your code is still 20 times faster than my ugly C++ version. – Mike_Dog Jun 17 '15 at 14:47

After a lot of tests, I have concluded that, at least for my processor, the optimal crossover point below which Strassen's algorithm should stop recursing is n = 128.
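A crude way to probe this on your own machine (a hypothetical sketch, not my actual benchmark; the naive triple loop stands in for whatever base-case kernel you use, and `probe_crossover` is a name I made up): one Strassen step replaces a single n x n multiply with 7 multiplies of size n/2 plus 18 O(n^2) additions, so it starts paying off roughly where 7 * T(n/2), plus the addition overhead this sketch ignores, drops below T(n).

```c
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Naive O(n^3) multiply standing in for the base-case kernel. */
static void naive_mul(int n, const double *A, const double *B, double *C) {
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i*n + k] * B[k*n + j];
            C[i*n + j] = s;
        }
}

/* Time one n x n naive multiply, in CPU seconds. */
static double time_mul(int n) {
    double *A = malloc((size_t)n * n * sizeof(double));
    double *B = malloc((size_t)n * n * sizeof(double));
    double *C = malloc((size_t)n * n * sizeof(double));
    for (int i = 0; i < n * n; i++) { A[i] = i % 7; B[i] = i % 5; }
    clock_t t0 = clock();
    naive_mul(n, A, B, C);
    double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;
    free(A); free(B); free(C);
    return secs;
}

/* Print T(n) next to 7 * T(n/2) for a range of sizes; the crossover is
   roughly the smallest n where the second column beats the first. */
void probe_crossover(void) {
    for (int n = 32; n <= 512; n *= 2)
        printf("n=%4d  T(n)=%.4fs  7*T(n/2)=%.4fs\n",
               n, time_mul(n), 7.0 * time_mul(n / 2));
}
```

Calling `probe_crossover()` from `main` prints the table; note that this ignores the addition overhead and memory traffic of a real Strassen step, so the true crossover will sit somewhat above the point it suggests.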

My processor is an Intel Core i5-430M. Also, interestingly enough for a 4-threaded CPU, my implementation worked a little better with numberOfProcesses = 8 than with numberOfProcesses = 4. I have no clue how or why that happened. I would have guessed that the extra communication through channels would add overhead, and that since the processes cannot all run at the same time it would definitely be a bit slower. Apparently I was wrong. If anyone can explain this, please drop a line, just for the record.

Thanks.

kxk
  • The actual crossover point depends on quite a lot of factors, among them the actual instruction set, the parallelism level used for the naive small-matrix multiplication (I assume you used a vendor-specific BLAS implementation, which may use multiple threads), memory access speed, cache considerations, hyperthreading considerations, the way you store intermediate results, possibly the way the OS schedules the threads, etc. Most probably a vendor-specific BLAS implementation will switch to the Strassen algorithm itself, so you may actually be using it without knowing. Moral: don't do this yourself. – Alexandre C. Mar 28 '11 at 11:10
  • Yeah, fair enough. It's for a small application, though, so I do not expect severe problems, since the matrices I will be using will not be big. I was just being a perfectionist. Thank you for all the info; I will keep it in mind. – kxk Mar 28 '11 at 11:59