7

I was wondering how can matlab multiply two matrices so fast. When multiplying two NxN matrices, N^3 multiplications are performed. Even with the Strassen Algorithm it takes N^2.8 multiplications, which is still a large number. I was running the following test program:

a = rand(2160);
b = rand(2160);
tic;a*b;toc

2160 was used because 2160^3=~10^10 ( a*b should be about 10^10 multiplications)

I got:

Elapsed time is 1.164289 seconds.

(I'm running on 2.4Ghz notebook and no threading occurs) which mean my computer made ~10^10 operation in a little more than 1 second.

How this could be??

angainor
  • 11,760
  • 2
  • 36
  • 56
Mercury
  • 1,886
  • 5
  • 25
  • 44

2 Answers2

13

It's a combination of several things:

  • Matlab does indeed multi-thread.
  • The core is heavily optimized with vector instructions.

Here's the numbers on my machine: Core i7 920 @ 3.5 GHz (4 cores)

>> a = rand(10000);
>> b = rand(10000);
>> tic;a*b;toc
Elapsed time is 52.624931 seconds.

Task Manager shows 4 cores of CPU usage.

Now for some math:

Number of multiplies = 10000^3 = 1,000,000,000,000 = 10^12

Max multiplies in 53 secs =
    (3.5 GHz) * (4 cores) * (2 mul/cycle via SSE) * (52.6 secs) = 1.47 * 10^12

So Matlab is achieving about 1 / 1.47 = 68% efficiency of the maximum possible CPU throughput.

I see nothing out of the ordinary.

Mysticial
  • 464,885
  • 45
  • 335
  • 332
  • matrix matrix multiplication also performs adds, not only muls. I think that your performance estimates should include 4 FLOPs/cycle via SSE, but twice as many operations. Am I correct? And is your MATLAB not using AVX-enabled BLAS? – angainor Oct 04 '12 at 07:56
  • 1
    @angainor It's actually one `add` and one `mul` per cycle. Each one can be SSE. However, the add and muls are separate execution units, so you can't "double up" on one if you don't use the other. – Mysticial Oct 04 '12 at 07:59
  • Thats correct. Just for the sake of this analysis, adds should be included. The results are the same, just that you made a non-trivial shortcut. Might be hard to understand for someone who does not know what you wrote in the comment. – angainor Oct 04 '12 at 08:01
  • The OP was only counting multiplications. I didn't want to confuse him with additions as well (even though they come out to be exactly the same). – Mysticial Oct 04 '12 at 08:02
  • By the way, how is it with AVX? I just assumed its 8 FLOPs per cycle. Is it also split between adds and muls? – angainor Oct 04 '12 at 08:04
  • That machine (Core i7 920) is the first generation i7. It doesn't have AVX. My sandbox does (an overclocked 2600K). But I don't have Matlab installed on it. – Mysticial Oct 04 '12 at 08:06
5

To check whether you do or not use multi-threading in MATLAB use this command

maxNumCompThreads(n)

This sets the number of cores to use to n. Now I have a Core i7-2620M, which has a maximum frequency of 2.7GHz, but it also has a turbo mode with 3.4GHz. The CPU has two cores. Let's see:

A = rand(5000);
B = rand(5000);
maxNumCompThreads(1);
tic; C=A*B; toc
Elapsed time is 10.167093 seconds.

maxNumCompThreads(2);
tic; C=A*B; toc
Elapsed time is 5.864663 seconds.

So there is multi-threading.

Let's look at the single CPU results. A*B executes approximately 5000^3 multiplications and additions. So the performance of single-threaded code is

5000^3*2/10.8 = 23 GFLOP/s

Now the CPU. 3.4 GHz, and Sandy Bridge can do maximum 8 FLOPs per cycle with AVX:

3.4 [Ginstructions/second] * 8 [FLOPs/instruction] = 27.2 GFLOP/s peak performance

So single core performance is around 85% peak, which is to be expected for this problem.

You really need to look deeply into the capabilities of your CPU to get accurate performannce estimates.

angainor
  • 11,760
  • 2
  • 36
  • 56