
I need to compare the running time of two algorithms, Alg_x and Alg_y. Alg_x contains many matrix multiplications, while Alg_y contains many element-wise operations (e.g., summation and multiplication of pairs of numbers/vectors). Theoretically, Alg_x and Alg_y have the same running time. In practice, however, Alg_x can run much faster than Alg_y, because matrix multiplication has been specially designed and optimized in MATLAB.

My problem, then, is: how can I disable such 'code optimization' in order to compare the running times fairly, so that they reflect the theoretical time complexity?

X = randn(1000,2000);

Alg_x

tic;
temp = X*X';
toc

Alg_y

[d,n] = size(X);
temp = zeros(d,d);
tic;
for i = 1:n
    x = X(:,i);
    temp = temp + x*x';
end
toc

The above two code snippets produce the same output, yet Alg_x runs much faster. Moreover, Alg_y also runs much faster once I remove `x = X(:,i); temp = temp+x*x';` from the loop body, so I guess it is the `for` loop that makes Alg_y slow.

I do want to disable and avoid such optimizations. Below is an excerpt from Why is MATLAB so fast in matrix multiplication?

I am making some benchmarks with CUDA, C++, C#, and Java, and using MATLAB for verification and matrix generation. But when I multiply in MATLAB, 2048x2048 and even bigger matrices are multiplied almost instantly.

             1024x1024   2048x2048   4096x4096
             ---------   ---------   ---------
CUDA C (ms)      43.11      391.05     3407.99
C++ (ms)       6137.10    64369.29   551390.93
C# (ms)       10509.00   300684.00  2527250.00
Java (ms)      9149.90    92562.28   838357.94
MATLAB (ms)      75.01      423.10     3133.90

Only CUDA is competitive, but I thought that at least C++ would be somewhat close, not 60x slower.

So my question is - How is MATLAB doing it that fast?

C++ Code:

float temp = 0;
timer.start();
for(int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();

Edit: I also don't know what to make of the C# results. The algorithm is just the same as in C++ and Java, but there's a giant jump from 1024 to 2048?

Edit2: Updated MATLAB and 4096x4096 results

olivia
  • Matrix multiplication is **designed** to be fast. It internally uses highly optimized BLAS routines to achieve fast computation times. You don't have a choice but to use the fast algorithms that matrix multiplication provides for you in MATLAB (why would you not want to anyway?). Why are you performing this endeavour? If I may make a suggestion, you could look at this from a more theoretical perspective if you turned off the JIT. The JIT accelerates the running of `for` loops, which may interfere with your "theoretical" time measuring. Type `feature accel off;` at your command prompt and try your loop code again. – rayryeng May 01 '16 at 07:28
  • This article on memory, https://people.freebsd.org/~lstewart/articles/cpumemory.pdf, details why a matrix-matrix multiply can be written to perform much faster than the list of its matrix-vector multiplies. The main reason is how the cache is handled. FLOPS is not the only performance indicator in computing, nor is algorithmic complexity (some might say log(n) is constant). What *theoretical time complexity* metric do you want to illustrate? – Florent DUGUET May 01 '16 at 12:51
  • Hi, thanks for all your kind help. @FlorentDUGUET My time complexity metric is just flops, and the complexity should ignore how the cache works. – olivia May 02 '16 at 03:56

1 Answer


I'm answering your question "How is MATLAB doing it that fast?".

MATLAB uses Intel MKL for matrix multiplication.
This is highly optimized code that takes advantage of all the cores and their vector processing units (SSE / AVX).
Moreover, it is hand-tuned for the cache layout of the CPU.

Your code does none of that, and hence leaves a lot of performance on the table.

There might be a way to disable MKL in MATLAB, though so far I've only seen methods to replace it.

Royi
  • Hi, thanks for your answer. BTW, can you point me to some official documents released by MathWorks that support your answer? Many thanks in advance. – olivia Jul 19 '16 at 07:42
  • You can use `version -blas` in MATLAB to see its BLAS engine (it will state that it is Intel MKL). – Royi Jul 19 '16 at 09:03
  • Many thanks again. BTW, could you tell me from which official documents you know that 'it is hand-tuned for the cache layout of the CPU'? I know what you are saying is right, but I have to convince others with official documents. – olivia Jul 20 '16 at 03:32
  • All you need to do is read about Intel MKL - https://en.wikipedia.org/wiki/Math_Kernel_Library and https://software.intel.com/en-us/intel-mkl. – Royi Jul 20 '16 at 05:20