205

I am making some benchmarks with CUDA, C++, C#, Java, and using MATLAB for verification and matrix generation. When I perform matrix multiplication with MATLAB, 2048x2048 and even bigger matrices are almost instantly multiplied.

             1024x1024   2048x2048   4096x4096
             ---------   ---------   ---------
CUDA C (ms)      43.11      391.05     3407.99
C++ (ms)       6137.10    64369.29   551390.93
C# (ms)       10509.00   300684.00  2527250.00
Java (ms)      9149.90    92562.28   838357.94
MATLAB (ms)      75.01      423.10     3133.90

Only CUDA is competitive, but I thought that at least C++ would be somewhat close, not 60 times slower. I also don't know what to think about the C# results. The algorithm is the same as in C++ and Java, but there's a giant jump from 1024 to 2048.

How is MATLAB performing matrix multiplication so fast?

C++ Code:

float temp = 0;
timer.start();
for (int j = 0; j < rozmer; j++)        // rozmer = the matrix dimension
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        for (int m = 0; m < rozmer; m++)
        {
            // matice1, matice2 = inputs; matice3 = output ("matice" = matrix)
            temp = temp + matice1[j][m] * matice2[m][k];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();
Adriaan
Wolf
  • 14
    Probably it's a question of which algorithm you use. – Robert J. May 19 '11 at 11:49
  • 26
    Make sure Matlab isn't caching your result, it's a tricky beast. First ensure the calculation is actually being performed, and then compare. – rubenvb May 19 '11 at 11:49
  • 29
    LAPACK and vectorisation. http://www.mathworks.com/company/newsletters/news_notes/clevescorner/winter2000.cleve.html – James May 19 '11 at 11:52
  • The algorithm is classic matrix multiplication through 3 for loops. The MATLAB results look fine; they are not exact (but very close) when using floats, but I think that's because of number rounding in the languages. – Wolf May 19 '11 at 11:56
  • `tic` and `toc` for rough timing. – KitsuneYMG May 19 '11 at 12:01
  • And don't forget Matlab is probably doing double precision math where the GPU (unless it's really new) is doing single precision. – phkahler May 19 '11 at 14:09
  • What about using multiple cores? Your sequential nested loop won't do it. – vbence May 23 '11 at 21:29
  • 11
    I actually do think that this post is really interesting but I would really like to see more appropriate benchmarks. For example, I think that Matlab R2011a is using multithreading automatically and matrix multiplications are implemented using Intel's mkl/blas library. Thus, I would guess that c++ is faster if one used an mkl call to do the matrix multiplication. The question would then be what Matlab's overhead is. I know that this depends on additional details of the matrix multiplication but the above numbers are pretty meaningless right now. – Lucas Jul 14 '11 at 07:03
  • Doing "temp += matice1[j][m] * matice2[m][k];" in your C++ code will give you a bit of an edge; even more so as the matrix grows in size. –  Dec 02 '12 at 16:05
  • You could switch the k-loop and the m-loop. This will boost your C++ code as it uses caches better. See http://martin-thoma.com/matrix-multiplication-python-java-cpp/ – Martin Thoma Dec 21 '12 at 19:01
  • similar question: [Naive C++ Matrix Multiplication 100 times slower than BLAS?](http://codereview.stackexchange.com/questions/20980/naive-c-matrix-multiplication-100-times-slower-than-blas) – Amro Jun 17 '13 at 17:19
  • What CPU/GPU chipsets, compilers, and versions were used to get these metrics? – zackery.fix Jan 02 '16 at 22:52
  • this code is not cache friendly either... – macroland Feb 16 '16 at 10:26
  • 1
    You can use the "Strassen algorithm" with running time O(n^2.81) for large square matrix multiplication, which is around 10x faster than naive multiplication, which runs in O(n^3). Also, SSE/AVX can help you get around 8-20x faster code execution. All together you can have a C implementation faster than MATLAB's. – DU Jiaen Jun 13 '16 at 02:44
  • I never figured out how in the 90's I could `inv(rand(1000))` and it would return a result in seconds. That was on a 50 MHz Intel 286 DX w/ math co-processor. – John Alexiou Apr 12 '17 at 21:33
  • Use the MKL library for C++, then compare results. – Hadi Jan 18 '19 at 03:11
  • You need to be careful about making fair comparisons with C++. Can you post the C++ code that shows the core inner loops that you're using for matrix multiplication? Mostly, I'm concerned with your memory layout and whether you're doing things wastefully. I've written C++ matrix multiplication that is as fast as Matlab's but it took some care. (EDIT: Before Matlab was using GPUs for this.) You can be virtually guaranteed that Matlab is wasting very few cycles on these "built-in" functions. My question is, where are you wasting cycles? (No offense) – Chris A. May 19 '11 at 15:16
  • Did you check that all the implementations used multi-threading optimizations for the algorithm ? And did they use the same multiplication algorithm ? I really doubt that. Matlab isn't inherently fast, you probably used slow implementations. [Algorithms for efficient matrix multiplication](http://en.wikipedia.org/wiki/Matrix_multiplication#Algorithms_for_efficient_matrix_multiplication) – Yochai Timmer May 19 '11 at 15:32
  • Late to the party here, but it's probably safe to say that the CUDA implementation needs a lot of shared-memory TLC if it's only matching Matlab's throughput, assuming that the GPU being used isn't very outdated. – Steven Lu May 13 '20 at 18:50

12 Answers

211

This kind of question recurs and deserves, for once on Stack Overflow, a clearer answer than "MATLAB uses highly optimized libraries" or "MATLAB uses the MKL".

History:

Matrix multiplication (together with matrix-vector multiplication, vector-vector multiplication, and many of the matrix decompositions) is among the most important problems in linear algebra. Engineers have been solving these problems with computers since the early days.

I'm not an expert on the history, but apparently back then, everybody just rewrote their own FORTRAN version with simple loops. Some standardization then came along, with the identification of "kernels" (basic routines) that most linear algebra problems needed in order to be solved. These basic operations were then standardized in a specification called Basic Linear Algebra Subprograms (BLAS). Engineers could then call these standard, well-tested BLAS routines in their code, making their work much easier.

BLAS:

BLAS evolved from level 1 (the first version which defined scalar-vector and vector-vector operations) to level 2 (vector-matrix operations) to level 3 (matrix-matrix operations), and provided more and more "kernels" so standardized more and more of the fundamental linear algebra operations. The original FORTRAN 77 implementations are still available on Netlib's website.

Towards better performance:

So over the years (notably between the BLAS level 1 and level 2 releases: early 80s), hardware changed, with the advent of vector operations and cache hierarchies. These evolutions made it possible to increase the performance of the BLAS subroutines substantially. Different vendors then came along with their implementation of BLAS routines which were more and more efficient.

I don't know all the historical implementations (I was either not born yet or a kid back then), but two of the most notable ones came out in the early 2000s: the Intel MKL and GotoBLAS. Your MATLAB uses the Intel MKL, which is a very good, optimized BLAS, and that explains the great performance you see.

Technical details on Matrix multiplication:

So why is Matlab (the MKL) so fast at dgemm (double-precision general matrix-matrix multiplication)? In simple terms: because it uses vectorization and good caching of data. In more complex terms: see the article provided by Jonathan Moore.

Basically, when you perform your multiplication in the C++ code you provided, you are not at all cache-friendly. Since I suspect you created an array of pointers to row arrays, your inner-loop accesses to the k-th column of matice2 (matice2[m][k]) are very slow. Indeed, when you access matice2[0][k], you get the k-th element of row array 0 of your matrix. Then in the next iteration, you must access matice2[1][k], which is the k-th element of another row array (row array 1). Then in the next iteration you access yet another row array, and so on... Since the entire matrix matice2 can't fit in the highest caches (even the 1024x1024 case of 4-byte floats is 4*1024*1024 bytes, and your larger sizes far more), the program must fetch the desired element from main memory, losing a lot of time.

If you just transposed the matrix, so that accesses would be in contiguous memory addresses, your code would already run much faster, because now the CPU can load entire rows into the cache at once. Just try this modified version:

timer.start();
float temp = 0;
// Transpose matice2 so the inner loop below reads it row-wise
// (contiguous memory) instead of column-wise.
for (int p = 0; p < rozmer; p++)
{
    for (int q = 0; q < rozmer; q++)
    {
        tempmat[p][q] = matice2[q][p];
    }
}
for (int j = 0; j < rozmer; j++)
{
    for (int k = 0; k < rozmer; k++)
    {
        temp = 0;
        // Both operands are now traversed sequentially in memory.
        for (int m = 0; m < rozmer; m++)
        {
            temp = temp + matice1[j][m] * tempmat[k][m];
        }
        matice3[j][k] = temp;
    }
}
timer.stop();

So you can see how cache locality alone increased your code's performance quite substantially. Now, real dgemm implementations exploit that to a very extensive level: they perform the multiplication on blocks of the matrix sized to fit the TLB (translation lookaside buffer; long story short, what can effectively be cached), so that they stream to the processor exactly the amount of data it can process. The other aspect is vectorization: they use the processor's vector instructions for optimal instruction throughput, which you can't really do from cross-platform C++ code.
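
To give a flavor of the blocking idea (this is not the real dgemm, which also vectorizes and tunes its block sizes to the hardware; the block size BS = 64 and the flat row-major layout are assumptions for illustration), a minimal cache-blocked multiplication could look like this:

#include <vector>
#include <algorithm>

// Minimal cache-blocked multiply: C = A * B, all n x n, row-major.
// BS is a hypothetical block size; real libraries tune it per CPU.
void blocked_matmul(const std::vector<float>& A,
                    const std::vector<float>& B,
                    std::vector<float>& C, int n)
{
    const int BS = 64;
    std::fill(C.begin(), C.end(), 0.0f);
    for (int jb = 0; jb < n; jb += BS)
        for (int kb = 0; kb < n; kb += BS)
            for (int mb = 0; mb < n; mb += BS)
                // Multiply one BS x BS tile; these small tiles stay in cache.
                for (int j = jb; j < std::min(jb + BS, n); j++)
                    for (int m = mb; m < std::min(mb + BS, n); m++)
                    {
                        const float a = A[j * n + m];
                        // Streams a row of B and a row of C contiguously.
                        for (int k = kb; k < std::min(kb + BS, n); k++)
                            C[j * n + k] += a * B[m * n + k];
                    }
}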

Finally, people claiming that it's because of Strassen's or the Coppersmith–Winograd algorithm are wrong: neither of these algorithms is implementable in practice, because of the hardware considerations mentioned above.

TylerH
reverse_engineer
  • 3
    I just watched a Scott Meyers video on the importance of cache sizes and fitting data into cache line sizes, and the problems that you can have with multi-threaded solutions that have no shared data in the source but end up with data shared at the hardware/core-thread level : https://youtu.be/WDIkqP4JbkE – WillC Dec 06 '18 at 03:42
  • The link to the article is dead. Would it be possible for you to update it? – Our Oct 14 '22 at 07:58
  • @Our The link works for me now... – reverse_engineer Aug 23 '23 at 07:32
97

Here are my results using MATLAB R2011a + Parallel Computing Toolbox on a machine with a Tesla C2070:

>> A = rand(1024); gA = gpuArray(A);
% warm up by executing the operations a couple of times, and then:
>> tic, C = A * A; toc
Elapsed time is 0.075396 seconds.
>> tic, gC = gA * gA; toc
Elapsed time is 0.008621 seconds.

MATLAB uses highly optimized libraries for matrix multiplication which is why the plain MATLAB matrix multiplication is so fast. The gpuArray version uses MAGMA.

Update using R2014a on a machine with a Tesla K20c, and the new timeit and gputimeit functions:

>> A = rand(1024); gA = gpuArray(A);
>> timeit(@()A*A)
ans =
    0.0324
>> gputimeit(@()gA*gA)
ans =
    0.0022

Update using R2018b on a WIN64 machine with 16 physical cores and a Tesla V100:

>> timeit(@()A*A)
ans =
    0.0229
>> gputimeit(@()gA*gA)
ans =
   4.8019e-04

(NB: at some point (I forget when exactly) gpuArray switched from MAGMA to cuBLAS - MAGMA is still used for some gpuArray operations though)

Update using R2022a on a WIN64 machine with 32 physical cores and an A100 GPU:

>> timeit(@()A*A)
ans =
    0.0076
>> gputimeit(@()gA*gA)
ans =
   2.5344e-04
Edric
42

This is why. MATLAB doesn't perform a naive matrix multiplication by looping over every single element the way you did in your C++ code.

Of course I'm assuming that you just used C=A*B instead of writing a multiplication function yourself.

Doug Stephen
20

Matlab incorporated LAPACK some time ago, so I assume their matrix multiplication uses something at least that fast. LAPACK source code and documentation are readily available.

You might also look at Goto and Van De Geijn's paper "Anatomy of High-Performance Matrix Multiplication" at http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.140.1785&rep=rep1&type=pdf

Jonathan Moore
  • 7
    MATLAB uses Intel MKL Library which provides optimized implementation of BLAS/LAPACK routines: http://stackoverflow.com/a/16723946/97160 – Amro Jun 17 '13 at 17:22
13

The answer is that the LAPACK and BLAS libraries make MATLAB blindingly fast at matrix operations, not any proprietary code by the folks at MathWorks.

Use the LAPACK and/or BLAS libraries in your C++ code for matrix operations and you should get similar performance to MATLAB. These libraries should be freely available on any modern system, and parts were developed over decades in academia. Note that there are multiple implementations, including some closed-source ones such as Intel MKL.

A discussion of how BLAS gets high performance is available here.


BTW, in my experience it's a serious pain to call the LAPACK libraries directly from C (but worth it). You need to read the documentation VERY precisely.
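
For a flavor of what the C++ side can look like, here is a hedged sketch through the CBLAS interface (blas_matmul is a made-up name; it assumes row-major n x n matrices and a BLAS that ships cblas.h, e.g. OpenBLAS or MKL):

#include <cblas.h>
#include <vector>

// C = 1.0 * A * B + 0.0 * C, delegated to whatever optimized BLAS is
// linked in (OpenBLAS, MKL, ...). All matrices are n x n, row-major.
void blas_matmul(const std::vector<double>& A,
                 const std::vector<double>& B,
                 std::vector<double>& C, int n)
{
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,
                1.0, A.data(), n,
                     B.data(), n,
                0.0, C.data(), n);
}

Linked against an optimized BLAS, this single call should land in the same ballpark as the MATLAB timings in the question.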

Matthew Gunn
9

For your matrix multiplication, you use the naive multiplication method, which takes O(n^3) time.

There exist matrix multiplication algorithms that take O(n^2.4) time, which means that at n = 2000 your algorithm requires ~100 times as much computation as the best algorithm (the ratio is n^3 / n^2.4 = n^0.6, and 2000^0.6 ≈ 95).
You should really check the Wikipedia page on matrix multiplication for further information on the efficient ways to implement it.

ElderMael
Jouni Osmala
  • and MATLAB probably uses such an algorithm, since the time for the 2048x2048 multiplication is smaller than 8 times the time for the 1024x1024 one! Well done MATLAB guys. – Renaud Oct 16 '13 at 15:11
  • 6
    I rather doubt they use the "efficient" multiplication algorithms, despite their theoretical advantages. Even Strassen's algorithm has implementation difficulties, and the Coppersmith–Winograd algorithm that you have probably read about just plain *isn't* practical (right now). Also, related SO thread: http://stackoverflow.com/questions/17716565/matrix-multiplication-time-complexity-in-matlab – Ernir Jan 29 '14 at 09:06
  • That algorithm is only for exceedingly large matrices. –  Jun 01 '18 at 17:19
  • @Renaud. That's the definition of relatively constant overhead. – Mad Physicist Jun 11 '20 at 16:39
7

Depending on your version of Matlab, I believe it might be using your GPU already.

Another thing: Matlab keeps track of many properties of your matrix, such as whether it's diagonal, Hermitian, and so forth, and specializes its algorithms based thereon. Maybe it's specializing based on the zero matrix you are passing it, or something like that? Maybe it is caching repeated function calls, which messes up your timings? Perhaps it optimizes away repeated unused matrix products?

To guard against such things happening, use a matrix of random numbers, and make sure you force execution by printing the result to screen or disk or some such.

Eelco Hoogendoorn
  • 4
    As a heavy ML user, I can tell you they aren't using GPGPU yet. New versions of Matlab DO use SSE1/2 (finally). But I have done tests. A MexFunction performing an element-wise multiplication runs twice as fast as `A.*B` does. So the OP is almost certainly goofing on something. – KitsuneYMG May 19 '11 at 12:00
  • 6
    Matlab with Parallel Computing Toolbox *can* use a CUDA GPU, but it's explicit - you have to push the data to the GPU. – Edric May 19 '11 at 12:38
  • I use M1 = single(rand(1024,1024)*255); M2 = single(rand(1024,1024)*255); and M3 = M1 * M2; ... then write to a binary file of floats. It's all done very quickly. – Wolf May 19 '11 at 13:45
4

The general answer to "Why is Matlab faster at doing xxx than other programs" is that Matlab has a lot of built-in, optimized functions.

The other programs often do not have these functions, so people apply their own creative solutions, which are surprisingly slower than professionally optimized code.

This can be interpreted in two ways:

1) The common/theoretical way: Matlab is not significantly faster; you are just doing the benchmark wrong.

2) The realistic way: for this stuff, Matlab is faster in practice, because languages such as C++ are just too easy to use in ineffective ways.

Dennis Jaheruddin
  • 8
    He's comparing MATLAB speed with the speed of a function he wrote in two minutes. I can write a faster function in 10 minutes, or a much faster function in two hours. The MATLAB guys have spent more than two hours on making their matrix multiplication fast. – gnasher729 Apr 04 '14 at 17:13
4

MATLAB uses a highly optimized implementation of BLAS/LAPACK from Intel known as the Intel Math Kernel Library (Intel MKL) - specifically the dgemm function. This library takes advantage of processor features including SIMD instructions and multi-core processors. They don't document which specific algorithm they use. If you were to call Intel MKL from C++ you should see similar performance.

I am not sure what library MATLAB uses for GPU multiplication, but probably something like NVIDIA cuBLAS.

gregswiss
  • 1
    You are right, but have you seen [this answer](http://stackoverflow.com/a/31231704/2778484)? However, IPP is not MKL and MKL has far superior linear algebra performance compared to IPP. Also, IPP deprecated their matrix math module in recent versions. – chappjc Sep 10 '15 at 06:32
  • Sorry I meant MKL not IPP – gregswiss Sep 10 '15 at 06:38
  • You are right the other answer covers it. It's so verbose I missed it. – gregswiss Sep 11 '15 at 05:59
3

Because MATLAB is a programming language that was initially developed for numerical linear algebra (matrix manipulation) and has libraries specially developed for matrix multiplication. And MATLAB can now additionally use GPUs (graphics processing units) for this.

And if we look at your computation results:

             1024x1024   2048x2048   4096x4096
             ---------   ---------   ---------
CUDA C (ms)      43.11      391.05     3407.99
C++ (ms)       6137.10    64369.29   551390.93
C# (ms)       10509.00   300684.00  2527250.00
Java (ms)      9149.90    92562.28   838357.94
MATLAB (ms)      75.01      423.10     3133.90

then we can see that MATLAB is not alone in being so fast at matrix multiplication: CUDA C (NVIDIA's programming language) gets even better results than MATLAB. CUDA C also has libraries specially developed for matrix multiplication, and it uses GPUs.

Short history of MATLAB

Cleve Moler, the chairman of the computer science department at the University of New Mexico, started developing MATLAB in the late 1970s. He designed it to give his students access to LINPACK (a software library for performing numerical linear algebra) and EISPACK (a software library for numerical computation of linear algebra) without them having to learn Fortran. It soon spread to other universities and found a strong audience within the applied mathematics community. Jack Little, an engineer, was exposed to it during a visit Moler made to Stanford University in 1983. Recognizing its commercial potential, he joined with Moler and Steve Bangert. They rewrote MATLAB in C and founded MathWorks in 1984 to continue its development. These rewritten libraries were known as JACKPAC. In 2000, MATLAB was rewritten to use a newer set of libraries for matrix manipulation, LAPACK (a standard software library for numerical linear algebra).

Source

What is CUDA C

CUDA C also comes with libraries specially developed for matrix multiplication, such as cuBLAS (NVIDIA's GPU implementation of BLAS). CUDA code runs on the GPU and can interoperate with graphics APIs such as OpenGL (Open Graphics Library) and Direct3D (on MS Windows).

The CUDA platform is designed to work with programming languages such as C, C++, and Fortran. This accessibility makes it easier for specialists in parallel programming to use GPU resources, in contrast to prior APIs like Direct3D and OpenGL, which required advanced skills in graphics programming. Also, CUDA supports programming frameworks such as OpenACC and OpenCL.


Example of CUDA processing flow:

  1. Copy data from main memory to GPU memory
  2. CPU initiates the GPU compute kernel
  3. GPU's CUDA cores execute the kernel in parallel
  4. Copy the resulting data from GPU memory to main memory
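
As an illustration of those four steps (not MATLAB's actual code path), a host-side C++ sketch using cuBLAS might look like this; gpu_matmul is a made-up name, error checking is omitted, and the matrices are assumed to be n x n and column-major as cuBLAS expects:

#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <vector>

// C = A * B on the GPU, all n x n, column-major (cuBLAS convention).
void gpu_matmul(const std::vector<double>& A,
                const std::vector<double>& B,
                std::vector<double>& C, int n)
{
    const size_t bytes = sizeof(double) * n * n;
    double *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);                                   // 1. copy input data
    cudaMalloc(&dB, bytes);                                   //    to GPU memory
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const double alpha = 1.0, beta = 0.0;
    cublasDgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,             // 2.-3. CPU launches the
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n); //       GPU compute kernel

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);  // 4. copy the result back
    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}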

Comparing CPU and GPU Execution Speeds

We ran a benchmark in which we measured the amount of time it took to execute 50 time steps for grid sizes of 64, 128, 512, 1024, and 2048 on an Intel Xeon Processor X5650 and then using an NVIDIA Tesla C2050 GPU.

[Figure: log-scale plot of execution time versus grid size for the Intel Xeon X5650 CPU and the NVIDIA Tesla C2050 GPU]

For a grid size of 2048, the algorithm shows a 7.5x decrease in compute time from more than a minute on the CPU to less than 10 seconds on the GPU. The log scale plot shows that the CPU is actually faster for small grid sizes. As the technology evolves and matures, however, GPU solutions are increasingly able to handle smaller problems, a trend that we expect to continue.

Source

From the introduction to the CUDA C Programming Guide:

Driven by the insatiable market demand for realtime, high-definition 3D graphics, the programmable Graphic Processor Unit or GPU has evolved into a highly parallel, multithreaded, manycore processor with tremendous computational horsepower and very high memory bandwidth, as illustrated by Figure 1 and Figure 2.

Figure 1. Floating-Point Operations per Second for the CPU and GPU


Figure 2. Memory Bandwidth for the CPU and GPU


The reason behind the discrepancy in floating-point capability between the CPU and the GPU is that the GPU is specialized for compute-intensive, highly parallel computation - exactly what graphics rendering is about - and therefore designed such that more transistors are devoted to data processing rather than data caching and flow control, as schematically illustrated by Figure 3.

Figure 3. The GPU Devotes More Transistors to Data Processing


More specifically, the GPU is especially well-suited to address problems that can be expressed as data-parallel computations - the same program is executed on many data elements in parallel - with high arithmetic intensity - the ratio of arithmetic operations to memory operations. Because the same program is executed for each data element, there is a lower requirement for sophisticated flow control, and because it is executed on many data elements and has high arithmetic intensity, the memory access latency can be hidden with calculations instead of big data caches.

Data-parallel processing maps data elements to parallel processing threads. Many applications that process large data sets can use a data-parallel programming model to speed up the computations. In 3D rendering, large sets of pixels and vertices are mapped to parallel threads. Similarly, image and media processing applications such as post-processing of rendered images, video encoding and decoding, image scaling, stereo vision, and pattern recognition can map image blocks and pixels to parallel processing threads. In fact, many algorithms outside the field of image rendering and processing are accelerated by data-parallel processing, from general signal processing or physics simulation to computational finance or computational biology.

Source

Some interesting facts

I've written C++ matrix multiplication that is as fast as Matlab's but it took some care. (Before Matlab was using GPUs for this).

Citation from this answer.

Bharata
  • 2
    That last quote is not “a fact”, it’s empty boasting. That person has gotten several requests for code since he posted that. But no code in sight. – Cris Luengo Jan 17 '19 at 14:31
  • 1
    Your description of how quickly you can do computations on GPU do not address the question at all. We all know that 128 little cores can do more of the same, monotonous work than 2 big cores. “And now MATLAB can also use the GPUs (Graphics processing unit) for this additionally.” Yes, but not by default. Normal matrix multiplication still uses BLAS. – Cris Luengo Jan 17 '19 at 14:34
  • @CrisLuengo, OK, it is not a fact! Maybe you are right about his "boasting" – we do not know about it, and we also do not know why he does not answer. As for the second comment: the description of computations on the GPU answers the question, because matrix multiplication in linear algebra uses floating-point operations. Maybe it is not understandable for all people, but I think they have to understand these basics. Otherwise they have to learn these basics first before reading an article about matrices. And if someone else writes me about it, then I will add these details. Thank you! – Bharata Jan 17 '19 at 15:01
  • @CrisLuengo, I wrote the word `"additionally"`. It means: it can be used. It also means that normal matrix multiplication still uses software libraries. Do you think that I have to change my post to be more understandable? Thank you for your comments! – Bharata Jan 17 '19 at 15:09
2

The sharp contrast is not only due to Matlab's amazing optimization (as discussed by many other answers already), but also to the way you formulated your matrix as an object.

It seems like you made the matrix a list of lists? A list of lists contains pointers to lists, which then contain your matrix elements. The locations of the contained lists are assigned arbitrarily. As you are looping over your first index (the row number?), the memory access time is very significant. In comparison, why don't you try implementing the matrix as a single list/vector using the following method?

#include <vector>

// One contiguous row-major buffer instead of a list of lists.
struct matrix {
    matrix(int x, int y) : n_row(x), n_col(y), M(x * y) {}
    int n_row;
    int n_col;
    std::vector<double> M;
    double &operator()(int i, int j);
};

And

// Element (i, j) lives at offset n_col * i + j in the flat buffer.
double &matrix::operator()(int i, int j) {
    return M[n_col * i + j];
}

The same multiplication algorithm should be used, so that the number of flops is the same (n^3 for square matrices of size n); see the sketch below.
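
A minimal sketch of that loop over the flat layout (A, B, C are hypothetical n x n matrix objects):

// Same triple loop as before, but A(i, m) and B(m, j) now index into
// contiguous buffers instead of chasing per-row pointers.
for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
        double temp = 0;
        for (int m = 0; m < n; m++) {
            temp += A(i, m) * B(m, j);
        }
        C(i, j) = temp;
    }
}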

I'm asking you to time it so that the result is comparable to what you had earlier (on the same machine). With that comparison, you will see exactly how significant memory access time can be!

Argyll
2

It's slow in C++ because you are not using multithreading. Essentially, if A = B C, where they are all matrices, the first row of A can be computed independently of the 2nd row, etc. If A, B, and C are all n by n matrices, the work can in principle be split across up to n^2 independent threads (one per entry of A), since each entry depends only on one row of B and one column of C:

a_{i,j} = sum_{k} b_{i,k} c_{k,j}

If you use, say, Eigen [ http://eigen.tuxfamily.org/dox/GettingStarted.html ], multithreading is built-in and the number of threads is adjustable.
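
For illustration, a minimal sketch with Eigen (assuming Eigen is installed and the code is built with OpenMP enabled so its multithreaded GEMM kicks in):

#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::setNbThreads(4);  // the thread count is adjustable
    const int n = 2048;
    Eigen::MatrixXd A = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd B = Eigen::MatrixXd::Random(n, n);
    Eigen::MatrixXd C = A * B;      // dispatches to Eigen's blocked, vectorized GEMM
    std::cout << C(0, 0) << '\n';   // force use of the result
    return 0;
}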

wsw