
I am taking a look at large matrix multiplication and ran the following experiment to form a baseline test:

  1. Randomly generate two 4096x4096 matrices X, Y from the standard normal distribution (mean 0, stddev 1).
  2. Z = X*Y
  3. Sum elements of Z (to make sure they are accessed) and output.

Here is the naïve C++ implementation:

#include <iostream>
#include <random>

using namespace std;

int main()
{
    constexpr size_t dim = 4096;

    float* x = new float[dim*dim];
    float* y = new float[dim*dim];
    float* z = new float[dim*dim];

    random_device rd;
    mt19937 gen(rd());
    normal_distribution<float> dist(0, 1);

    for (size_t i = 0; i < dim*dim; i++)
    {
        x[i] = dist(gen);
        y[i] = dist(gen);
    }

    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++)
        {
            float acc = 0;

            for (size_t k = 0; k < dim; k++)
                acc += x[row*dim + k] * y[k*dim + col];

            z[row*dim + col] = acc;
        }

    float t = 0;

    for (size_t i = 0; i < dim*dim; i++)
        t += z[i];

    cout << t << endl;

    delete[] x;
    delete[] y;
    delete[] z;
}

Compile and run:

$ g++ -std=gnu++11 -O3 test.cpp
$ time ./a.out

Here is the Octave/MATLAB implementation:

X = stdnormal_rnd(4096, 4096);
Y = stdnormal_rnd(4096, 4096);
Z = X*Y;
sum(sum(Z))

Run:

$ time octave < test.octave

Octave under the hood is using BLAS (I assume the sgemm function).
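
For reference, a minimal sketch that calls sgemm directly from C++ looks roughly like this; it takes Octave out of the measurement entirely (it assumes a CBLAS header is available, e.g. from the ATLAS development packages, and the link flags may differ for other BLAS builds):

// Possible build line (adjust to your BLAS install):
//   g++ -std=gnu++11 -O2 blas_baseline.cpp -lcblas -latlas
#include <cblas.h>
#include <iostream>
#include <random>
#include <vector>

int main()
{
    const int dim = 4096;
    std::vector<float> x(dim*dim), y(dim*dim), z(dim*dim, 0.0f);

    std::mt19937 gen{std::random_device{}()};
    std::normal_distribution<float> dist(0, 1);
    for (auto& v : x) v = dist(gen);
    for (auto& v : y) v = dist(gen);

    // Z = 1*X*Y + 0*Z, all matrices row-major, no transposition
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                dim, dim, dim,
                1.0f, x.data(), dim, y.data(), dim,
                0.0f, z.data(), dim);

    float t = 0;
    for (float v : z) t += v;
    std::cout << t << std::endl;
}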

The hardware is an i7 3930X on Linux x86-64 with 24 GB of RAM. BLAS appears to be using two cores. Perhaps a hyperthreaded pair?

I found that the C++ version compiled with GCC 4.7 on -O3 took 9 minutes to execute:

real    9m2.126s
user    9m0.302s
sys         0m0.052s

The Octave version took 6 seconds:

real    0m5.985s
user    0m10.881s
sys         0m0.144s

I understand that BLAS is optimized to all hell, and the naïve algorithm is totally ignoring caches and so on, but seriously -- 90 times?

Can anyone explain this difference? What exactly is the architecture of the BLAS implementation? I see it is using Fortran, but what is happening at the CPU level? What algorithm is it using? How is it using the CPU caches? What x86-64 machine instructions does it call? (Is it using advanced CPU features like AVX?) Where does it get this extra speed from?

Which key optimizations to the C++ algorithm could get it on par with the BLAS version?

I ran Octave under gdb and stopped it halfway through the computation a few times. It had started a second thread, and here are the stacks (at all stops they looked similar):

(gdb) thread 1
#0  0x00007ffff6e17148 in pthread_join () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x00007ffff1626721 in ATL_join_tree () from /usr/lib/libblas.so.3
#2  0x00007ffff1626702 in ATL_join_tree () from /usr/lib/libblas.so.3
#3  0x00007ffff15ae357 in ATL_dptgemm () from /usr/lib/libblas.so.3
#4  0x00007ffff1384b59 in atl_f77wrap_dgemm_ () from /usr/lib/libblas.so.3
#5  0x00007ffff193effa in dgemm_ () from /usr/lib/libblas.so.3
#6  0x00007ffff6049727 in xgemm(Matrix const&, Matrix const&, blas_trans_type, blas_trans_type) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#7  0x00007ffff6049954 in operator*(Matrix const&, Matrix const&) () from /usr/lib/x86_64-linux-gnu/liboctave.so.1
#8  0x00007ffff7839e4e in ?? () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#9  0x00007ffff765a93a in do_binary_op(octave_value::binary_op, octave_value const&, octave_value const&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#10 0x00007ffff76c4190 in tree_binary_expression::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#11 0x00007ffff76c33a5 in tree_simple_assignment::rvalue1(int) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#12 0x00007ffff76d0864 in tree_evaluator::visit_statement(tree_statement&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#13 0x00007ffff76cffae in tree_evaluator::visit_statement_list(tree_statement_list&) () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#14 0x00007ffff757f6d4 in main_loop() () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1
#15 0x00007ffff7527abf in octave_main () from /usr/lib/x86_64-linux-gnu/liboctinterp.so.1

(gdb) thread 2
#0  0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
(gdb) bt
#0  0x00007ffff14ba4df in ATL_dJIK56x56x56TN56x56x0_a1_b1 () from /usr/lib/libblas.so.3
#1  0x00007ffff15a5fd7 in ATL_dmmIJK2 () from /usr/lib/libblas.so.3
#2  0x00007ffff15a6ae4 in ATL_dmmIJK () from /usr/lib/libblas.so.3
#3  0x00007ffff1518e65 in ATL_dgemm () from /usr/lib/libblas.so.3
#4  0x00007ffff15adf7a in ATL_dptgemm0 () from /usr/lib/libblas.so.3
#5  0x00007ffff6e15e9a in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#6  0x00007ffff6b41cbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#7  0x0000000000000000 in ?? ()

It is calling BLAS gemm as expected.

The first thread appears to be joining the second, so I am not sure whether these two threads account for the 200% CPU usage observed or not.

Which library provides ATL_dgemm in libblas.so.3, and where is its code?

$ ls -al /usr/lib/libblas.so.3
/usr/lib/libblas.so.3 -> /etc/alternatives/libblas.so.3

$ ls -al /etc/alternatives/libblas.so.3
/etc/alternatives/libblas.so.3 -> /usr/lib/atlas-base/atlas/libblas.so.3

$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3
/usr/lib/atlas-base/atlas/libblas.so.3 -> libblas.so.3.0

$ ls -al /usr/lib/atlas-base/atlas/libblas.so.3.0
/usr/lib/atlas-base/atlas/libblas.so.3.0

$ dpkg -S /usr/lib/atlas-base/atlas/libblas.so.3.0
libatlas3-base: /usr/lib/atlas-base/atlas/libblas.so.3.0

$ apt-get source libatlas3-base

It is ATLAS 3.8.4

Here are the optimizations I later implemented:

Using a tiled approach where I preload 64x64 blocks of X, Y and Z into separate arrays.

Changing the calculation of each block so that the inner loop looks like this:

for (size_t tcol = 0; tcol < block_width; tcol++)
    bufz[trow][tcol] += B * bufy[tk][tcol];

This allows GCC to autovectorize to SIMD instructions and also allows for instruction-level parallelism (I think).

Turning on -march=corei7-avx. This gains 30% extra speed but is cheating, because I think the BLAS library is prebuilt.

Here is the code:

#include <iostream>
#include <random>

using namespace std;

constexpr size_t dim = 4096;
constexpr size_t block_width = 64;
constexpr size_t num_blocks = dim / block_width;

double X[dim][dim], Y[dim][dim], Z[dim][dim];

double bufx[block_width][block_width];
double bufy[block_width][block_width];
double bufz[block_width][block_width];

void calc_block()
{
    for (size_t trow = 0; trow < block_width; trow++)
        for (size_t tk = 0; tk < block_width; tk++)
        {
            double B = bufx[trow][tk];

            for (size_t tcol = 0; tcol < block_width; tcol++)
                bufz[trow][tcol] += B * bufy[tk][tcol];
        }
}

int main()
{
    random_device rd;
    mt19937 gen(rd());
    normal_distribution<double> dist(0, 1);

    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++)
        {
            X[row][col] = dist(gen);
            Y[row][col] = dist(gen);
            Z[row][col] = 0;
        }

    for (size_t block_row = 0; block_row < num_blocks; block_row++)
        for (size_t block_col = 0; block_col < num_blocks; block_col++)
        {
            for (size_t trow = 0; trow < block_width; trow++)
                for (size_t tcol = 0; tcol < block_width; tcol++)
                    bufz[trow][tcol] = 0;

            for (size_t block_k = 0; block_k < num_blocks; block_k++)
            {
                for (size_t trow = 0; trow < block_width; trow++)
                    for (size_t tcol = 0; tcol < block_width; tcol++)
                    {
                        bufx[trow][tcol] = X[block_row*block_width + trow][block_k*block_width + tcol];
                        bufy[trow][tcol] = Y[block_k*block_width + trow][block_col*block_width + tcol];
                    }

                calc_block();
            }

            for (size_t trow = 0; trow < block_width; trow++)
                for (size_t tcol = 0; tcol < block_width; tcol++)
                    Z[block_row*block_width + trow][block_col*block_width + tcol] = bufz[trow][tcol];

        }

    double t = 0;

    for (size_t row = 0; row < dim; row++)
        for (size_t col = 0; col < dim; col++)
            t += Z[row][col];

    cout << t << endl;
}

All the action is in the calc_block function - over 90% of the time is spent in it.

The new time is:

real    0m17.370s
user    0m17.213s
sys 0m0.092s

Which is much closer.

The disassembly of the calc_block function is as follows:

0000000000401460 <_Z10calc_blockv>:
  401460:   b8 e0 21 60 00          mov    $0x6021e0,%eax
  401465:   41 b8 e0 23 61 00       mov    $0x6123e0,%r8d
  40146b:   31 ff                   xor    %edi,%edi
  40146d:   49 29 c0                sub    %rax,%r8
  401470:   49 8d 34 00             lea    (%r8,%rax,1),%rsi
  401474:   48 89 f9                mov    %rdi,%rcx
  401477:   ba e0 a1 60 00          mov    $0x60a1e0,%edx
  40147c:   48 c1 e1 09             shl    $0x9,%rcx
  401480:   48 81 c1 e0 21 61 00    add    $0x6121e0,%rcx
  401487:   66 0f 1f 84 00 00 00    nopw   0x0(%rax,%rax,1)
  40148e:   00 00 
  401490:   c4 e2 7d 19 01          vbroadcastsd (%rcx),%ymm0
  401495:   48 83 c1 08             add    $0x8,%rcx
  401499:   c5 fd 59 0a             vmulpd (%rdx),%ymm0,%ymm1
  40149d:   c5 f5 58 08             vaddpd (%rax),%ymm1,%ymm1
  4014a1:   c5 fd 29 08             vmovapd %ymm1,(%rax)
  4014a5:   c5 fd 59 4a 20          vmulpd 0x20(%rdx),%ymm0,%ymm1
  4014aa:   c5 f5 58 48 20          vaddpd 0x20(%rax),%ymm1,%ymm1
  4014af:   c5 fd 29 48 20          vmovapd %ymm1,0x20(%rax)
  4014b4:   c5 fd 59 4a 40          vmulpd 0x40(%rdx),%ymm0,%ymm1
  4014b9:   c5 f5 58 48 40          vaddpd 0x40(%rax),%ymm1,%ymm1
  4014be:   c5 fd 29 48 40          vmovapd %ymm1,0x40(%rax)
  4014c3:   c5 fd 59 4a 60          vmulpd 0x60(%rdx),%ymm0,%ymm1
  4014c8:   c5 f5 58 48 60          vaddpd 0x60(%rax),%ymm1,%ymm1
  4014cd:   c5 fd 29 48 60          vmovapd %ymm1,0x60(%rax)
  4014d2:   c5 fd 59 8a 80 00 00    vmulpd 0x80(%rdx),%ymm0,%ymm1
  4014d9:   00 
  4014da:   c5 f5 58 88 80 00 00    vaddpd 0x80(%rax),%ymm1,%ymm1
  4014e1:   00 
  4014e2:   c5 fd 29 88 80 00 00    vmovapd %ymm1,0x80(%rax)
  4014e9:   00 
  4014ea:   c5 fd 59 8a a0 00 00    vmulpd 0xa0(%rdx),%ymm0,%ymm1
  4014f1:   00 
  4014f2:   c5 f5 58 88 a0 00 00    vaddpd 0xa0(%rax),%ymm1,%ymm1
  4014f9:   00 
  4014fa:   c5 fd 29 88 a0 00 00    vmovapd %ymm1,0xa0(%rax)
  401501:   00 
  401502:   c5 fd 59 8a c0 00 00    vmulpd 0xc0(%rdx),%ymm0,%ymm1
  401509:   00 
  40150a:   c5 f5 58 88 c0 00 00    vaddpd 0xc0(%rax),%ymm1,%ymm1
  401511:   00 
  401512:   c5 fd 29 88 c0 00 00    vmovapd %ymm1,0xc0(%rax)
  401519:   00 
  40151a:   c5 fd 59 8a e0 00 00    vmulpd 0xe0(%rdx),%ymm0,%ymm1
  401521:   00 
  401522:   c5 f5 58 88 e0 00 00    vaddpd 0xe0(%rax),%ymm1,%ymm1
  401529:   00 
  40152a:   c5 fd 29 88 e0 00 00    vmovapd %ymm1,0xe0(%rax)
  401531:   00 
  401532:   c5 fd 59 8a 00 01 00    vmulpd 0x100(%rdx),%ymm0,%ymm1
  401539:   00 
  40153a:   c5 f5 58 88 00 01 00    vaddpd 0x100(%rax),%ymm1,%ymm1
  401541:   00 
  401542:   c5 fd 29 88 00 01 00    vmovapd %ymm1,0x100(%rax)
  401549:   00 
  40154a:   c5 fd 59 8a 20 01 00    vmulpd 0x120(%rdx),%ymm0,%ymm1
  401551:   00 
  401552:   c5 f5 58 88 20 01 00    vaddpd 0x120(%rax),%ymm1,%ymm1
  401559:   00 
  40155a:   c5 fd 29 88 20 01 00    vmovapd %ymm1,0x120(%rax)
  401561:   00 
  401562:   c5 fd 59 8a 40 01 00    vmulpd 0x140(%rdx),%ymm0,%ymm1
  401569:   00 
  40156a:   c5 f5 58 88 40 01 00    vaddpd 0x140(%rax),%ymm1,%ymm1
  401571:   00 
  401572:   c5 fd 29 88 40 01 00    vmovapd %ymm1,0x140(%rax)
  401579:   00 
  40157a:   c5 fd 59 8a 60 01 00    vmulpd 0x160(%rdx),%ymm0,%ymm1
  401581:   00 
  401582:   c5 f5 58 88 60 01 00    vaddpd 0x160(%rax),%ymm1,%ymm1
  401589:   00 
  40158a:   c5 fd 29 88 60 01 00    vmovapd %ymm1,0x160(%rax)
  401591:   00 
  401592:   c5 fd 59 8a 80 01 00    vmulpd 0x180(%rdx),%ymm0,%ymm1
  401599:   00 
  40159a:   c5 f5 58 88 80 01 00    vaddpd 0x180(%rax),%ymm1,%ymm1
  4015a1:   00 
  4015a2:   c5 fd 29 88 80 01 00    vmovapd %ymm1,0x180(%rax)
  4015a9:   00 
  4015aa:   c5 fd 59 8a a0 01 00    vmulpd 0x1a0(%rdx),%ymm0,%ymm1
  4015b1:   00 
  4015b2:   c5 f5 58 88 a0 01 00    vaddpd 0x1a0(%rax),%ymm1,%ymm1
  4015b9:   00 
  4015ba:   c5 fd 29 88 a0 01 00    vmovapd %ymm1,0x1a0(%rax)
  4015c1:   00 
  4015c2:   c5 fd 59 8a c0 01 00    vmulpd 0x1c0(%rdx),%ymm0,%ymm1
  4015c9:   00 
  4015ca:   c5 f5 58 88 c0 01 00    vaddpd 0x1c0(%rax),%ymm1,%ymm1
  4015d1:   00 
  4015d2:   c5 fd 29 88 c0 01 00    vmovapd %ymm1,0x1c0(%rax)
  4015d9:   00 
  4015da:   c5 fd 59 82 e0 01 00    vmulpd 0x1e0(%rdx),%ymm0,%ymm0
  4015e1:   00 
  4015e2:   c5 fd 58 80 e0 01 00    vaddpd 0x1e0(%rax),%ymm0,%ymm0
  4015e9:   00 
  4015ea:   48 81 c2 00 02 00 00    add    $0x200,%rdx
  4015f1:   48 39 ce                cmp    %rcx,%rsi
  4015f4:   c5 fd 29 80 e0 01 00    vmovapd %ymm0,0x1e0(%rax)
  4015fb:   00 
  4015fc:   0f 85 8e fe ff ff       jne    401490 <_Z10calc_blockv+0x30>
  401602:   48 83 c7 01             add    $0x1,%rdi
  401606:   48 05 00 02 00 00       add    $0x200,%rax
  40160c:   48 83 ff 40             cmp    $0x40,%rdi
  401610:   0f 85 5a fe ff ff       jne    401470 <_Z10calc_blockv+0x10>
  401616:   c5 f8 77                vzeroupper 
  401619:   c3                      retq   
  40161a:   66 0f 1f 44 00 00       nopw   0x0(%rax,%rax,1)
Andrew Tomazos
  • Welcome to the world of caches. They can do amazing things... – Mysticial Jan 25 '13 at 23:54
  • 150x is nowhere near the speedup you might be able to get out of better caching behaviour and other optimizations for your problem. – Carl Norum Jan 25 '13 at 23:54
  • You're timing the generation of 32m random numbers as well! I don't know how fast those RNGs are, but to get a real idea of where the problem might be, you should time only the multiplication! You could use some low level API or `boost::auto_cpu_timer` for example to measure times within the program. – us2012 Jan 25 '13 at 23:56
  • @us2012: I already tested it to make sure, the multiplication dominates the time in both cases. – Andrew Tomazos Jan 25 '13 at 23:57
  • `acc += x[row*dim + k] * y[k*dim + col]; z[row*dim + col] = acc;` is awful. – Luchian Grigore Jan 25 '13 at 23:57
  • What Linux Kernel version are you using? I'm trying to figure out whether BLAS is using AVX instructions on your machine. – Mysticial Jan 26 '13 at 00:06
  • @Mysticial: `3.5.0-22-generic` `Ubuntu 12.10 x86-64` – Andrew Tomazos Jan 26 '13 at 00:07
  • "I believe both are using only a single core." There is no way the BLAS version is using only one core: note how CPU time is about twice as much as wall clock time. – R. Martinho Fernandes Jan 26 '13 at 00:07
  • @R.MartinhoFernandes: Ok Ill check. – Andrew Tomazos Jan 26 '13 at 00:08
  • @R.MartinhoFernandes: You're right it is using 2 cores I believe. Top is reporting 200% CPU usage, and this gels with the real/user difference. That explains 2x of it. 45x to go :) – Andrew Tomazos Jan 26 '13 at 00:09
  • Out of curiosity, can you try with `4097` instead of `4096`? – Luchian Grigore Jan 26 '13 at 00:10
  • @LuchianGrigore That will probably improve the time of his code, but it definitely won't get it near BLAS. – Mysticial Jan 26 '13 at 00:12
  • @LuchianGrigore: Octave gives: `real 0m6.203s user 0m11.337s sys 0m0.124s`. Between 0-10% degradation. (That is for 4097 x 4097) – Andrew Tomazos Jan 26 '13 at 00:12
  • @user1131467 the C++ version? – Luchian Grigore Jan 26 '13 at 00:13
  • Actually, it shouldn't matter, because you're not using the cache anyways :) – Luchian Grigore Jan 26 '13 at 00:13
  • @user1131467 I'm running the numbers right now. And I'm actually not impressed by the BLAS numbers that you're getting. For 2 threads, it should actually be twice as fast as what you're seeing right now. Do you have any idea whether BLAS is using AVX? – Mysticial Jan 26 '13 at 00:13
  • @LuchianGrigore: Its running, you'll have to wait 10 minutes sorry :) – Andrew Tomazos Jan 26 '13 at 00:14
  • @user1131467 If you're wondering why Luchian asked you to test 4097, he's trying to test for the effect of [this](http://stackoverflow.com/questions/12264970/why-is-my-program-slow-when-looping-over-exactly-8192-elements). – Mysticial Jan 26 '13 at 00:15
  • @Mysticial: I suspect, but I am not sure, that Octave is using `libblas` (see the `libblas-dev` package), and that seems to call `F77_sgemm`, but I am not sure of either connection, and I haven't dug into Fortran. – Andrew Tomazos Jan 26 '13 at 00:15
  • Well, if it helps: If BLAS is not using AVX instructions, then your benchmarks show that it is achieving 76% of the maximum possible speed of the `O(N^3)` algorithm on your processor (3.8 GHz, 2 threads). If it is indeed using AVX, then it is only achieving 38% of optimal. 70%+ is typical for a well-optimized dense matrix library. The reason why the naive approach is slow is indeed largely due to cache. – Mysticial Jan 26 '13 at 00:22
  • @LuchianGrigore: 4097x4097 C++: `real 9m26.462s user 9m24.495s sys 0m0.084s` – Andrew Tomazos Jan 26 '13 at 00:27
  • Correction to my previous comment: I see that you're using `float`s instead of `double`s. That's another factor of 2 that I can't explain. So in other words, BLAS should be 4x faster than what you are seeing. – Mysticial Jan 26 '13 at 00:28
  • @Mysticial: Yes, I'm not clear on whether Octave is using float or double; I decided to give it the benefit of the doubt. – Andrew Tomazos Jan 26 '13 at 00:34

5 Answers


Here are three factors responsible for the performance difference between your code and BLAS (plus a note on Strassen’s algorithm).

In your inner loop, on k, you have y[k*dim + col]. Because of the way memory cache is arranged, consecutive values of k with the same dim and col map to the same cache set. The way cache is structured, each memory address has one cache set where its contents must be held while it is in cache. Each cache set has several lines (four is a typical number), and each of those lines can hold any of the memory addresses that map to that particular cache set.

Because your inner loop iterates through y in this way, each time it uses an element from y, it must load the memory for that element into the same set as the previous iteration did. This forces one of the previous cache lines in the set to be evicted. Then, in the next iteration of the col loop, all of the elements of y have been evicted from cache, so they must be reloaded again.

Thus, every time your loop loads an element of y, it must be loaded from memory, which takes many CPU cycles.

High-performance code avoids this in two ways. One, it divides the work into smaller blocks. The rows and the columns are partitioned into smaller sizes, and processed with shorter loops that are able to use all the elements in a cache line and to use each element several times before they go on to the next block. Thus, most of the references to elements of x and elements of y come from cache, often in a single processor cycle. Two, in some situations, the code will copy data out of a column of a matrix (which thrashes cache due to the geometry) into a row of a temporary buffer (which avoids thrashing). This again allows most of the memory references to be served from cache instead of from memory.
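
As a minimal illustration of the access-pattern point (this is not what BLAS does, just a reordering of the question's triple loop so the innermost index walks contiguous memory in both y and z):

// i-k-j order: the inner loop reads y[k*dim + ...] and updates z[row*dim + ...]
// sequentially, and x[row*dim + k] is reused across the whole inner loop.
// (z must be zero-initialized before this loop.)
for (size_t row = 0; row < dim; row++)
    for (size_t k = 0; k < dim; k++)
    {
        const float a = x[row*dim + k];

        for (size_t col = 0; col < dim; col++)
            z[row*dim + col] += a * y[k*dim + col];
    }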

Another factor is the use of Single Instruction Multiple Data (SIMD) features. Many modern processors have instructions that load multiple elements (four float elements is typical, but some now do eight) in one instruction, store multiple elements, add multiple elements (e.g., for each of these four, add it to the corresponding one of those four), multiply multiple elements, and so on. Simply using such instructions immediately makes your code four times faster, provided you are able to arrange your work to use those instructions.

These instructions are not directly accessible in standard C. Some optimizers now try to use such instructions when they can, but this optimization is difficult, and it is not common to gain much benefit from it. Many compilers provide extensions to the language that give access to these instructions. Personally, I usually prefer to write in assembly language to use SIMD.
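
For example, here is a sketch of the block inner loop from the question's update written with the AVX intrinsics GCC and other compilers provide (it assumes 64x64 blocks of double and an AVX CPU without FMA, as on Sandy Bridge; it only illustrates the extensions, it is not ATLAS's kernel):

#include <immintrin.h>   // AVX intrinsics; compile with -mavx or -march=corei7-avx

void calc_block_avx(double bufz[64][64],
                    const double bufx[64][64],
                    const double bufy[64][64])
{
    for (int trow = 0; trow < 64; trow++)
        for (int tk = 0; tk < 64; tk++)
        {
            // Broadcast one element of bufx into all four lanes.
            __m256d b = _mm256_broadcast_sd(&bufx[trow][tk]);

            for (int tcol = 0; tcol < 64; tcol += 4)
            {
                __m256d y = _mm256_loadu_pd(&bufy[tk][tcol]);
                __m256d z = _mm256_loadu_pd(&bufz[trow][tcol]);
                z = _mm256_add_pd(z, _mm256_mul_pd(b, y));   // z += b * y, four doubles at a time
                _mm256_storeu_pd(&bufz[trow][tcol], z);
            }
        }
}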

Another factor is using instruction-level parallel execution features on a processor. Observe that in your inner loop, acc is updated. The next iteration cannot add to acc until the previous iteration has finished updating acc. High-performance code will instead keep multiple sums running in parallel (even multiple SIMD sums). The result of this will be that while the addition for one sum is executing, the addition for another sum will be started. It is common on today’s processors to support four or more floating-point operations in progress at a time. As written, your code cannot do this at all. Some compilers will try to optimize the code by rearranging loops, but this requires the compiler to be able to see that iterations of a particular loop are independent from each other or can be commuted with another loop, et cetera.
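
A sketch of that idea applied to the question's original inner product (it keeps the cache problem and only shows how to break the dependency chain; it assumes dim is a multiple of 4 and changes the floating-point summation order slightly):

// Four independent partial sums: the additions into acc0..acc3 do not
// depend on each other, so the CPU can overlap them.
float acc0 = 0, acc1 = 0, acc2 = 0, acc3 = 0;

for (size_t k = 0; k < dim; k += 4)
{
    acc0 += x[row*dim + k + 0] * y[(k + 0)*dim + col];
    acc1 += x[row*dim + k + 1] * y[(k + 1)*dim + col];
    acc2 += x[row*dim + k + 2] * y[(k + 2)*dim + col];
    acc3 += x[row*dim + k + 3] * y[(k + 3)*dim + col];
}

z[row*dim + col] = (acc0 + acc1) + (acc2 + acc3);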

It is quite feasible that using cache effectively provides a factor of ten performance improvement, SIMD provides another four, and instruction-level parallelism provides another four, giving 160 altogether.

Here is a very crude estimate of the effect of Strassen’s algorithm, based on this Wikipedia page. The Wikipedia page says Strassen is slightly better than direct multiplication around n = 100. This suggests the ratio of the constant factors of the execution times is 100^3 / 100^2.807 ≈ 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. However, simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096 ((4096/100)^(3-2.807) ≈ 2.05). Again, that is just a ballpark estimate.

As for the later optimizations, consider this code in the inner loop:

bufz[trow][tcol] += B * bufy[tk][tcol];

One potential issue with this is that bufz could, in general, overlap bufy. Since you use global definitions for bufz and bufy, the compiler likely knows they do not overlap in this case. However, if you move this code into a subroutine that is passed bufz and bufy as parameters, and especially if you compile that subroutine in a separate source file, then the compiler is less likely to know that bufz and bufy do not overlap. In that case, the compiler cannot vectorize or otherwise reorder the code, because the bufz[trow][tcol] in this iteration might be the same as bufy[tk][tcol] in another iteration.

Even if the compiler can see that the subroutine is called with different bufz and bufy in the current source module, if the routine has extern linkage (the default), then the compiler has to allow for the routine to be called from an external module, so it must generate code that works correctly if bufz and bufy overlap. (One way the compiler can deal with this is to generate two versions of the routine, one to be called from external modules and one to be called from the module currently being compiled. Whether it does that depends on your compiler, the optimization switches, et cetera.) If you declare the routine as static, then the compiler knows it cannot be called from an external module (unless you take its address and there is a possibility the address is passed outside of the current module).
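
For example, a sketch of the kernel taking its buffers as parameters, using static together with GCC's __restrict__ extension (restrict in C99) to promise the compiler the buffers do not overlap:

// 'static' keeps the function local to this translation unit, and
// __restrict__ tells the compiler the three buffers never alias,
// so it remains free to vectorize and reorder the loads and stores.
static void calc_block(double (* __restrict__ bufz)[64],
                       const double (* __restrict__ bufx)[64],
                       const double (* __restrict__ bufy)[64])
{
    for (int trow = 0; trow < 64; trow++)
        for (int tk = 0; tk < 64; tk++)
        {
            const double b = bufx[trow][tk];

            for (int tcol = 0; tcol < 64; tcol++)
                bufz[trow][tcol] += b * bufy[tk][tcol];
        }
}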

Another potential issue is that, even if the compiler vectorizes this code, it does not necessarily generate the best code for the processor you execute on. Looking at the generated assembly code, it appears the compiler is using only %ymm1 repeatedly. Over and over again, it multiplies a value from memory into %ymm1, adds a value from memory to %ymm1, and stores a value from %ymm1 to memory. There are two problems with this.

One, you do not want these partial sums stored to memory frequently. You want many additions accumulated into a register, and the register will be written to memory only infrequently. Convincing the compiler to do this likely requires rewriting the code to be explicit about keeping partial sums in temporary objects and writing them to memory after a loop has completed.

Two, these instructions are nominally serially dependent. The add cannot start until the multiply completes, and the store cannot write to memory until the add completes. The Core i7 has great capabilities for out-of-order execution. So, while it has that add waiting to start execution, it looks at the multiply later in the instruction stream and starts it. (Even though that multiply also uses %ymm1, the processor remaps the registers on the fly, so that it uses a different internal register to do this multiply.) Even though your code is filled with consecutive dependencies, the processor tries to execute several instructions at once. However, a number of things can interfere with this. You can run out of the internal registers the processor uses for renaming. The memory addresses you use might run into false conflicts. (The processor looks at a dozen or so of the low bits of memory addresses to see if the address might be the same as another one that it is trying to load or store from an earlier instruction. If the bits are equal, the processor has to delay the current load or store until it can verify the entire address is different. This delay can bollux up more than just the current load or store.) So, it is better to have instructions that are overtly independent.
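
One way to coax both of those out of C, staying with the 64x64 blocks from the question's update, is a sketch like this (not ATLAS's actual kernel): keep a small strip of bufz in locals for the whole tk loop and store it once at the end, with the four sums independent of each other.

for (int trow = 0; trow < 64; trow++)
    for (int tcol = 0; tcol < 64; tcol += 4)
    {
        // Partial sums stay in registers for the entire tk loop.
        double s0 = bufz[trow][tcol + 0], s1 = bufz[trow][tcol + 1];
        double s2 = bufz[trow][tcol + 2], s3 = bufz[trow][tcol + 3];

        for (int tk = 0; tk < 64; tk++)
        {
            const double b = bufx[trow][tk];
            s0 += b * bufy[tk][tcol + 0];
            s1 += b * bufy[tk][tcol + 1];
            s2 += b * bufy[tk][tcol + 2];
            s3 += b * bufy[tk][tcol + 3];
        }

        // One store per element instead of one per tk iteration.
        bufz[trow][tcol + 0] = s0; bufz[trow][tcol + 1] = s1;
        bufz[trow][tcol + 2] = s2; bufz[trow][tcol + 3] = s3;
    }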

That is one more reason I prefer to write high-performance code in assembly. To do it in C, you have to convince the compiler to give you instructions like this, by doing things such as writing some of your own SIMD code (using the language extensions for them) and manually unrolling loops (writing out multiple iterations).

When copying into and out of buffers, there might be similar issues. However, you report 90% of the time is spent in calc_block, so I have not looked at this closely.

Eric Postpischil
  • This answer is good in that it answers a lot of the poorly-understood parts of the algorithm's performance, but as mentioned in other answers, the naive multiplication is much slower than Strassen's or another mega-fast algo for large matrices. Unless I'm reading it wrong. These algorithms differ hugely based on the size of the matrix, so it may not even be a factor. – argentage Jan 26 '13 at 00:34
  • Yes, in my heterogeneous parallel programming class we implemented tiled matrix multiplication using CUDA on GPU, which involves optimizing for SIMD and using a block cache ("work group" in OpenCL) to load a tile of the matrix into cache and then localize work. It is very surprising to see 90x improvement from such techniques in a single-threaded environment though. I will investigate and try to support your claim. – Andrew Tomazos Jan 26 '13 at 00:38
  • If we transpose y that should give an immediate improvement as we can now access it in column order. – Andrew Tomazos Jan 26 '13 at 00:41
  • Great points! I would add that to test the cache-factor on the performance, the `y[k*dim + col]` could simply be changed to `y[col*dim + k]`, just to see how far that goes towards getting closer to BLAS's performance. – Mikael Persson Jan 26 '13 at 00:41
  • @MikaelPersson: Holy crap. `real 1m3.350s user 1m2.972s sys 0m0.148s`. 9x improvement. Impressive. – Andrew Tomazos Jan 26 '13 at 00:46
  • Though, is BLAS really transposing one of the matrices? Of course you could simulate this improvement through tiling. – Andrew Tomazos Jan 26 '13 at 00:51
  • @user1131467: Got 9x improvement on my side too. And yes, tiling the storage of the matrix is going to have a very similar effect. And for large matrices, most matrix algorithms can be adapted to and benefit from tiling, so it is very likely that large matrices are always stored that way, period. AFAIK, all optimized linear algebra packages use some form of cache-optimized storage for large matrices; otherwise they would hardly earn the title of being "optimized". And btw, Octave uses [LAPACK](http://en.wikipedia.org/wiki/LAPACK) (and so does Matlab). – Mikael Persson Jan 26 '13 at 01:01
  • @MikaelPersson: Yes and LAPACK uses BLAS I believe: http://en.wikipedia.org/wiki/General_Matrix_Multiply – Andrew Tomazos Jan 26 '13 at 01:07
  • @user1131467: Sorry, I confused two levels (LAPACK is higher up: decompositions and linear solvers). Most likely, Octave uses [ATLAS](http://en.wikipedia.org/wiki/Automatically_Tuned_Linear_Algebra_Software) as its BLAS implementation, although you can configure it to use something else. – Mikael Persson Jan 26 '13 at 01:14
  • @MikaelPersson: I'll debug and decompile into octave and report what library it is using. – Andrew Tomazos Jan 26 '13 at 01:30
  • I've implemented your suggestions. Please see __Update 2__ in my OP. The time is now 17 seconds, which is only 2-3 times slower. If you have any other ideas for why it isn't there yet, please let me know, thanks. – Andrew Tomazos Jan 26 '13 at 10:05
  • I don't think ATLAS is using Strassen. It is very hard to reverse engineer ATLAS, as the build system is this insane self-tuning monstrosity; it is almost impossible to go from symbols in the .so to the original source code without spending days studying the internals. Plus all the function names use mnemonics and acronyms. But from my various readings I get the impression that Strassen no longer works on modern processors (minimal benefit), as it essentially trades multiplies for adds, but it's memory bandwidth and locality that are the bottleneck these days. – Andrew Tomazos Jan 27 '13 at 01:19
  • I think ATLAS is using a tiled approach similar to mine, with the equivalent of calc_block being what they call "kernels". The build system tests different kernels and kernel sizes, then compiles in the one with the best performance. The prebuilt one selected on Ubuntu is "ATL_dJIK56x56x56TN56x56x0_a1_b1", as you can see from the stack trace. The "JIK" refers to the triple loop order (inner_k, row, col). The 56x56 refers to the block size. As for the rest, your guess is as good as mine. – Andrew Tomazos Jan 27 '13 at 01:22
  • @user1131467: subtle technicalities aside, I think now it's pretty much clear that complete lack of cache locality is the answer to your question. – akappa Jan 27 '13 at 11:15
  • @akappa: No, cache is only about 10x; the other 10x is SIMD and ILP, as Eric explained accurately. – Andrew Tomazos Jan 27 '13 at 11:51

Strassen's algorithm has two advantages over the naïve algorithm:

  1. Better time complexity in terms of number of operations, as other answers correctly point out;
  2. It is a cache-oblivious algorithm. The difference in number of cache misses is in the order of B*M^(1/2), where B is the cache line size and M is the cache size.

I think that the second point accounts for a lot of the slowdown you are experiencing. If you are running your application under Linux, I suggest you run it under the perf tool, which tells you how many cache misses the program is experiencing.
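
For example, a generic invocation (the exact event names available vary by CPU and kernel) is:

$ perf stat -e cache-references,cache-misses ./a.out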

akappa

I don't know how reliable the information is, but Wikipedia says that BLAS uses Strassen's algorithm for big matrices. And yours are big indeed. That is around O(n^2.807), which is better than the O(n^3) of your naïve algorithm.

Csq
  • `4096^2.807 = 1.38005e+10`, `4096^3 = 6.87195e+10`. Also, Strassen's constant overhead is higher. This only explains a small part of the difference seen by the OP. – us2012 Jan 26 '13 at 00:18
  • It would be (4096^2)^2.807 vs (4096^2)^3. – Puppy Jan 26 '13 at 00:19
  • @DeadMG: I think the n they use is the matrix width, not the input size. The naive algorithm is obviously O(width^3) not O(width^6) – Andrew Tomazos Jan 26 '13 at 00:22
  • @us2012 Yep, it should be around 5 times faster if the constants were the same. But they never are. Different algos will generate different optimization possibilities as well (pipeline, cache, memory alignment). – Csq Jan 26 '13 at 00:32
  • @Csq Of course. I'm not saying your answer is wrong, just that it's only a small part of the big picture. – us2012 Jan 26 '13 at 00:32
  • @us2012 Yes, I agree. Thanks for pointing out the calculation mistake! – Csq Jan 26 '13 at 00:36
  • I'm not seeing that BLAS uses Strassen's? – Mattwmaster58 Nov 06 '21 at 00:17

This is quite a complex topic, and it is well answered by Eric in the post above. I just want to point to a useful reference in this direction, page 84:

http://www.rrze.fau.de/dienste/arbeiten-rechnen/hpc/HPC4SE/

which suggests applying "loop unroll and jam" on top of blocking.
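
Roughly, on the 64x64 block kernel from the question, that means unrolling the trow loop and jamming the unrolled bodies together, so each load of bufy[tk][tcol] is reused for two output rows. A sketch of the technique (my reading of it, not the book's code):

for (int trow = 0; trow < 64; trow += 2)   // unroll trow by 2
    for (int tk = 0; tk < 64; tk++)
    {
        const double b0 = bufx[trow + 0][tk];
        const double b1 = bufx[trow + 1][tk];

        for (int tcol = 0; tcol < 64; tcol++)
        {
            const double yv = bufy[tk][tcol];   // loaded once, used for both rows
            bufz[trow + 0][tcol] += b0 * yv;
            bufz[trow + 1][tcol] += b1 * yv;
        }
    }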

Can anyone explain this difference?

A general explanation is that the ratio of the number of operations to the amount of data is O(N^3)/O(N^2). Thus matrix-matrix multiplication is a cache-bound algorithm, which means that you don't suffer from the common memory-bandwidth bottleneck for large matrix sizes. You can get up to 90% of the peak performance of your CPU if the code is well optimized. So the optimization potential, elaborated by Eric, is tremendous, as you observed. Actually, it would be very interesting to see the best-performing code, and to compile your final program with another compiler (Intel usually brags about being the best).

OSZ

About half of the difference is accounted for in algorithmic improvement. (4096*4096)^3 is the complexity of your algorithm, or 4.7x10^21, and (4096*4096)^2.807 is 1x10^20. That's a difference of about 47x.

The other 2x will be accounted for by more intelligent use of the cache, SSE instructions, and other such low-level optimizations.

Edit: I lie, n is width, not width^2. The algorithm would only actually account for about 4x, so there's still about another 22x to go. Threads, cache, and SSE instructions may well account for such things.

Puppy
  • I think your numbers are wrong. The naive algorithm is ~4096^3. Take a look at the loop. It is three nested for loops over 4096 range. – Andrew Tomazos Jan 26 '13 at 00:24
  • Plus we don't know the constant factors, which may differ between the naive and Strassen (by, in fact, any amount), so we cannot compare at this resolution. The Strassen algorithm is worth investigating though. – Andrew Tomazos Jan 26 '13 at 00:26
  • @user1131467: The [Wikipedia page](http://en.wikipedia.org/wiki/Matrix_multiplication) says Strassen is slightly better than direct multiplication around n = 100. This suggests the factor is 100**3 / 100**2.807 = 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. Simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096. – Eric Postpischil Jan 26 '13 at 00:36
  • Depends on what kind of SSE, cache effects, and threading considerations are involved on a modern CPU. Even a few years since that was tested could change the answer. – Puppy Jan 26 '13 at 07:33
  • Just noticed the formatting problem in my old comment. (Did the interpretation of “\*\*” as bold change?) Here is an update: @user1131467: The [Wikipedia page](http://en.wikipedia.org/wiki/Matrix_multiplication) says Strassen is slightly better than direct multiplication around n = 100. This suggests the factor is 100^3 / 100^2.807 = 2.4. Obviously, this will vary tremendously depending on processor model, matrix sizes interacting with cache effects, and so on. Simple extrapolation shows that Strassen is about twice as good as direct multiplication at n = 4096. – Eric Postpischil Nov 14 '17 at 01:43