
I wrote a program for fast matrix multiplication. To make maximal use of the CPU cache, it divides the matrix into 10*10 cells and multiplies the cells separately (increasing the cell size to 20*20 or 50*50 does not change the time significantly).

It turns out that the speed depends significantly on whether the matrix size and the cell size are known in advance or not.

The program is:

#include <cmath>
#include <cstdlib>
#include <iostream>

using namespace std;

#define forall(i,n) for(int i=0; i<(int)(n); i++)

// Copy the N2*N2 cell of x whose top-left corner is (iStart, jStart) into the contiguous buffer x2.
inline void Load(int N, int N2, float* x2, float* x, int iStart, int jStart) {
    int start = iStart * N + jStart;
    forall (i, N2)
        forall (j, N2)
            x2[i * N2 + j] = x[start + i * N + j];
}

// Add the N2*N2 cell x2 back into x at (iStart, jStart).
inline void Add(int N, int N2, float* x, float* x2, int iStart, int jStart) {
    int start = iStart * N + jStart;
    forall (i, N2)
        forall (j, N2)
            x[start + i * N + j] += x2[i * N2 + j];
}

// Naive multiply of two contiguous N*N cells (N here is the cell size N2).
inline void Mul(int N, float* z, float* x, float* y) {
    forall (i, N)
        forall (j, N) {
            double sum = 0;
            forall (k, N)
                sum += x[i*N+k] * y[k*N+j];
            z[i*N+j] = sum;
        }
}

inline double RandReal() {return random()/((double)RAND_MAX+1);}

int main(int argc, char** argv) {
#if VAR==3
    const int N = atoi(argv[1]), N2 = atoi(argv[2]);
#elif VAR==2
    const int N = 2000, N2 = atoi(argv[2]);
#elif VAR==1
    const int N = atoi(argv[1]), N2 = 10;
#elif VAR==0
    const int N = 2000, N2 = 10;
#else
    #error "Bad VAR"
#endif
    cout << "VAR=" << VAR << " N=" << N << " N2=" << N2 << endl;
    float x[N*N], y[N*N];
    forall (i, N)
        forall (j, N) {
            x[i*N+j] = RandReal();
            y[i*N+j] = RandReal();
        }
    float z[N*N];
    forall (i, N)
        forall (j, N)
            z[i*N+j] = 0;
    for (int i1 = 0; i1 < N; i1 += N2) {
        float x2[N2*N2], y2[N2*N2], z2[N2*N2];
        for (int j1 = 0; j1 < N; j1 += N2) {
            Load(N, N2, x2, x, i1, j1);
            for (int k1 = 0; k1 < N; k1 += N2) {
                Load(N, N2, y2, y, j1, k1);
                Mul(N2, z2, x2, y2);
                Add(N, N2, z, z2, i1, k1);
            }
        }
    }

    double val = 0, val2 = 0;
    forall (i, N)
        forall (j, N)
            val += z[i*N+j], val2 += z[i*N+j]*(i+j);
    cout << "val=" << val << " val2=" << val2 << endl;
}

Now the execution times:

$ for a in 0 1 2 3; do g++ -DVAR=$a -O3 -Wall -o mat mat.cpp; time ./mat 2000 10; done
VAR=0 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real    0m8.127s
user    0m8.108s
sys     0m0.020s
VAR=1 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real    0m3.304s
user    0m3.292s
sys     0m0.012s
VAR=2 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real    0m25.395s
user    0m25.388s
sys     0m0.008s
VAR=3 N=2000 N2=10
val=2.00039e+09 val2=3.99867e+12

real    0m25.515s
user    0m25.495s
sys     0m0.016s

In simple terms:

  • Matrix size unknown, cell size known (VAR=1): 3.3 seconds
  • Both matrix size and cell size known (VAR=0): 8.1 seconds
  • Cell size unknown (VAR=2 and VAR=3): 25.4-25.5 seconds

Why is that? I use g++ 5.4.0.

`inline` does not play any role; if we remove it, the results are the same.

user31264
  • Have you looked at the difference in the generated assembly for each case? – Retired Ninja Jun 30 '18 at 05:37
  • @RetiredNinja - I don't know assembler – user31264 Jun 30 '18 at 05:47
  • If things are known ahead of time, the compiler can do really cool things to the loops, possibly even compiling the loop out. Also worth noting that `float z[N*N];` with `N`=2000 will require 16,000,000 bytes of stack, something you probably won't have if you haven't increased the stack size. – user4581301 Jun 30 '18 at 06:10
  • @user4581301 - the fastest version is when the compiler does NOT know things ahead of time. – user31264 Jun 30 '18 at 07:17
  • I cannot reproduce all of your results. Here, VAR=0/1 runs in 2.7sec, VAR=2/3 runs in ~7sec, so there is no difference whether N is known/unknown. – geza Jun 30 '18 at 08:52
  • The main difference between N2 known/unknown is that with N2 known, the compiler emits highly unrolled loops, which kinda explains the difference – geza Jun 30 '18 at 08:58
  • @geza which version of g++ do you use? – user31264 Jun 30 '18 at 09:55
  • Tried 5.4.1 and 7.2.0, same results – geza Jun 30 '18 at 10:03
  • Btw, you can speed this up a little bit, if you use `float` uniformly (no `doubles`), because the float->double conversion takes time – geza Jun 30 '18 at 10:06
  • It is very possible that the compiler makes a bad assumption when optimizing based on the matrix size. Just as an example, unrolling a loop of 10 with a small body is probably beneficial, but unrolling a loop of 2000 may be too large to fit in the instruction cache. The point is sort-of moot anyway. Unless you have allocated the program a stack size of around 50 MB somewhere outside the information presented in the question, the program is a big ball of Undefined Behaviour that you cannot trust to provide good numbers. – user4581301 Jun 30 '18 at 16:38
  • @user4581301 - I changed `float x[N*N]` to `float* X=new float[N*N]` (similar for `y` and `z`), nothing changed. – user31264 Jul 01 '18 at 08:47
  • Then UB was in your favour or your system is configured to offer an unusually large stack. For the speed difference, I'm sticking with my suspicion that the compiler makes an optimization decision that just doesn't work out on your hardware. What it is, I don't know. If you really want to know, you'll have to profile the two and find out. – user4581301 Jul 01 '18 at 20:16
  • @geza Ran some tests, my results similar to yours, on MSVC at least, rewrote my answer in consequence. – Paul Sanders Jul 02 '18 at 12:38
  • @user4581301 Allocating such humungous objects on the stack works for me if I increase `ulimit`, and then I get the same timings as I do if I allocate these objects on the heap, see rewritten answer. – Paul Sanders Jul 02 '18 at 12:39
  • @PaulSanders: unfortunately, these timing numbers depend a lot on the compiler. With clang, I got very different results (N2 known is much worse than gcc's, N2 unknown is better - so the difference between N2 known/unknown is smaller with clang) – geza Jul 02 '18 at 13:01
  • @geza Yes, I'd certainly expect detailed differences, although the _pattern_ across all three compilers seems to be the same (and it sounds like what you saw with clang is similar to what I saw with (Apple's) clang). Do you want to edit your own timings into my answer? Or post an answer of your own, just with those timings in, for reference? As I say, I can't easily test with gcc, unfortunately. But what I don't get about all this is just how **totally weird** the _OP's_ timings are. Since neither of us see anything like that, we're really just guessing there. – Paul Sanders Jul 02 '18 at 14:49
  • @PaulSanders: Sorry, I kinda lost the interest of this question. We see the effect of loop unrolling here, which easily causes such big differences. As for why OP has some weird results... dunno, but it is hard to figure this out without having access to that machine. – geza Jul 02 '18 at 18:15
  • @geza _I kinda lost the interest of this question._ Yeah, me too. – Paul Sanders Jul 02 '18 at 20:09

1 Answer


Introductory note: Much of this post has been re-written, so some of the comments below won't make too much sense anymore. Please look behind the edit for details, if you care. So...

tl;dr

  • profile to find hot loops
  • use constant counts for these if possible
  • try manually unrolling them if not
  • OP's results a mystery, nobody else can reproduce anything so extreme

I agree with @user4581301 - the more the compiler knows ahead of time, the more it can do for you in terms of optimising your code.

But you need to profile this code - wall clock times will only take you so far. I don't know a whole lot about profilers for gcc (I have a good one for MSVC) but you could try your luck here.
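(On Linux, perf is the usual first stop; something along these lines, assuming it's installed - I haven't run these on your machine, obviously:)

$ perf stat ./mat 2000 10        # coarse counters: cycles, instructions, branch misses
$ perf record -g ./mat 2000 10   # sample call stacks
$ perf report                    # should point at the hot loop(s)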

It also pays (as @RetiredNinja said right off the bat) to try to learn some assembler, using Godbolt as a tool, especially if you want to understand a slowdown as dramatic as this.
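(And if you'd rather look at the generated code locally than on Godbolt, gcc will dump annotated assembly for you; these are standard gcc flags:)

$ g++ -DVAR=0 -O3 -S -fverbose-asm -masm=intel mat.cpp -o mat0.s
$ g++ -DVAR=2 -O3 -S -fverbose-asm -masm=intel mat.cpp -o mat2.s
$ diff mat0.s mat2.s             # see what knowing N2 changes in Mul()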

Now having said all that, your timings make no sense to me, so something strange is going on on your system. So I ran some tests of my own, and my results differ markedly from yours. I ran some of these tests on MSVC (because I have such wonderful profiling tools there) and some on gcc on the Mac (although I think that is actually secretly clang under the hood). I have no Linux box, sorry.

Firstly, let's deal with the issue of allocating such large objects on the stack. This is obviously not wise, and I can't do it at all on MSVC since it doesn't support variable length arrays, but my tests on the Mac showed that this made no difference to the timings once I had increased ulimit to get it to work at all (see here). So I think we can discount that as a factor, as you yourself say in the comments.
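(The USE_STACK=0 rows in the timings below are the heap variant. A minimal sketch of that change - the USE_STACK macro is from my test code, the rest is just the obvious substitution, with freeing omitted for brevity:)

#if USE_STACK
    float x[N*N], y[N*N], z[N*N];   // VLAs: ~48 MB of stack, needs ulimit -s raised
#else
    // heap allocation also keeps MSVC happy, since it has no VLAs
    float *x = new float[N*N], *y = new float[N*N], *z = new float[N*N];
#endif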

So now let's look at the timings I got on the Mac:

VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
user    0m10.813s

VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known)
user    0m11.008s

VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown)
user    0m12.714s

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
user    0m12.778s

VAR=0 USE_STACK=1 N=2000 (known) N2=10 (known)
user    0m10.617s

VAR=1 USE_STACK=1 N=2000 (unknown) N2=10 (known)
user    0m10.987s

VAR=2 USE_STACK=1 N=2000 (known) N2=10 (unknown)
user    0m12.653s

VAR=3 USE_STACK=1 N=2000 (unknown) N2=10 (unknown)
user    0m12.673s

Not much to see there; let's move on to what I observed on MSVC (where I can profile):

VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
Elapsed: 0:00:06.89

VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known)
Elapsed: 0:00:06.86

VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown)
Elapsed: 0:00:10.24

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed: 0:00:10.39

Now we have something we can get our teeth into. As @geza observed, the code takes longer to run when N2 isn't known, which is entirely in line with what one might expect, since that's where the hot loops are, and the compiler is much more likely to unroll such a loop when it knows it comprises a small, fixed number of iterations.

So let's get some information from the profiler. It tells me that the hot loop (by quite a long way) is the inner loop in Mul():

inline void Mul(int N, float* z, float* x, float* y) {
    forall (i, N)
        forall (j, N) {
            double sum = 0;
=>          forall (k, N)
=>              sum += x[i*N+k] * y[k*N+j];
            z[i*N+j] = sum;
        }
}

Again, I can't say this surprises me much, and when I look at the code I can see that the loop is not unrolled at all (loop setup code omitted for brevity):

$1:
    movss       xmm0,dword ptr [r9+rsi*4]  
    mulss       xmm0,dword ptr [r8+4]  
    movss       xmm1,dword ptr [r9+r15*4]  
    mulss       xmm1,dword ptr [r8]  
    cvtps2pd    xmm2,xmm0  
    cvtps2pd    xmm0,xmm1  
    movss       xmm1,dword ptr [r8+8]  
    mulss       xmm1,dword ptr [r9]  
    addsd       xmm0,xmm3  
    addsd       xmm2,xmm0  
    cvtps2pd    xmm0,xmm1  
    movss       xmm1,dword ptr [r9+r14*4]  
    movaps      xmm3,xmm2  
    mulss       xmm1,dword ptr [r8+0Ch]  
    add         r9,rbp  
    add         r8,10h  
    addsd       xmm3,xmm0  
    cvtps2pd    xmm0,xmm1  
    addsd       xmm3,xmm0  
    sub         rcx,1  
    jne         $1

Now, it doesn't look as though there would be any savings to be made by unrolling that loop, since looping around is going to be cheap compared to executing all the rest of the code in there, but if you look at the disassembly of the same loop when N2 is known, you get a surprise:

    movss       xmm0,dword ptr [rax-8]  
    mulss       xmm0,dword ptr [rcx-50h]  
    cvtps2pd    xmm2,xmm0  
    movss       xmm0,dword ptr [rcx-28h]  
    mulss       xmm0,dword ptr [rax-4]  
    addsd       xmm2,xmm7  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx]  
    mulss       xmm0,dword ptr [rax]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+28h]  
    mulss       xmm0,dword ptr [rax+4]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+50h]  
    mulss       xmm0,dword ptr [rax+8]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+78h]  
    mulss       xmm0,dword ptr [rax+0Ch]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+0A0h]  
    mulss       xmm0,dword ptr [rax+10h]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+0C8h]  
    mulss       xmm0,dword ptr [rax+14h]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+0F0h]  
    mulss       xmm0,dword ptr [rax+18h]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    movss       xmm0,dword ptr [rcx+118h]  
    mulss       xmm0,dword ptr [rax+1Ch]  
    addsd       xmm2,xmm1  
    cvtps2pd    xmm1,xmm0  
    addsd       xmm2,xmm1  
    cvtpd2ps    xmm0,xmm2  
    movss       dword ptr [rdx+rcx],xmm0  

Now there is no loop, and the number of instructions that will be executed overall is clearly reduced. Perhaps MS are not such a bunch of dumb clucks after all.

So finally, as an exercise, let's just quickly unroll that loop manually and see what timings we get (I didn't look at the generated code):

#include <cassert>  // for the assert below

#define UNROLL_LOOP 1

inline void Mul(int N, float* z, float* x, float* y) {
    forall (i, N)
        forall (j, N) {
            double sum = 0;
#if UNROLL_LOOP
            assert (N == 10);
            sum += x[i*N+0] * y[0*N+j];
            sum += x[i*N+1] * y[1*N+j];
            sum += x[i*N+2] * y[2*N+j];
            sum += x[i*N+3] * y[3*N+j];
            sum += x[i*N+4] * y[4*N+j];
            sum += x[i*N+5] * y[5*N+j];
            sum += x[i*N+6] * y[6*N+j];
            sum += x[i*N+7] * y[7*N+j];
            sum += x[i*N+8] * y[8*N+j];
            sum += x[i*N+9] * y[9*N+j];
#else
            forall (k, N)
                sum += x[i*N+k] * y[k*N+j];
#endif
            z[i*N+j] = sum;
        }
}

And when I did that, I got:

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed: 0:00:07.48 (compared with 10.39 / 6.86, not bad, more may be possible).

So that's the process you need to go through to analyse performance issues like this, and you need good tools. I don't know what's happening in your case, because I can't reproduce the issue, but loop unrolling is (as expected) the primary factor on MSVC when (small) loop counts are unknown.
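One more thought, which I haven't benchmarked beyond the manual unroll above: if N2 has to come from the command line but you know which cell sizes you're likely to use, you can make the cell size a template parameter and dispatch once at runtime, so the compiler always sees a constant trip count in the hot loop. A sketch (MulFixed and the dispatch switch are my own invention, not part of any code above):

// Drop-in replacement for Mul(): same body, but the cell size is a
// compile-time constant, so the compiler is free to unroll the k loop.
template <int N>
inline void MulFixed(float* z, float* x, float* y) {
    forall (i, N)
        forall (j, N) {
            double sum = 0;
            forall (k, N)
                sum += x[i*N+k] * y[k*N+j];
            z[i*N+j] = sum;
        }
}

inline void Mul(int N, float* z, float* x, float* y) {
    switch (N) {
        case 10: MulFixed<10>(z, x, y); return;  // the cell sizes you expect
        case 20: MulFixed<20>(z, x, y); return;
        case 50: MulFixed<50>(z, x, y); return;
    }
    forall (i, N)                                // generic fallback
        forall (j, N) {
            double sum = 0;
            forall (k, N)
                sum += x[i*N+k] * y[k*N+j];
            z[i*N+j] = sum;
        }
}

That way, the VAR=3 build should get the same inner loop as the VAR=0/1 builds for the common cell sizes, while keeping the run-time flexibility.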

The test code I used is here in case anyone wants to refer to it. I think you owe me a vote, OP.

Edit:

Played around a little bit at Wandbox with gcc 9.0.0. Timings (these are rather slower and a bit more imprecise since we are running on a shared box, or, more likely, in a VM):

VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known) Elapsed time = ~8sec

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown) Elapsed time = ~15.5sec

VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown), loop unrolled Elapsed time = ~ 13.5sec

So that needs a bit more investigation - with a profiler and by looking at the generated code - and is still a million miles away from what the OP is getting.

Live demo - you can play around with that yourself if you want to try different things, OP.

Paul Sanders
  • I knew it! I knew from experience that somebody who doesn't know what he is talking about will still try to answer the question. "For all I know, you're spending half your time in RandReal()." - then you know NOTHING, I repeat NOTHING. All the RandReal() stuff is O(N^2), while the multiplication is O(N^3). 8000000 calls to RandReal(), each one takes 5-20 nanoseconds. To be absolutely sure (maybe you know something which I don't?), I replaced the calls to RandReal() with just 0.5, and it did not change the time significantly (still 8.1, 3.3, and 25.4 seconds). Downvote. – user31264 Jun 30 '18 at 07:29
  • As for the suggestion to do manual loop unrolling: (1) g++ can do loop unrolling itself, and (2) it misses the point. Before doing manual optimization, I want to know what is going on; that was the point of my question, not how I could improve the speed by blind trial and error. – user31264 Jun 30 '18 at 07:32
  • Wow, where did all _that_ come from? I think you rather missed the point there. Profile the code. And maybe take a break / get some sleep. – Paul Sanders Jun 30 '18 at 07:32
  • Oh, yes, I nearly forgot in all the excitement. If you _really_ want know what the compiler is doing with your code, gaining some rudimentary knowledge of assembler would be very valuable. Then - after profiling - you can easily check if gcc has unrolled your most critical loop(s) or not (OK, well, clearly here your timings show it hasn't) and then decide _for yourself_ what to do. No blind trial and error there. You can inspect code generated by the compiler quickly and easily at [Godbolt](https://godbolt.org/). Still think I don't know what I'm doing? Honestly? – Paul Sanders Jun 30 '18 at 07:42
  • [...Time passes...] Edited my answer, I do accept it could have used a little improvement. You should consider voting back up - you have made yourself look a little foolish here. ... Oh you did. Cool, thank you. Let's hope we can get over getting off on the wrong foot. I hate to be at odds with people. Could even vote up :) – Paul Sanders Jun 30 '18 at 08:10
  • Can't profile this effectively in MSVC. The code requires variable length arrays, and here Visual Studio sticks more closely to the C++ Standard and doesn't allow them. Swapping the stack allocations out for heap would radically change the behaviour of the program. – user4581301 Jun 30 '18 at 16:26
  • @user4581301 _Swapping the stack allocations out for heap would radically change the behaviour of the program._ What makes you say that please? It did cross my mind too, but then I dismissed it, perhaps wrongly. Locality of reference? I actually rather _want_ to profile this code on MSVC, although I'm not sure quite how valid a test that would be, given we are questioning (in part, anyway) the quality of the code produced by the compiler. – Paul Sanders Jun 30 '18 at 18:46
  • Because the program throws over 48 MB on the stack. Typical *nix stack is < 10 MB, and there's nothing in the question that suggests this was increased, so odds are good the program is broken. – user4581301 Jun 30 '18 at 20:54