Introductory note: Much of this post has been re-written, so some of the comments below won't make too much sense anymore. Please check the edit history for details, if you care. So...
tl;dr
- profile to find hot loops
- use constant counts for these if possible
- try manually unrolling them if not
- OP's results remain a mystery; nobody else can reproduce anything so extreme
I agree with @user4581301 - the more the compiler knows ahead of time, the more it can do for you in terms of optimising your code.
But you need to profile this code - wall clock times will only take you so far. I don't know a whole lot about profilers for gcc (I have a good one for MSVC) but you could try your luck here.
It also pays (as @RetiredNinja said right off the bat) to try to learn some assembler, using Godbolt as a tool, especially if you want to understand a slowdown as dramatic as this.
Now having said all that, your timings make no sense to me so something strange is going on on your system. So I ran some tests of my own, and my results differ markedly from yours. I ran some of these tests on MSVC (because I have such wonderful profiling tools there) and some on gcc on the Mac (although I think that is actually secretly clang under the hood). I have no linux box, sorry.
Firstly, let's deal with the issue of allocating such large objects on the stack. This is obviously not wise, and I can't do it at all on MSVC since it doesn't support variable length arrays, but my tests on the Mac showed that this made no difference to the timings once I had increased ulimit to get it to work at all (see here). So I think we can discount that as a factor, as you yourself say in the comments.
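If you want to sidestep the stack question entirely, the portable route is to put the buffers on the heap. Here's a minimal sketch of that idea, assuming the matrices are just flat float buffers of the kind Mul() expects; RunOnHeap is a hypothetical name, this isn't the OP's actual code:

```cpp
#include <vector>

// Hypothetical alternative to the stack / VLA version: std::vector puts the
// storage on the heap, so no ulimit fiddling is needed, and it also builds
// on MSVC, which has no variable length arrays.
void RunOnHeap(int N) {
    std::vector<float> x(static_cast<size_t>(N) * N);
    std::vector<float> y(static_cast<size_t>(N) * N);
    std::vector<float> z(static_cast<size_t>(N) * N);
    // ... fill x and y, then:
    // Mul(N, z.data(), x.data(), y.data());
}
```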
So now let's look at the timings I got on the Mac:
VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
user 0m10.813s
VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known)
user 0m11.008s
VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown)
user 0m12.714s
VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
user 0m12.778s
VAR=0 USE_STACK=1 N=2000 (known) N2=10 (known)
user 0m10.617s
VAR=1 USE_STACK=1 N=2000 (unknown) N2=10 (known)
user 0m10.987s
VAR=2 USE_STACK=1 N=2000 (known) N2=10 (unknown)
user 0m12.653s
VAR=3 USE_STACK=1 N=2000 (unknown) N2=10 (unknown)
user 0m12.673s
Not much to see there; let's move on to what I observed on MSVC (where I can profile):
VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
Elapsed: 0:00:06.89
VAR=1 USE_STACK=0 N=2000 (unknown) N2=10 (known)
Elapsed: 0:00:06.86
VAR=2 USE_STACK=0 N=2000 (known) N2=10 (unknown)
Elapsed: 0:00:10.24
VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed: 0:00:10.39
Now we have something we can get our teeth into. As @geza observed, the code takes longer to run when N2 isn't known, which is entirely in line with what one might expect, since this is where the hot loops are, and the compiler is much more likely to unroll such a loop when it knows it consists of a small, fixed number of iterations.
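To see why, here's the shape of the thing in isolation (a sketch with hypothetical names, not the OP's code): with a runtime bound the compiler has to keep the loop machinery around, but with a compile-time bound of 10 it can see exactly how many iterations there are and is free to flatten them.

```cpp
// Sketch only: the same reduction with a runtime trip count and with a
// compile-time one. Most optimisers will fully unroll the second form at
// -O2/-O3 because the count (10) is a known constant.
double SumRuntime(const float* x, const float* y, int n) {
    double sum = 0;
    for (int k = 0; k < n; ++k)     // n only known at run time
        sum += x[k] * y[k];
    return sum;
}

double SumConstant(const float* x, const float* y) {
    constexpr int n = 10;           // known at compile time
    double sum = 0;
    for (int k = 0; k < n; ++k)     // prime candidate for full unrolling
        sum += x[k] * y[k];
    return sum;
}
```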
So let's get some information from the profiler. It tells me that the hot loop (by quite a long way) is the inner loop in Mul():
inline void Mul(int N, float* z, float* x, float* y) {
    forall (i, N)
        forall (j, N) {
            double sum = 0;
=>          forall (k, N)
=>              sum += x[i*N+k] * y[k*N+j];
            z[i*N+j] = sum;
        }
}
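A couple of notes on that listing: the => arrows just mark the lines the profiler attributes most of the time to, and forall is a macro from the OP's test code. I'm assuming it's the usual counted loop, i.e. something along these lines:

```cpp
// Assumed definition of the forall macro used in the OP's test code.
#define forall(i, n) for (int i = 0; i < (n); ++i)
```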
Again, I can't say this surprises me much, and when I look at the code I can see that the loop is not unrolled at all (loop setup code omitted for brevity):
$1:
movss xmm0,dword ptr [r9+rsi*4]
mulss xmm0,dword ptr [r8+4]
movss xmm1,dword ptr [r9+r15*4]
mulss xmm1,dword ptr [r8]
cvtps2pd xmm2,xmm0
cvtps2pd xmm0,xmm1
movss xmm1,dword ptr [r8+8]
mulss xmm1,dword ptr [r9]
addsd xmm0,xmm3
addsd xmm2,xmm0
cvtps2pd xmm0,xmm1
movss xmm1,dword ptr [r9+r14*4]
movaps xmm3,xmm2
mulss xmm1,dword ptr [r8+0Ch]
add r9,rbp
add r8,10h
addsd xmm3,xmm0
cvtps2pd xmm0,xmm1
addsd xmm3,xmm0
sub rcx,1
jne $1
Now it doesn't look as though there would be much to be saved by unrolling that loop, since looping around is cheap compared with executing the rest of the code in the body, but if you look at the disassembly of the same loop when N2 is known, you get a surprise:
movss xmm0,dword ptr [rax-8]
mulss xmm0,dword ptr [rcx-50h]
cvtps2pd xmm2,xmm0
movss xmm0,dword ptr [rcx-28h]
mulss xmm0,dword ptr [rax-4]
addsd xmm2,xmm7
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx]
mulss xmm0,dword ptr [rax]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+28h]
mulss xmm0,dword ptr [rax+4]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+50h]
mulss xmm0,dword ptr [rax+8]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+78h]
mulss xmm0,dword ptr [rax+0Ch]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+0A0h]
mulss xmm0,dword ptr [rax+10h]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+0C8h]
mulss xmm0,dword ptr [rax+14h]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+0F0h]
mulss xmm0,dword ptr [rax+18h]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
movss xmm0,dword ptr [rcx+118h]
mulss xmm0,dword ptr [rax+1Ch]
addsd xmm2,xmm1
cvtps2pd xmm1,xmm0
addsd xmm2,xmm1
cvtpd2ps xmm0,xmm2
movss dword ptr [rdx+rcx],xmm0
Now there is no loop, and the number of instructions that will be executed overall is clearly reduced. Perhaps MS are not such a bunch of dumb clucks after all.
So finally, as an exercise, let's just quickly unroll that loop manually and see what timings we get (I didn't look at the generated code):
#include <cassert>

#define UNROLL_LOOP 1

inline void Mul(int N, float* z, float* x, float* y) {
    forall (i, N)
        forall (j, N) {
            double sum = 0;
#if UNROLL_LOOP
            assert(N == 10);
            sum += x[i*N]   * y[0*N+j];
            sum += x[i*N+1] * y[1*N+j];
            sum += x[i*N+2] * y[2*N+j];
            sum += x[i*N+3] * y[3*N+j];
            sum += x[i*N+4] * y[4*N+j];
            sum += x[i*N+5] * y[5*N+j];
            sum += x[i*N+6] * y[6*N+j];
            sum += x[i*N+7] * y[7*N+j];
            sum += x[i*N+8] * y[8*N+j];
            sum += x[i*N+9] * y[9*N+j];
#else
            forall (k, N)
                sum += x[i*N+k] * y[k*N+j];
#endif
            z[i*N+j] = sum;
        }
}
And when I did that, I got:
VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed: 0:00:07.48 (compared with 10.39 without unrolling and 6.86 with both counts known at compile time; not bad, and more may be possible).
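If you'd rather not maintain a hand-unrolled loop, another option, sketched below and not something I benchmarked, is to hand the compiler the constant via a template parameter and dispatch to it for the sizes you actually care about, so it can do the unrolling itself (MulFixed is a hypothetical name):

```cpp
// Sketch of an alternative to manual unrolling: bake the matrix size in as
// a template parameter so the compiler sees a constant trip count and can
// do the unrolling itself.
template <int N>
inline void MulFixed(float* z, const float* x, const float* y) {
    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j) {
            double sum = 0;
            for (int k = 0; k < N; ++k)   // constant bound, unrollable
                sum += x[i*N + k] * y[k*N + j];
            z[i*N + j] = static_cast<float>(sum);
        }
}

// At the call site, dispatch on the run-time value for the sizes you
// actually use, e.g.:
//   if (N2 == 10) MulFixed<10>(z, x, y); else Mul(N2, z, x, y);
```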
So that's the process you need to go through to analyse performance issues like this, and you need good tools. I don't know what's happening in your case, because I can't reproduce the issue, but the lack of loop unrolling is (as expected) the main cost on MSVC when the (small) loop count isn't known at compile time.
The test code I used is here in case anyone wants to refer to it. I think you owe me a vote, OP.
Edit:
Played around a little bit at Wandbox with gcc 9.0.0. Timings (these are rather slower and a bit more imprecise since we are running on a shared box, or, more likely, in a VM):
VAR=0 USE_STACK=0 N=2000 (known) N2=10 (known)
Elapsed time = ~8sec
VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown)
Elapsed time = ~15.5sec
VAR=3 USE_STACK=0 N=2000 (unknown) N2=10 (unknown), loop unrolled
Elapsed time = ~13.5sec
So that needs a bit more investigation - with a profiler and by looking at the generated code - and is still a million miles away from what the OP is getting.
Live demo - you can play around with that yourself if you want to try different things, OP.