While investigating a small performance issue, I noticed an interesting stack allocation behavior. Here is the template I use to measure time:

#include <chrono>
#include <iostream>

using namespace std;
using namespace std::chrono;

int x; // sink, to suppress optimization of foo's body
void foo();

int main()
{   
    const size_t n = 10000000; // ten million
    auto start = high_resolution_clock::now();

    for (size_t i = 0; i < n; i++)
    {
        foo();
    }

    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}

Now everything depends on the implementation of foo(); each version allocates a total of 500,000 ints on the stack:

  1. Allocated in one chunk:

    void foo()
    {
        const int size = 500000;
        int a1[size];
    
        x = a1[size - 1];
    }  
    

    Result: 7.3 seconds;

  2. Allocated in two chunks:

    void foo()
    {
        const int size = 250000;
        int a1[size];
        int a2[size];
    
        x = a1[size - 1] + a2[size - 1];
    }
    

    Result: 3.5 seconds;

  3. Allocated in four chunks:

    void foo()
    {
        const int size = 125000;
        int a1[size];
        int a2[size];
        int a3[size];
        int a4[size];
    
        x = a1[size - 1] + a2[size - 1] +
            a3[size - 1] + a4[size - 1];
    } 
    

    Result: 1.8 seconds.

And so on... When I split it into 16 chunks, the time came down to 0.38 seconds.


Can someone explain why and how this happens?
I used MSVC 2013 (v120), Release build.

UPD:
My machine is an x64 platform, but I compiled the code for the Win32 platform.
When I compile it for the x64 platform instead, all cases take about 40 ms.
Why does the platform choice affect the result so much?

Cheers and hth. - Alf
MrPisarik
  • Blindly: Apart from possible compiler optimization, the performance you see there was certainly crippled by **Cache Misses**... Usually, you should do your benchmarks at the highest optimization levels. :-) – WhiZTiM Aug 09 '16 at 00:55
  • What are your PC specs? Compiler version, and Compilation flags? – WhiZTiM Aug 09 '16 at 01:00
  • @WhiZTiM, I tried to avoid optimization :) Can you suggest an improvement to precisely avoid _compiler optimization_ and _cache misses_? – MrPisarik Aug 09 '16 at 01:00
  • Contrary to what you got. I ran your code... With high optimization level `-O3`. I got `21ms` for the first, `31ms` for the second, and `23ms` for the third. Counter-intuitive right? – WhiZTiM Aug 09 '16 at 01:07
  • @WhiZTiM, Ok. In MSVC I opened properties of project and copied flags from C/C++ -> Command Line: `/GS /GL /analyze- /W3 /Gy /Zc:wchar_t /Zi /Gm- /O2 /Fd"Release\vc120.pdb" /fp:precise /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_LIB" /D "_UNICODE" /D "UNICODE" /errorReport:prompt /WX- /Zc:forScope /Gd /Oy- /Oi /MD /Fa"Release\" /EHsc /nologo /Fo"Release\" /Fp"Release\Temp.pch"`. And typed in console `cl /?`: `Microsoft C/C++ Compiler 15.00.21022.08 for x64`. But I used Win32 Platform – MrPisarik Aug 09 '16 at 01:07
  • @WhiZTiM, Yeah, counter-intuitive! Unfortunately I don't see `-O3` level in MSVC 2013. There is `"Full optimization /Ox "` and it gives same results as in question – MrPisarik Aug 09 '16 at 01:11
  • 3
    Please don't post code with non-standard `void main`. FTFY. – Cheers and hth. - Alf Aug 09 '16 at 02:39
  • You will have to check out the machine code to make sense of the behavior. It doesn't make much sense because the stack frame size should be the same in all cases, and initializations for debugging should make the four-split just microscopically slower instead of much faster. But as others have pointed out, this is behavior that can't be relied on: it varies with optimizations etc. – Cheers and hth. - Alf Aug 09 '16 at 02:44
  • 1
    I've been looking at it on GCC 4.9.2 on Linux, and it produces exactly the result you'd expect. At all optimization levels, allocating the single array `a1` is faster than the four arrays `a1...a4`, using your code. In all cases, the entire function call takes less than 40ms, rather than the several seconds you're experiencing. The difference in times between all of the above is marginal - about 15ms at best. – Sebastian Lenartowicz Aug 09 '16 at 03:03
  • 2
    Compiler writers *love* UB, it provides so many ways to make code run faster. You are reading a value that was never written. The optimizer notices this, knows that any value is good enough so *only* reads a1[0]. Which in turn allows a2, a3 and a4 to be eliminated. Which makes the stack frame smaller. Which makes it spend less time on _chkstk. So execution time is proportional to the value of `size`. – Hans Passant Aug 09 '16 at 07:24
  • @HansPassant, I like your comment and think that it should be an answer. Maybe you can also assume cause of result in **UPD:** part? – MrPisarik Aug 09 '16 at 13:44

2 Answers


Looking at the disassembly from VS2015 Update 3: in the two- and four-array versions of foo, the compiler optimizes out the unused arrays, so it only reserves stack space for one array in each function. Since the later functions have smaller arrays, this takes less time. The assignment to x reads the same memory location regardless of how many arrays are declared. (Since the arrays are uninitialized, reading from them is undefined behavior.) Without optimization, the code does read from 2 or 4 distinct arrays.

The long time taken for these functions is due to stack probes performed by __chkstk as part of stack overflow detection (necessary when the compiler needs more than 1 page of space to hold all the local variables).

1201ProgramAlarm
  • Yeah, that's it! But why does the speed increase so dramatically when I change the target platform to x64? I expected that switching to x64 would double the page size, i.e. make it 8K, so speed should increase. But there are two cases: the one in the question (dramatic increase), and another where you add `= { 0 }` to each array declaration (and delete two zeros from n to reduce the wait), in which case there is no speed difference between platforms (though all 3 cases then run in the same time). Maybe it's because the cost of allocating is so small relative to the rest? – MrPisarik Aug 09 '16 at 10:58

You should look at the resulting assembler code to see what your compiler really does with it. For gcc/clang/icc you can use Matt Godbolt's Compiler Explorer.

clang optimizes everything out because of the UB, and the result is (foo is the first version, foo2 the second):

foo:                                    # @foo
        retq

foo2:                                   # @foo2
        retq

icc treats both versions very similarly:

foo:
        pushq     %rbp                                          #4.1
        movq      %rsp, %rbp                                    #4.1
        subq      $2000000, %rsp                                #4.1
        movl      -4(%rbp), %eax                                #8.9
        movl      %eax, x(%rip)                                 #8.5
        leave                                                   #10.1
        ret                                                     #10.1

foo2:
        pushq     %rbp                                          #13.1
        movq      %rsp, %rbp                                    #13.1
        subq      $2000000, %rsp                                #13.1
        movl      -1000004(%rbp), %eax                          #18.9
        addl      -4(%rbp), %eax                                #18.24
        movl      %eax, x(%rip)                                 #18.5
        leave                                                   #19.1
        ret 

gcc creates different assembler code for the different versions. Version 6.1 produces code that shows behavior similar to your experiments:

foo:
        pushq   %rbp
        movq    %rsp, %rbp
        subq    $2000016, %rsp
        movl    1999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret
foo2:
        pushq   %rbp
        movl    $1000016, %edx  #only the first array is allocated
        movq    %rsp, %rbp
        subq    %rdx, %rsp
        leaq    3(%rsp), %rax
        subq    %rdx, %rsp
        shrq    $2, %rax
        movl    999996(,%rax,4), %eax
        addl    999996(%rsp), %eax
        movl    %eax, x(%rip)
        leave
        ret

Thus the only way to understand the difference is to look at the assembler code produced by your compiler; everything else is just guessing.

ead