While investigating a small performance issue, I noticed an interesting stack allocation effect. Here is the template I used to measure the time:
#include <chrono>
#include <iostream>
using namespace std;
using namespace std::chrono;
int x; // simple way to keep the compiler from optimizing the work away
void foo();
int main()
{
    const size_t n = 10000000; // ten million
    auto start = high_resolution_clock::now();
    for (size_t i = 0; i < n; i++)
    {
        foo();
    }
    auto finish = high_resolution_clock::now();
    cout << duration_cast<milliseconds>(finish - start).count() << endl;
}
Now it all comes down to the implementation of foo(). Each implementation allocates a total of 500000 ints:
Allocated in one chunk:
void foo() { const int size = 500000; int a1[size]; x = a1[size - 1]; }
Result: 7.3 seconds;
Allocated in two chunks:
void foo() { const int size = 250000; int a1[size]; int a2[size]; x = a1[size - 1] + a2[size - 1]; }
Result: 3.5 seconds;
Allocated in four chunks:
void foo() { const int size = 125000; int a1[size]; int a2[size]; int a3[size]; int a4[size]; x = a1[size - 1] + a2[size - 1] + a3[size - 1] + a4[size - 1]; }
Result: 1.8 seconds.
And so on. When I split it into 16 chunks, the time drops to 0.38 seconds.
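For reference, the 16-chunk variant follows the same pattern as above, roughly like this (500000 / 16 = 31250 ints per array; the array names are just illustrative, the exact code is analogous):

void foo()
{
    const int size = 31250; // 500000 / 16
    int a1[size];  int a2[size];  int a3[size];  int a4[size];
    int a5[size];  int a6[size];  int a7[size];  int a8[size];
    int a9[size];  int a10[size]; int a11[size]; int a12[size];
    int a13[size]; int a14[size]; int a15[size]; int a16[size];
    x = a1[size - 1]  + a2[size - 1]  + a3[size - 1]  + a4[size - 1]
      + a5[size - 1]  + a6[size - 1]  + a7[size - 1]  + a8[size - 1]
      + a9[size - 1]  + a10[size - 1] + a11[size - 1] + a12[size - 1]
      + a13[size - 1] + a14[size - 1] + a15[size - 1] + a16[size - 1];
}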
Can someone please explain why and how this happens?
I used MSVC 2013 (v120), Release build.
Update:
My machine is x64, but I compiled the program with the Win32 platform target.
When I compile it with the x64 platform target, all variants take about 40 ms.
Why does the platform choice have such a large effect?
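For anyone reproducing this, a quick sanity check to confirm which platform a binary was actually built for is to print sizeof(void*) (or check MSVC's predefined _WIN64 macro). This is just a check, not part of the measurement:

#include <iostream>

int main()
{
    // Prints 4 for a Win32 (x86) build and 8 for an x64 build.
    std::cout << "sizeof(void*) = " << sizeof(void*) << std::endl;
#ifdef _WIN64
    std::cout << "built for x64" << std::endl;
#else
    std::cout << "built for Win32 (x86)" << std::endl;
#endif
}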