tl;dr: Can someone explain the differences in performance shown in the tables below?
Code Setting: An array of integers is filled with values inside a for loop.
VS Project Settings: Two profiles (configurations) are used. The first is the default Release profile; the second, let's call it D_Release, is an exact copy of the Release profile with only one difference: it uses the Multi-threaded Debug DLL (/MDd) runtime library instead of the Multi-threaded DLL (/MD).
I ran the code under both profiles, and for each profile I stored the array on the heap, on the stack, and in the BSS (so 6 different configurations).
The times I measured are the following:
+-------+---------+---------+
|       |   /MD   |  /MDd   |
+-------+---------+---------+
| Heap  |  8.5 ms |  3.5 ms |
| Stack |  3.5 ms |  3.5 ms |
| BSS   | 10 ms   | 10 ms   |
+-------+---------+---------+
[START EDIT]
After some comments, I measured the working set size just before the loop and got the following results:
+-------+---------+---------+
|       |   /MD   |  /MDd   |
+-------+---------+---------+
| Heap  | 2.23 MB | 40.6 MB |
| Stack | 40.4 MB | 40.6 MB |
| BSS   | 2.17 MB | 2.41 MB |
+-------+---------+---------+
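For reference, here is a minimal sketch of one way to take this measurement (it assumes the Win32 PSAPI function GetProcessMemoryInfo; the exact measuring code is not shown in the program below, the idea is to call something like this right before the loop):

#include <windows.h>
#include <psapi.h>
#include <iostream>
#pragma comment(lib, "psapi.lib")

// Prints the current process's working set in MB.
void print_working_set()
{
    PROCESS_MEMORY_COUNTERS pmc = {};
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        std::cout << "Working set: " << pmc.WorkingSetSize / (1024.0 * 1024.0) << " MB\n";
}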
[END EDIT]
When using the Multi-threaded DLL (/MD), i.e. the default Release profile, and storing the array on the heap, the code runs much slower; with the array in the BSS I get slow performance under either runtime.
Actual Question: I find it strange that the Debug DLL runs faster. Can someone explain these differences in performance?
Extra Info: I tried manually defining and undefining the _DEBUG flag to make sure the difference really comes from the runtime DLL. I also used different timers (e.g. QueryPerformanceCounter, sketch below), ran the executables both from VS and from the command line, and tried setting the _NO_DEBUG_HEAP=1 environment variable. Using an aligned allocation (_mm_malloc) didn't make a difference either.
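For completeness, the QueryPerformanceCounter timing followed the usual Win32 pattern; a minimal sketch (assumes only <windows.h>):

#include <windows.h>
#include <iostream>

int main()
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq); // ticks per second
    QueryPerformanceCounter(&t0);
    // ... work to be timed ...
    QueryPerformanceCounter(&t1);
    // convert tick delta to microseconds
    std::cout << (t1.QuadPart - t0.QuadPart) * 1e6 / freq.QuadPart << " us\n";
    return 0;
}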
Info on VS Runtime Libraries: http://msdn.microsoft.com/en-us/library/2kzt1wy3.aspx
Code Used
#include <iostream>
#include <chrono>
#include <cstdint> // std::uintptr_t for the alignment check
#include <cstdio>  // getchar
using std::cout;
using std::cerr;
using std::endl;
typedef std::chrono::high_resolution_clock hclock;
#define ALIGNMENT 32
#ifdef _MSC_VER
#define ALIGN __declspec(align(ALIGNMENT))
#else
#define ALIGN
#ifndef _mm_malloc
#define _mm_malloc(a, b) malloc(a)
#endif
#ifndef _mm_free
#define _mm_free(a) free(a)
#endif
#endif
#define HEAP 0
#define STACK 1
#define BSS 2
// SWITCH HERE: 0 = heap, 1 = stack, 2 = BSS
#define STORAGE 0
int main()
{
    const size_t size = 10000000;
#if STORAGE == HEAP
    cout << "Storing in the Heap\n";
    int * a = (int*)_mm_malloc(sizeof(int)*size, ALIGNMENT);
#elif STORAGE == STACK
    // NB: this is ~40 MB on the stack, which needs a larger-than-default
    // stack reserve (e.g. the linker's /STACK setting).
    cout << "Storing in the Stack\n";
    ALIGN int a[size];
#else
    cout << "Storing in the BSS\n";
    ALIGN static int a[size];
#endif
    // std::uintptr_t instead of int: casting a pointer to int truncates on x64
    if ((std::uintptr_t)a % ALIGNMENT)
    {
        cerr << "Data is not aligned" << endl;
    }
    //MAGIC STARTS HERE
    hclock::time_point end, start = hclock::now();
    for (unsigned int i = 0; i < size; ++i)
    {
        a[i] = i;
    }
    end = hclock::now();
    //MAGIC ENDS HERE
    cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << " us" << endl;
#if STORAGE == HEAP
    _mm_free(a);
#endif
    getchar();
    return 0;
}