tl;dr: Can someone explain the differences in performance shown in the table below?

Code Setting: There is an array of integers which is filled with values inside a for loop.
VS Project Settings: Two profiles (configuration settings) are used. The first is the default Release profile; the second, let's call it D_Release, is an exact copy of the Release profile with only one difference: it uses the Multi-threaded Debug DLL (/MDd) instead of the Multi-threaded DLL (/MD).

I ran the code for the two profiles, and for each profile I stored the array on the heap, on the stack and in the BSS (so 6 different configurations in total).

The times I measured are the following:

+-------+---------+---------+
|       |   /MD   |   /MDd  |
+-------+---------+---------+
| Heap  |  8.5 ms |  3.5 ms |
+-------+---------+---------+
| Stack |  3.5 ms |  3.5 ms |
+-------+---------+---------+
| BSS   |  10 ms  |  10 ms  |
+-------+---------+---------+

[START EDIT]
After some of the comments, I measured the working set size just before the loop and got the following results (a sketch of how the measurement can be done follows the table):

+-------+---------+---------+
|       |   /MD   |   /MDd  |
+-------+---------+---------+
| Heap  | 2.23 MB | 40.6 MB |
+-------+---------+---------+
| Stack | 40.4 MB | 40.6 MB |
+-------+---------+---------+
| BSS   | 2.17 MB | 2.41 MB |
+-------+---------+---------+
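
For reference, the working set can be read just before the timed loop roughly like this (a minimal sketch using GetProcessMemoryInfo from psapi; not necessarily the exact code I used):

#include <windows.h>
#include <psapi.h>
#pragma comment(lib, "psapi.lib")

// Returns the current working set size of this process in bytes (0 on failure).
static size_t working_set_bytes()
{
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc)))
        return pmc.WorkingSetSize;
    return 0;
}

Dividing the returned value by 1024*1024 gives the numbers in the table above.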

[END EDIT]

When using the Multi-threaded DLL (/MD, i.e. the default Release profile) and storing the array on the heap, the code runs much slower; with the array in the BSS I get slow performance under either profile.

Actual Question: I find it strange that the Debug DLL runs faster. Can someone explain the differences in performance?

Extra Info: I tried manually defining and undefining the _DEBUG flag to make sure that the difference really comes from using the different DLL. I also used different timers (e.g. QueryPerformanceCounter; a rough sketch of that timer follows), tried running the executables both from VS and from the command line, and tried setting _NO_DEBUG_HEAP=1. I also used an aligned malloc (_mm_malloc) to see if anything changed, but it made no difference.
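
For completeness, the QueryPerformanceCounter-based timing looked roughly like the following (a minimal self-contained sketch, not the exact code; the allocation is simplified here to new[]):

#include <windows.h>
#include <iostream>

int main()
{
    const size_t size = 10000000;
    int * a = new int[size];

    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq); // counter ticks per second

    QueryPerformanceCounter(&t0);
    for (unsigned int i = 0; i < size; ++i)
        a[i] = i;
    QueryPerformanceCounter(&t1);

    // convert ticks to microseconds
    std::cout << (t1.QuadPart - t0.QuadPart) * 1000000.0 / freq.QuadPart << " us" << std::endl;

    delete[] a;
    return 0;
}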

Info on VS Runtime Libraries: http://msdn.microsoft.com/en-us/library/2kzt1wy3.aspx

Code Used

#include <iostream>
#include <chrono>
#include <cstdint> // uintptr_t for the alignment check

using std::cout;
using std::cerr;
using std::endl;
typedef std::chrono::high_resolution_clock hclock;

#define ALIGNMENT 32
#ifdef _MSC_VER
    #define ALIGN __declspec(align(ALIGNMENT))
#else
    #define ALIGN
    #ifndef _mm_malloc
    #define _mm_malloc(a, b) malloc(a)
    #endif
    #ifndef _mm_free
    #define _mm_free(a) free(a)
    #endif
#endif

#define HEAP 0
#define STACK 1
#define BSS 2
//SWITCH HERE
#define STORAGE 0

int main()
{
    const size_t size = 10000000;

#if STORAGE == HEAP
    cout << "Storing in the Heap\n";
    int * a = (int*)_mm_malloc(sizeof(int)*size, ALIGNMENT);
#elif STORAGE == STACK
    cout << "Storing in the Stack\n";
    ALIGN int a[size]; // ~40 MB on the stack: needs an enlarged stack reserve (/STACK), the default 1 MB is not enough
#else 
    cout << "Storing in the BSS\n";
    ALIGN static int a[size];
#endif 

    if ((uintptr_t)a % ALIGNMENT) // uintptr_t instead of int so the pointer is not truncated on 64-bit builds
    {
        cerr << "Data is not aligned" << endl;
    }

    //MAGIC STARTS HERE
    hclock::time_point end, start = hclock::now();
    for (unsigned int i = 0; i < size; ++i)
    {
        a[i] = i;
    }
    end = hclock::now();
    //MAGIC ENDS HERE

    cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << " us" << endl;

#if STORAGE == HEAP
    _mm_free(a);
#endif

    getchar();
    return 0;
}
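
Following the suggestions in the comments below, touching the whole array outside the timed region equalizes the measurements. A minimal standalone sketch of that variant for the heap case (assumed equivalent to adding a memset before the //MAGIC block above):

#include <chrono>
#include <cstdlib>
#include <cstring>
#include <iostream>

int main()
{
    typedef std::chrono::high_resolution_clock hclock;
    const size_t size = 10000000;

    int * a = (int*)malloc(sizeof(int) * size);

    // Touch every page up front so the OS commits the memory before the
    // measurement starts (the "pre-faulting" suggested in the comments).
    memset(a, 0, sizeof(int) * size);

    hclock::time_point start = hclock::now();
    for (unsigned int i = 0; i < size; ++i)
    {
        a[i] = i;
    }
    hclock::time_point end = hclock::now();

    std::cout << std::chrono::duration_cast<std::chrono::microseconds>(end - start).count() << " us" << std::endl;

    free(a);
    return 0;
}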
  • Did you try compiling the release with `/Ox`? – user657267 Dec 22 '14 at 02:37
  • Maybe compare assembly code between the different varieties – M.M Dec 22 '14 at 02:39
  • It's going to be something silly related to how artificial the test is. For example, perhaps the debug version is setting up the allocator before your timer starts and the release version is setting it up after. – David Schwartz Dec 22 '14 at 02:40
  • It might be related to lazy allocation. The debug library is probably initializing the data with garbage for the purpose of debugging. But in release, the pages aren't committed until first access. So you pass the overhead from setup to runtime. – Mysticial Dec 22 '14 at 02:41
  • @user657267 I tried it just now, but the results are the same. – Apo_ Dec 22 '14 at 02:42
  • Try running the "magic" loop twice and measure only the second run - do you get the same results? – Roman L Dec 22 '14 at 02:45
  • @MattMcNabb I forgot to mention in my post that I checked the assembly code close to the assignment (some lines before and after), but it was the same. I will check again – Apo_ Dec 22 '14 at 02:46
  • consider getting the whole array cached and pre-faulted *outside* the timed region persistently, specifically not only due to debug-allocator initializing with marker-values in debug-mode. `memset(a, 0, size * sizeof *a);` might be enough. – Deduplicator Dec 22 '14 at 02:46
  • *"... undefining the _DEBUG flag..."* - You should define either `DEBUG` and `_DEBUG` or `NDEBUG`. Microsoft honors `NDEBUG` for Release configurations. See, for example, the preprocessor macros produced by a Visual Studio template or [_DEBUG vs NDEBUG](http://stackoverflow.com/q/2290509/608639) on Stack Overflow. – jww Dec 22 '14 at 02:59
  • Pre-initializing the array seems to be the key. With the array pre-initialized I get the same measurements across columns(in the table of the post) and the same measurements for heap and bss. The stack is slightly faster. – Apo_ Dec 22 '14 at 03:03
  • I measured the working set size and indeed it is a matter of lazy allocation. So, @Mysticial can you post your comment as an answer so that I can accept it. Other users gave the same answer, but I think your comment was the most clear. – Apo_ Dec 22 '14 at 16:25
