I have a pretty weird problem regarding SSE usage.

I wrote the following function, where I use SSE to calculate the maximum difference between two float arrays, each containing 64 floats.

The dists array is a 2D array allocated via _aligned_malloc.

#include <iostream>
#include <xmmintrin.h>
#include <time.h>
#include <stdio.h>
#include <algorithm>
#include <fstream>

#include "hr_time.h"

using namespace std;

float** dists;
float** dists2;
__m128* a;
__m128* b;
__m128* c;
__m128* d;
__m128 diff;
__m128 diff2;
__m128 mymax;
float* myfmax;

float test(int s, int t)
{
    a = (__m128*) dists[s];
    b = (__m128*) dists[t];
    c = (__m128*) dists2[s];
    d = (__m128*) dists2[t];

    diff;
    mymax = _mm_set_ps(0.0, 0.0, 0.0, 0.0);
    for (int i = 0; i < 16; i++) // 64 floats = 16 __m128 chunks per array; <= 16 would read past the end
    {
        diff = _mm_sub_ps(*a, *b);
        mymax = _mm_max_ps(diff, mymax);

        diff2 = _mm_sub_ps(*d, *c);
        mymax = _mm_max_ps(diff2, mymax);

        a++;
        b++;
        c++;
        d++;
    }

    _mm_store_ps(myfmax, mymax);
    float res = max(max(max(myfmax[0], myfmax[1]), myfmax[2]), myfmax[3]);
    return res;
}

int Deserialize(std::istream* stream)
{
    int numOfElements, arraySize;

    stream->read((char*)&numOfElements, sizeof(int)); // numOfElements = 64
    stream->read((char*)&arraySize, sizeof(int)); // arraySize = 8000000 

    dists = (float**)_aligned_malloc(arraySize * sizeof(float*), 16);
    dists2 = (float**)_aligned_malloc(arraySize * sizeof(float*), 16);
    for (int j = 0; j < arraySize; j++)
    {
        dists[j] = (float*)_aligned_malloc(numOfElements * sizeof(float), 16);
        dists2[j] = (float*)_aligned_malloc(numOfElements * sizeof(float), 16);
    }

    for (int i = 0; i < arraySize; i++)
    {
        stream->read((char*)dists[i], (numOfElements*sizeof(float)));
    }

    for (int i = 0; i < arraySize; i++)
    {
        stream->read((char*)dists2[i], (numOfElements*sizeof(float)));
    }

    return 0;
}

int main(int argc, char** argv)
{
    int entries = 8000000;

    myfmax = (float*)_aligned_malloc(4 * sizeof(float), 16);
    ifstream fs("binary_file", std::ios::binary);
    Deserialize(&fs);

    CStopWatch* watch = new CStopWatch();
    watch->StartTimer();
    int i;
    for (i = 0; i < entries; i++)
    {
        int s = rand() % entries;
        int t = rand() % entries;
        test(s, t);
    }
    watch->StopTimer();
    cout << i << " iterations took " << watch->GetElapsedTimeMs() << "ms" << endl;

    cin.get();
}

My problem is that this code runs very fast if I run it in Visual Studio with an attached debugger, but as soon as I execute it without the debugger it gets very slow. So I did a little research and found out that one difference between those two launch methods is the "Debug Heap". So I disabled it by defining "_NO_DEBUG_HEAP=1". With that option I get very poor performance with an attached debugger, too.

But I don't understand how I can get better performance by using the Debug Heap, and I don't know how to solve this problem, so I hope one of you guys can help me.

Thanks in advance.

Regards, Karsten

Memorex42
  • Hmm, weird! Can we see your allocation code? – Cameron Apr 28 '14 at 22:08
  • Are you compiling for release or debug? How large are the allocations? What is "fast" and "very slow" - how are you measuring this, and over what code? What is `diff;` in the middle of your code? Sorry, lots of questions, but it's almost impossible to determine what is wrong with only the code posted... – Mats Petersson Apr 28 '14 at 22:28
  • Let me guess: You're running on a pre-Sandy Bridge processor? – Mysticial Apr 28 '14 at 22:39
  • Can you try using explicit loads with `_mm_load_ps`? – Z boson Apr 29 '14 at 07:34
  • `float fmax` must be a bug. It should be `float fmax[4]`. It also needs to be 16-byte aligned if you're going to use `_mm_store_ps`. Do `float __declspec(align(16)) fmax[4]`. – Z boson Apr 29 '14 at 07:37
  • I extracted a minimal example from the production code where this happens. When I run this code in release mode with an attached debugger it takes 4000ms to execute, and without the debugger 4500ms. My CPU is an AMD Phenom II X6. – Memorex42 Apr 29 '14 at 08:21
  • @user18298, your code looks fine now. Personally, I would still use explicit loads in your loop instead of implicit loads, but I don't know if that makes a difference. – Z boson Apr 29 '14 at 08:47
  • @Mysticial, can you please explain what you mean by "Let me guess: You're running on a pre-Sandy Bridge processor?" – Z boson Apr 30 '14 at 12:01
  • @Zboson Before the OP posted the code, there was no sign of initialization. In debug, it would be initialized to something like `0xcccccccc`. In release, it would be heap garbage, which has a high probability of being denormalized. All processors prior to Sandy Bridge suffer huge slowdowns for [denormal floats](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x). – Mysticial Apr 30 '14 at 15:35
  • @user18298, can you add `_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);` (#include <xmmintrin.h>) to the start of your code? See Mysticial's answer here: [denormal floats](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x). – Z boson Apr 30 '14 at 15:39 (a sketch of this suggestion follows these comments)
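
For illustration, here is a minimal sketch of the flush-to-zero setup suggested in the last comment. The additional denormals-are-zero mode (pmmintrin.h, SSE3) is an assumption on my part and goes beyond what the comment asks for:

#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE (SSE3, optional assumption)

int main()
{
    // Flush denormal results to zero so subtractions of nearly equal
    // values never produce operands that hit the slow denormal path.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    // Also treat denormal inputs as zero (SSE3 or later).
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // ... run Deserialize() and the timing loop as before ...
    return 0;
}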

1 Answer

Your code has a bug. `_mm_store_ps` stores an array of four floats but you only declare one. The compiler should not even allow you to do that.

Change

float fmax;
_mm_store_ps(fmax, max);
pi = std::max(std::max(std::max(fmax[0], fmax[1]), fmax[2]), fmax[3]);

to

float __declspec(align(16)) fmax[4];
_mm_store_ps(fmax, max);
return std::max(std::max(std::max(fmax[0], fmax[1]), fmax[2]), fmax[3]);
Z boson
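
For completeness, a minimal self-contained sketch of the store-and-reduce step the answer describes; the helper name `horizontal_max` is introduced here for illustration and does not appear in the original code:

#include <algorithm>
#include <xmmintrin.h>

// Spill a __m128 to the 16-byte-aligned four-float array that
// _mm_store_ps actually requires, then take the scalar maximum
// of the four lanes.
float horizontal_max(__m128 v)
{
    __declspec(align(16)) float tmp[4];
    _mm_store_ps(tmp, v);
    return std::max(std::max(std::max(tmp[0], tmp[1]), tmp[2]), tmp[3]);
}

With such a helper, the test() function above could simply end with `return horizontal_max(mymax);`.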