I have a pretty weird problem regarding SSE usage.

I wrote the following function, where I use SSE to calculate the maximum difference between two float arrays, each containing 64 floats.

The dists array is a 2D array allocated via _aligned_malloc.

#include <iostream>
#include <xmmintrin.h>
#include <time.h>
#include <stdio.h>
#include <algorithm>
#include <fstream>

#include "hr_time.h"

using namespace std;

float** dists;
float** dists2;
__m128* a;
__m128* b;
__m128* c;
__m128* d;
__m128 diff;
__m128 diff2;
__m128 mymax;
float* myfmax;

float test(int s, int t)
{
    a = (__m128*) dists[s];
    b = (__m128*) dists[t];
    c = (__m128*) dists2[s];
    d = (__m128*) dists2[t];

    diff;
    mymax = _mm_set_ps(0.0, 0.0, 0.0, 0.0);
    for (int i = 0; i < 16; i++) // 64 floats = 16 __m128 chunks per array; <= 16 would read past the end
    {
        diff = _mm_sub_ps(*a, *b);
        mymax = _mm_max_ps(diff, mymax);

        diff2 = _mm_sub_ps(*d, *c);
        mymax = _mm_max_ps(diff2, mymax);

        a++;
        b++;
        c++;
        d++;
    }

    _mm_store_ps(myfmax, mymax);
    float res = max(max(max(myfmax[0], myfmax[1]), myfmax[2]), myfmax[3]);
    return res;
}

int Deserialize(std::istream* stream)
{
    int numOfElements, arraySize;

    stream->read((char*)&numOfElements, sizeof(int)); // numOfElements = 64
    stream->read((char*)&arraySize, sizeof(int)); // arraySize = 8000000 

    dists = (float**)_aligned_malloc(arraySize * sizeof(float*), 16);
    dists2 = (float**)_aligned_malloc(arraySize * sizeof(float*), 16);
    for (int j = 0; j < arraySize; j++)
    {
        dists[j] = (float*)_aligned_malloc(numOfElements * sizeof(float), 16);
        dists2[j] = (float*)_aligned_malloc(numOfElements * sizeof(float), 16);
    }

    for (int i = 0; i < arraySize; i++)
    {
        stream->read((char*)dists[i], (numOfElements*sizeof(float)));
    }

    for (int i = 0; i < arraySize; i++)
    {
        stream->read((char*)dists2[i], (numOfElements*sizeof(float)));
    }

    return 0;
}

int main(int argc, char** argv)
{
    int entries = 8000000;

    myfmax = (float*)_aligned_malloc(4 * sizeof(float), 16);
    ifstream fs("binary_file", std::ios::binary);
    Deserialize(&fs);

    CStopWatch* watch = new CStopWatch();
    watch->StartTimer();
    int i;
    for (i = 0; i < entries; i++)
    {
        int s = rand() % entries;
        int t = rand() % entries;
        test(s, t);
    }
    watch->StopTimer();
    cout << i << " iterations took " << watch->GetElapsedTimeMs() << "ms" << endl;

    cin.get();
}

My problem is that this code runs very fast if I run it in Visual Studio with an attached debugger, but as soon as I execute it without the debugger it gets very slow. So I did a little research and found out that one difference between those two launch methods is the "Debug Heap". So I disabled it by defining "_NO_DEBUG_HEAP=1". With that option I get very poor performance with an attached debugger, too.

But I don't understand how I can get better performance by using the Debug Heap, and I don't know how to solve this problem, so I hope one of you guys can help me.

Thanks in advance.

Regards, Karsten

Memorex42
  • Hmm, weird! Can we see your allocation code? – Cameron Apr 28 '14 at 22:08
  • Are you compiling for release or debug? How large are the allocations? What is "fast" and "very slow" - how are you measuring this, and over what code? What is `diff;` in the middle of your code? Sorry, lots of questions, but it's almost impossible to determine what is wrong with only the code posted... – Mats Petersson Apr 28 '14 at 22:28
  • Let me guess: You're running on a pre-Sandy Bridge processor? – Mysticial Apr 28 '14 at 22:39
  • Can you try using explicit loads with `_mm_load_ps`? – Z boson Apr 29 '14 at 07:34
  • `float fmax` must be a bug. It should be `float fmax[4]`. It also needs to be 16-byte aligned if you're going to use `_mm_store_ps`. Do `float __declspec(align(16)) fmax[4]`. – Z boson Apr 29 '14 at 07:37
  • I extracted a minimal example from the production code where this happens. When I run this code in release mode with an attached debugger it takes 4000ms to execute, and without the debugger 4500ms. My CPU is an AMD Phenom II X6. – Memorex42 Apr 29 '14 at 08:21
  • @user18298, your code looks fine now. Personally, I would still use explicit loads in your loop instead of implicit loads, but I don't know if that makes a difference. – Z boson Apr 29 '14 at 08:47
  • @Mysticial, can you please explain what you mean by "Let me guess: You're running on a pre-Sandy Bridge processor?" – Z boson Apr 30 '14 at 12:01
  • @Zboson Before the OP posted the code, there was no sign of initialization. In debug, it would be initialized to something like `0xcccccccc`. In release, it would be heap garbage, which has a high probability of being denormalized. All processors prior to Sandy Bridge suffer huge slowdowns for [denormal floats](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x). – Mysticial Apr 30 '14 at 15:35
  • @user18298, can you add `_MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);` (#include <xmmintrin.h>) to the start of your code? See Mysticial's answer here: [denormal floats](http://stackoverflow.com/questions/9314534/why-does-changing-0-1f-to-0-slow-down-performance-by-10x). – Z boson Apr 30 '14 at 15:39 (a sketch of this suggestion follows these comments)
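
For illustration, here is a minimal sketch of the flush-to-zero setup suggested in the last comment. The additional denormals-are-zero mode (pmmintrin.h, SSE3) is an assumption on my part and goes beyond what the comment asks for:

#include <xmmintrin.h> // _MM_SET_FLUSH_ZERO_MODE / _MM_FLUSH_ZERO_ON
#include <pmmintrin.h> // _MM_SET_DENORMALS_ZERO_MODE (SSE3, optional assumption)

int main()
{
    // Flush denormal results to zero so subtractions of nearly equal
    // values never produce operands that hit the slow denormal path.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    // Also treat denormal inputs as zero (SSE3 or later).
    _MM_SET_DENORMALS_ZERO_MODE(_MM_DENORMALS_ZERO_ON);

    // ... run Deserialize() and the timing loop as before ...
    return 0;
}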

1 Answer

Your code has a bug. `_mm_store_ps` stores an array of four floats but you only declare one. The compiler should not even allow you to do that.

Change

float fmax;
_mm_store_ps(fmax, max);
pi = std::max(std::max(std::max(fmax[0], fmax[1]), fmax[2]), fmax[3]);

to

float __declspec(align(16)) fmax[4];
_mm_store_ps(fmax, max);
return std::max(std::max(std::max(fmax[0], fmax[1]), fmax[2]), fmax[3]);
Z boson
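
For completeness, a minimal self-contained sketch of the store-and-reduce step the answer describes; the helper name `horizontal_max` is introduced here for illustration and does not appear in the original code:

#include <algorithm>
#include <xmmintrin.h>

// Spill a __m128 to the 16-byte-aligned four-float array that
// _mm_store_ps actually requires, then take the scalar maximum
// of the four lanes.
float horizontal_max(__m128 v)
{
    __declspec(align(16)) float tmp[4];
    _mm_store_ps(tmp, v);
    return std::max(std::max(std::max(tmp[0], tmp[1]), tmp[2]), tmp[3]);
}

With such a helper, the test() function above could simply end with `return horizontal_max(mymax);`.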