2

I have this class, in two variants:

Approach 1:

typedef float v4sf __attribute__ ((vector_size(16)));

class Unit
{
    public:
    Unit(int n) : num(n)
    {
        u = new float[num];
        v = new float[num];
        t = new float[num];
        //and so on for other variables
    }
    void update()
    {
        for(int i = 0; i < num; i += 4)
        {
            *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
            //many other equations
        }
    }
    int num;
    float *u, *v, *t; //and many other variables
};

Approach 2:

Same as approach 1, except that in approach 2, u, v, and all the other variables are carved out of one big chunk pre-allocated on the heap, using placement new.

typedef float v4sf __attribute__ ((vector_size(16)));

class Unit
{
    public:
    Unit(int n) : num(n)
    {
        buffer = new char[num*sizeof(*u) + num*sizeof(*v) /*..and so on for other variables..*/];
        u = new(buffer) float[num];
        v = new(buffer + sizeof(float)*num) float[num];
        //And so on for other variables
    }
    void update()
    {
        for(int i = 0; i < num; i += 4)
        {
            *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
            //many other equations
        }
    }
    int num;
    char* buffer;
    float *u, *v, *t; //and many other variables
};

However, approach 2 is 2x faster. Why is that?

There are around 12 float arrays like these, and num is 500K. update() is called 1,000 times. The measured time doesn't include the memory allocation; I measure it like this:

double start = getTime();
for( int i = 0; i < 1000; i++)
{
   unit->update();
}
double end = getTime();
cout<<end - start;

And this is around 2x faster in approach 2.

Compiler options: gcc -msse4 -O3 -ftree-vectorize.

L2 cache is 256K, RAM is 8GB, page size is 4K.

Edit: Corrected the mistake in allocating the variables in approach 2. All variables are now allocated at distinct, correct offsets. Processor is Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.

Edit: added the source here - Source. Approach 1 gives 69.58s, approach 2 gives 46.74s. Though not quite 2x, approach 2 is still significantly faster.

excray
  • 2,738
  • 4
  • 33
  • 49
  • 2
    Can you show actual code for approach 2? – R. Martinho Fernandes Jul 24 '12 at 18:21
  • What is the processor? I have a strong feeling that you may be running into a similar situation as: [Why is one loop so much slower than two loops?](http://stackoverflow.com/questions/8547778/why-is-one-loop-so-much-slower-than-two-loops) – Mysticial Jul 24 '12 at 18:25
  • 1
    u and v are the same now. I think you should use v = new(buffer + sizeof(float)*num) – holgac Jul 24 '12 at 18:25
  • 1
    It would be interesting to benchmark approach 3: using `std::vector` – aschepler Jul 24 '12 at 18:28
  • What is `num`? Is it large? On the order of 50k - 100k? – Mysticial Jul 24 '12 at 18:32
  • Well, does your updated code show the same performance difference? You had them sharing memory locations before. – Mysticial Jul 24 '12 at 18:38
Yes, the same performance difference. I mistyped the code here; the code I actually measured the difference on doesn't share memory. – excray Jul 24 '12 at 18:42

4 Answers

4

Probably because 'Approach 2' has a bug -- the variables u, v, t are all placed at exactly the same address in memory (you are passing the same address to placement new).

Edit: and now you don't... ;)

It's hard to guess without profiling, but it might be related to the default allocator. In the first approach, each variable gets its own call to new, so there is no guarantee that the arrays end up at addresses close to each other. In the second approach, you make sure they are as close together as possible. That maximizes cache utilization and limits cache misses.

tumdum
  • 1,981
  • 14
  • 19
  • In particular, data cache hit/miss ratio is probably a lot higher. – jxh Jul 24 '12 at 18:29
  • sorry, mistake in code. all variables are in different and correct locations. – excray Jul 24 '12 at 18:31
  • 1
    @VivekG does that mean a mistake when copying and pasting (that's what you did, right? You're not making us chase around problems in the wrong code, are you?) or a mistake in the original code, which now yields different results? – R. Martinho Fernandes Jul 24 '12 at 18:32
  • No sorry, the problem is real. I didn't copy paste the code as it is proprietary. I made the mistake when i typed the code here. – excray Jul 24 '12 at 18:38
  • 3
    What @RMF is saying is: Please don't post code "similar" to something you observed and claim it will have some behavior without the extra step of trying the simpler posted code out yourself. Especially important when talking about performance! – aschepler Jul 24 '12 at 18:43
  • you are right. I thought there would be a quick answer for this. Added the source. – excray Jul 24 '12 at 19:31
1

It would be useful to break down the timing and see what part is in the constructor versus what part is in update.

Since update didn't change, the only thing that would affect the timing of it is cache effects on the data. This is more than capable of accounting for a 2X difference.

Mark Ransom
  • 299,747
  • 42
  • 398
  • 622
0

A normal new is allocation + construction, while placement new is construction only.
So naturally, one allocation plus two constructions is faster than two allocations plus two constructions.
Moreover, construction of a fundamental type like float is a no-op, so in your case it comes down to 2 allocations vs 1 allocation.

Daniel
  • 30,896
  • 18
  • 85
  • 139
0

I assume that in approach 2 the compiler was able to recognize that the addresses of u and v will not change between calls, and could therefore keep some of the pointers used in the loop's equations in registers.

timos
  • 2,637
  • 18
  • 21