I have this class, implemented in two ways.
Approach 1:
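    typedef float v4sf __attribute__((vector_size(16)));
    class Unit
    {
    public:
        Unit(int num) : num(num)
        {
            u = new float[num];
            v = new float[num];
            t = new float[num];
            // ...and so on for the other variables
        }
        void update()
        {
            // Processes 4 floats per iteration; assumes num is a multiple of 4
            // and that every array is 16-byte aligned.
            for (int i = 0; i < num; i += 4)
            {
                *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
                // many other equations
            }
        }
        int num;
        float *u, *v, *t; // and many other variables
    };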
Approach 2:
Same as approach 1, except that in approach 2, v, u, and all other variables are allocated inside one big chunk pre-allocated on the heap, using placement new.
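    #include <new> // placement new

    typedef float v4sf __attribute__((vector_size(16)));
    class Unit
    {
    public:
        Unit(int num) : num(num)
        {
            // One contiguous chunk that holds every array back to back
            buffer = new char[num * sizeof(*u) + num * sizeof(*v)
                              + num * sizeof(*t) /*..and so on for other variables..*/];
            u = new (buffer) float[num];
            v = new (buffer + sizeof(float) * num) float[num];
            t = new (buffer + 2 * sizeof(float) * num) float[num];
            // And so on for other variables
        }
        void update()
        {
            // Processes 4 floats per iteration; assumes num is a multiple of 4
            // and that every array is 16-byte aligned.
            for (int i = 0; i < num; i += 4)
            {
                *(v4sf*)&u[i] = *(v4sf*)&v[i] + *(v4sf*)&t[i];
                // many other equations
            }
        }
        int num;
        char* buffer;
        float *u, *v, *t; // and many other variables
    };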
However, approach 2 is 2x faster. Why is that?
There are around 12 float variables and num is 500K. update() is called 1k times. The timing doesn't include the memory allocation. I measure the speed like this:
    double start = getTime();
    for (int i = 0; i < 1000; i++)
    {
        unit->update();
    }
    double end = getTime();
    cout << end - start;
And this is around 2x faster in approach 2.
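For anyone who wants to reproduce the measurement without my getTime() helper, here is a minimal sketch of the same timing loop. It is only an illustration: time_updates and Counter are hypothetical names that are not in my source, and std::chrono::steady_clock is an assumed stand-in for whatever getTime() does.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not in the original source): calls obj.update()
// `iterations` times and returns the elapsed wall-clock time in seconds.
// Allocation happens before this is called, so it is not timed.
template <typename T>
double time_updates(T& obj, int iterations)
{
    assert(iterations >= 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; i++)
    {
        obj.update();
    }
    auto end = std::chrono::steady_clock::now();
    return std::chrono::duration<double>(end - start).count();
}

// Hypothetical stand-in for Unit, used only to exercise the helper.
struct Counter
{
    int calls = 0;
    void update() { ++calls; }
};
```

With a real Unit this would be called as time_updates(*unit, 1000) for each approach, and the two results compared.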
Compiler options: gcc -msse4 -O3 -ftree-vectorize.
L2 cache is 256K per core (L1 data cache is 32K per core), RAM is 8GB, page size is 4K.
Edit: Corrected the mistake in allocating the variables in approach 2. All variables are now correctly allocated in separate, non-overlapping sections of the buffer. The processor is an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz.
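To make the corrected layout concrete, here is a rough sketch of how the sections can be carved out of one buffer and checked. Arena, section and aligned16 are hypothetical names used only for illustration and are not in my actual source. The sketch assumes every array has the same length num, and that num * sizeof(float) is a multiple of 16 so each section keeps the alignment of the buffer start.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>

// Hypothetical helper (not in the original source): owns one heap buffer
// and hands out `count` float arrays of `num` elements, placed back to back.
struct Arena
{
    std::size_t num;    // elements per array
    std::size_t count;  // number of arrays
    char* buffer;       // single contiguous allocation

    Arena(std::size_t num, std::size_t count)
        : num(num), count(count), buffer(new char[num * sizeof(float) * count])
    {
    }
    ~Arena() { delete[] buffer; }
    Arena(const Arena&) = delete;
    Arena& operator=(const Arena&) = delete;

    // Returns the k-th array, constructed in place with placement new.
    float* section(std::size_t k)
    {
        assert(k < count);
        return new (buffer + k * num * sizeof(float)) float[num];
    }
};

// True if p is 16-byte aligned, which the v4sf casts in update() rely on.
inline bool aligned16(const void* p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}
```

In the Unit constructor this would look like u = arena.section(0); v = arena.section(1); t = arena.section(2); so that v starts exactly num floats after u, and t exactly num floats after v.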
Edit: Added the source here - Source. Approach 1 gives 69.58s and approach 2 gives 46.74s. Though not 2x faster, approach 2 is still noticeably faster (about 1.5x).