Optimize for hardware or software?
Soft:
Getting rid of time consuming exceptions such as divide by zeros:
d = sqrt(dx*dx+dy*dy + 0.001f );
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
You could also try John Carmack , Terje Mathisen and Gary Tarolli 's "Fast inverse square root" for the
D/(d*d*d)
part. You get rid of division too.
float qrsqrt=q_rsqrt(dx*dx+dy*dy + easing);
qrsqrt=qrsqrt*qrsqrt*qrsqrt * D;
with sacrificing some precision.
There is another division also to be gotten rid of:
(D/(d*d*d*d*d))
such as
qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D
Here is the fast inverse sqrt:
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what ?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
To overcome big arrays' non-caching behaviour, you can do the computation in smaller patches/groups especially when is is many to many O(N*N) algorithm. Such as:
get 256 particles.
compute 256 x 256 relations.
save 256 results on variables.
select another 256 particles as target(saving the first 256 group in place)
do same calculations but this time 1st group vs 2nd group.
save first 256 results again.
move to 3rd group
repeat.
do same until all particles are versused against first 256 particles.
Now get second group of 256.
iterate until all 256's are complete.
Your CPU has big cache so you can try 32k particles versus 32k particles directly. But L1 may not be big so I would stick with 512 vs 512(or 500 vs 500 to avoid cache line ---> this is going to be dependent on architecture) if I were you.
Hard:
SSE, AVX, GPGPU, FPGA .....
As @harold commented, SSE should be start point to compare and you should vectorize or at least parallelize through 4-packed vector instructions which have advantage of optimum memory fetching ability and pipelining. When you need 3x-10x more performance(on top of SSE version using all cores), you will need an opencl/cuda compliant gpu(equally priced as i5) and opencl(or cuda) api or you can learn opengl too but it seems harder(maybe directx easier).
Trying SSE is easiest, should give 3x faster than the fast inverse I mentionad above. An equally priced gpu should give another 3x of SSE at least for thousands of particles. Going or over 100k particles, whole gpu can achieve 80x performance of a single core of cpu for this type of algorithm when you optimize it enough(making it less dependent to main memory). Opencl gives ability to address cache to save your arrays. So you can use terabytes/s of bandwidth in it.