I have a program that is almost spending all his time computing loops like
for(int j = 0; j < BIGNUMBER; j++)
for(int i = 0; i < SMALLNUMBER; i++)
result += var[i] / sqrt((A[i].x-B[j].x)*(A[i].x-B[j].x)+(A[i].y-B[j].y)*(A[i].y-B[j].y)+(A[i].z-B[j].z)*(A[i].z-B[j].z));
Where 1.0/sqrt(...)
computes the inverse of the norm of the difference between the two vectors A[i] = {A[i].x, A[i].y, A[i].z}
and B[j] = {B[j].x, B[j].y, B[j].z}
which is also the most costly part of the loop.
Is there any way to optimize the loop, even with some precision loss?
Update:
Here the assembly code of an unvectorized loop step with the worse latency of every instruction. You clearly see that the inverse square root is the bottleneck:
movsd A(%rip), %xmm0 1
movsd A+8(%rip), %xmm2 1
subsd B(%rip), %xmm0 3
subsd B+8(%rip), %xmm2 3
movsd A+16(%rip), %xmm1 1
mulsd %xmm0, %xmm0 5
subsd B+16(%rip), %xmm1 3
mulsd %xmm2, %xmm2 5
mulsd %xmm1, %xmm1 5
addsd %xmm2, %xmm0 3
addsd %xmm1, %xmm0 3
movsd .LC0(%rip), %xmm1 1
unpcklpd %xmm0, %xmm0 1
cvtpd2ps %xmm0, %xmm0 4
unpcklps %xmm0, %xmm0 3
cvtps2pd %xmm0, %xmm0 2
sqrtsd %xmm0, %xmm0 58
divsd %xmm0, %xmm1 32
mulsd var(%rip), %xmm1 5
addsd result(%rip), %xmm1 3
cvttsd2si %xmm1, %eax 3
movsd %xmm1, result(%rip) 1
(By the way I don't understand why is it doing unpcklpd cvtpd2ps unpcklps cvtps2pd
.)