2

I have a program that spends almost all of its time computing loops like

for(int j = 0; j < BIGNUMBER; j++)
    for(int i = 0; i < SMALLNUMBER; i++)
        result += var[i] / sqrt((A[i].x-B[j].x)*(A[i].x-B[j].x)+(A[i].y-B[j].y)*(A[i].y-B[j].y)+(A[i].z-B[j].z)*(A[i].z-B[j].z));

where `1.0/sqrt(...)` computes the inverse of the norm of the difference between the two vectors `A[i] = {A[i].x, A[i].y, A[i].z}` and `B[j] = {B[j].x, B[j].y, B[j].z}`, which is also the most costly part of the loop.
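
In other words, the loops accumulate the double sum

$$\mathrm{result} = \sum_{j=0}^{\mathrm{BIGNUMBER}-1} \sum_{i=0}^{\mathrm{SMALLNUMBER}-1} \frac{\mathrm{var}_i}{\lVert A_i - B_j \rVert}$$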

Is there any way to optimize the loop, even with some precision loss?


Update:

Here is the assembly code of one unvectorized loop iteration, annotated with the worst-case latency of every instruction. You can clearly see that the inverse square root is the bottleneck:

movsd   A(%rip), %xmm0      1
movsd   A+8(%rip), %xmm2    1
subsd   B(%rip), %xmm0      3
subsd   B+8(%rip), %xmm2    3
movsd   A+16(%rip), %xmm1   1
mulsd   %xmm0, %xmm0        5
subsd   B+16(%rip), %xmm1   3
mulsd   %xmm2, %xmm2        5
mulsd   %xmm1, %xmm1        5
addsd   %xmm2, %xmm0        3
addsd   %xmm1, %xmm0        3
movsd   .LC0(%rip), %xmm1   1
unpcklpd    %xmm0, %xmm0    1
cvtpd2ps    %xmm0, %xmm0    4
unpcklps    %xmm0, %xmm0    3
cvtps2pd    %xmm0, %xmm0    2
sqrtsd  %xmm0, %xmm0        58
divsd   %xmm0, %xmm1        32
mulsd   var(%rip), %xmm1    5
addsd   result(%rip), %xmm1 3
cvttsd2si   %xmm1, %eax     3
movsd   %xmm1, result(%rip) 1

(By the way, I don't understand why it is doing `unpcklpd` `cvtpd2ps` `unpcklps` `cvtps2pd`.)

Kyle_the_hacker
  • Your question should be moved to http://math.stackexchange.com – fluminis May 06 '14 at 11:05
  • @fluminis, are you sure? – Kyle_the_hacker May 06 '14 at 12:03
  • Are you sure `1./sqrt(x)` is the bottleneck? If so, can you avoid it? (The rest of the computation can be sped up by using some linear algebra, but that part cannot.) – Fred Foo May 06 '14 at 12:10
  • OK, assuming I can add those latencies, they total 146 cycles, so about 40% is in `sqrtsd`, 22% is in `divsd`, and 38% is in everything else, so there might still be room for improvement outside of the `sqrtsd`. The trouble with the word "bottleneck" is it inclines you to look in just one spot, and possibly overlook bigger stuff just because it happens to be diffuse. – Mike Dunlavey May 06 '14 at 13:49
  • It's hard to tell what your data types are, but from the generated code it looks like you're mixing float and double. If you can restrict all your data, calculations and function calls to single precision float then you should get better performance. – Paul R May 06 '14 at 14:20
  • @PaulR everything is double precision. I don't understand the `cvtpd2ps` and `cvtps2pd` either. – Kyle_the_hacker May 06 '14 at 14:23

5 Answers

3

It is tempting to assume sqrt is the "bottleneck", but if you profile you can find out for sure. It could well be that

  (A[i].x-B[j].x)*(A[i].x-B[j].x)
+ (A[i].y-B[j].y)*(A[i].y-B[j].y)
+ (A[i].z-B[j].z)*(A[i].z-B[j].z)

is the bottleneck instead.

If you are depending on the compiler to optimize all those indexing operations, it might, but I like to be sure. As a first cut, I would write this in the inner loop:

// where Foo is the class of A[i] and B[j]
Foo& a = A[i];
Foo& b = B[j];
double dx = a.x - b.x, dy = a.y - b.y, dz = a.z - b.z;
result += var[i] / sqrt( dx*dx + dy*dy + dz*dz );

which is much easier for the compiler to optimize. Then, if indexing is still a bottleneck, I would lift `Foo& b = B[j]` out of the inner loop and step a pointer rather than writing `A[i]`. If it gets to the point where samples show that the inner `for` statement itself (testing for termination and incrementing) takes a noticeable fraction of time, I would unroll it some.
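
A minimal sketch of those two follow-up transformations (hoisting `B[j]` and stepping a pointer over `A`), assuming `A` and `B` are plain arrays of `Foo`:

for (int j = 0; j < BIGNUMBER; j++) {
    const Foo& b = B[j];                        // loop-invariant in the inner loop
    const Foo* a = A;                           // stepped pointer replaces A[i]
    for (int i = 0; i < SMALLNUMBER; i++, ++a) {
        double dx = a->x - b.x, dy = a->y - b.y, dz = a->z - b.z;
        result += var[i] / sqrt(dx*dx + dy*dy + dz*dz);
    }
}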

Mike Dunlavey
3

If you can arrange your vectors in AoSoA form (xxyyzzxxyyzz... for SSE2 doubles, or xxxxyyyyzzzz... for AVX), you can do this very efficiently with SSE or AVX. In the code below I assumed SSE2, which has vec_size=2, but it's easy to change this to AVX. Note that your code is likely memory bound rather than compute bound, so this is only going to be useful for small loops which fit in the L1 cache. It will also be faster to use single-precision float, since each vector then does twice the number of flops, and sqrt is one of the few functions which is actually slower for double than for float.

// requires <emmintrin.h> (SSE2) and <pmmintrin.h> (SSE3, for _mm_hadd_pd)
__m128d resultv = _mm_setzero_pd();  // _mm_setzero_pd takes no argument
for(int j = 0; j < BIGNUMBER; j += vec_size) {
    __m128d bx = _mm_load_pd(&B[3*j+0*vec_size]);  // loads assume 16-byte alignment
    __m128d by = _mm_load_pd(&B[3*j+1*vec_size]);
    __m128d bz = _mm_load_pd(&B[3*j+2*vec_size]);
    for(int i = 0; i < SMALLNUMBER; i += vec_size) {
        __m128d ax = _mm_load_pd(&A[3*i+0*vec_size]);
        __m128d ay = _mm_load_pd(&A[3*i+1*vec_size]);
        __m128d az = _mm_load_pd(&A[3*i+2*vec_size]);
        __m128d dx = _mm_sub_pd(ax, bx);
        __m128d dy = _mm_sub_pd(ay, by);
        __m128d dz = _mm_sub_pd(az, bz);
        __m128d mag2 = _mm_add_pd(_mm_add_pd(_mm_mul_pd(dx,dx), _mm_mul_pd(dy,dy)), _mm_mul_pd(dz,dz));
        __m128d varv = _mm_load_pd(&var[i]);
        resultv = _mm_add_pd(_mm_div_pd(varv, _mm_sqrt_pd(mag2)), resultv);
        // note: there is no _mm_rsqrt_pd; the fast rsqrt only exists for floats (_mm_rsqrt_ps)
    }
}
result = _mm_cvtsd_f64(_mm_hadd_pd(resultv, resultv));  // horizontal sum of the two lanes
Z boson
  • The target CPU unfortunately doesn't have AVX. – Kyle_the_hacker May 06 '14 at 15:32
  • @Kyle_the_hacker, that's okay: the code I posted was for SSE2. The main advantage of my code compared to yours is that yours does the square root every iteration, while mine only does it once per vec_size iterations. If you used float with SSE2, that would mean once every four iterations. – Z boson May 07 '14 at 08:40
  • Yes, I understood. It was only a comment. ;) I will try to implement AoSoA in my program. Thanks. – Kyle_the_hacker May 07 '14 at 10:11
  • 1
    @Kyle_the_hacker, it should be easy to convert the code I posted to float. Change all the `pd` extensions to `ps` and set vec_size=4. The last line `result =` will have to be done differently, though. – Z boson May 07 '14 at 11:19
  • Perhaps iterating by `i` in the outer loop is more efficient (`var[i]` can be loaded once, and the inner loop is longer). Another possibility for optimization is to use the fast but imprecise `_mm_rsqrt_ps` instruction, then add one or two Newton-Raphson iterations, depending on the precision required (see the sketch below). – stgatilov Jul 17 '15 at 12:46
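
A minimal sketch of that last suggestion, assuming single-precision data (the helper name `rsqrt_nr` is made up here):

#include <xmmintrin.h>

// Refine the ~12-bit estimate from _mm_rsqrt_ps with one Newton-Raphson
// step: y1 = y0 * (1.5 - 0.5 * x * y0 * y0).
static inline __m128 rsqrt_nr(__m128 x)
{
    __m128 y = _mm_rsqrt_ps(x);
    __m128 t = _mm_mul_ps(_mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), x), y), y); // 0.5*x*y*y
    return _mm_mul_ps(y, _mm_sub_ps(_mm_set1_ps(1.5f), t));
}
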
2

In a comment you said the bottleneck is still the computation of the inverse square root. Luckily for you, that is something that comes up a lot in graphics, and there is a very fancy algorithm for it: there is a Wikipedia article about it, an implementation in Quake, and a SO question.
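
For reference, here is a sketch of the well-known routine from the Quake III source; as the comments below point out, on modern x86 the `rsqrtss` instruction is faster:

#include <stdint.h>
#include <string.h>

float Q_rsqrt(float number)
{
    float x2 = number * 0.5f, y = number;
    int32_t i;
    memcpy(&i, &y, sizeof i);        // reinterpret the float's bits as an integer
    i = 0x5f3759df - (i >> 1);       // magic constant gives a first approximation
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - x2 * y * y);  // one Newton-Raphson iteration
}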

Davidmh
  • I'm aware of the Quake implementation. On modern CPUs it is as fast as the normal `1.0f/sqrt(...)` call; it is not an optimization any more. – Kyle_the_hacker May 06 '14 at 12:08
  • Then your compiler is computing sqrt using SIMD. If the bottleneck is the square root, I don't think you can get much faster than this. Perhaps Yeppp! (http://www.yeppp.info/benchmarks.html) or MKL can help with faster implementations, but don't expect a big change. – Davidmh May 06 '14 at 12:11
  • It looks like Yeppp! doesn't have any implementation of the square root, and I don't have access to MKL. – Kyle_the_hacker May 06 '14 at 13:23
  • 2
    "The algorithm generates reasonably accurate results using a unique first approximation for Newton's method; however, it is much slower and less accurate than using the SSE instruction rsqrtss on x86 processors also released in 1999." – phuclv May 06 '14 at 14:24
1

This is a good workload for SIMD (SSE) instructions. If your compiler supports auto-vectorization, turn that option on (and tune your data structure layout for real gains). If it supports intrinsics, you may use them.

Edit
The assembler code above doesn't exploit vectorization; it uses scalar SSE instructions. You may try to help the compiler a bit: make structures of (X, Y, Z, Dummy) floats, not doubles, as sketched below. SSE vector instructions can process 4 floats simultaneously (or 2 doubles).
(I think some math libraries already contain SIMD functions for vector norm.)
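
A minimal sketch of that layout (the name `Point4` is hypothetical):

// Padded point: the dummy field keeps the stride at 4 floats (16 bytes),
// so one aligned SSE load (_mm_load_ps) fetches a whole point.
typedef struct {
    float x, y, z, dummy;
} Point4;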

P.S. You may add the SSE tag to your question.

MBo
  • It already does vectorization, but the bottleneck is still the computation of the inverse square root. – Kyle_the_hacker May 06 '14 at 11:56
  • Does generated assembly code contain SSE RCPSS/RSQRTSS (or packed variants) instructions or x87 SQRT? – MBo May 06 '14 at 12:42
  • I already tried computing `temp = (A[i].x-B[j].x)*(A[i].x-B[j].x)+(A[i].y-B[j].y)*(A[i].y-B[j].y)+(A[i].z-B[j].z)*(A[i].z-B[j].z)` and then `invSqrt = _mm_rsqrt_ss(_mm_load_ss(&temp))[0]`; this broke the auto-vectorization, made the asm code longer, and was only a tiny bit faster in the end. – Kyle_the_hacker May 06 '14 at 13:36
  • @Kyle_the_hacker doing it in multiple phases may make it easier to vectorize: calculate the denominators and store them in a separate array first, then do the inverse square root on that array in another loop, and multiply `var[i]` by the inverse square root value (see the sketch below). – phuclv May 06 '14 at 14:31
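
A rough sketch of that two-phase idea, using the names from the question (assumes SMALLNUMBER is a compile-time constant):

// Phase 1: squared distances into a temporary array; phase 2: the inverse
// square roots and the accumulation. Each loop stays simple enough for the
// auto-vectorizer.
double mag2[SMALLNUMBER];
for (int j = 0; j < BIGNUMBER; j++) {
    for (int i = 0; i < SMALLNUMBER; i++) {
        double dx = A[i].x - B[j].x, dy = A[i].y - B[j].y, dz = A[i].z - B[j].z;
        mag2[i] = dx*dx + dy*dy + dz*dz;
    }
    for (int i = 0; i < SMALLNUMBER; i++)
        result += var[i] / sqrt(mag2[i]);
}
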
0

By forcing the calculation into single precision with `(1.0f/sqrt(float(...)))` and using `#pragma GCC optimize ("Ofast")` for the function, I was able to get the `rsqrtss` instruction, with a nice speedup (around 2 times faster) on the whole function. It does break the auto-vectorization, though (probably because of the mix of single and double precision).
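
Roughly, the loop now looks like this (a reconstruction from the question's loop, not my exact code):

#pragma GCC optimize ("Ofast")  // lets GCC emit rsqrtss for 1.0f/sqrt(float)
double compute()
{
    double result = 0.0;
    for (int j = 0; j < BIGNUMBER; j++)
        for (int i = 0; i < SMALLNUMBER; i++) {
            double dx = A[i].x - B[j].x, dy = A[i].y - B[j].y, dz = A[i].z - B[j].z;
            // the float cast forces the inverse square root into single precision
            result += var[i] * (1.0f / sqrt(float(dx*dx + dy*dy + dz*dz)));
        }
    return result;
}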

Assembly code:

movsd   56(%rax), %xmm0     
addq    $120, %rax
movsd   -72(%rax), %xmm2    
subsd   %xmm5, %xmm0
movsd   -56(%rax), %xmm1    
subsd   %xmm4, %xmm2
subsd   %xmm6, %xmm1
mulsd   %xmm0, %xmm0
mulsd   %xmm2, %xmm2
mulsd   %xmm1, %xmm1
addsd   %xmm2, %xmm0
addsd   %xmm1, %xmm0
unpcklpd    %xmm0, %xmm0
cvtpd2ps    %xmm0, %xmm0
rsqrtss %xmm0, %xmm1
mulss   %xmm1, %xmm0
mulss   %xmm1, %xmm0
mulss   %xmm7, %xmm1
addss   %xmm8, %xmm0
mulss   %xmm1, %xmm0
mulss   -40(%rax), %xmm0
cmpq    %rdx, %rax
unpcklps    %xmm0, %xmm0
cvtps2pd    %xmm0, %xmm0
addsd   %xmm0, %xmm3

But I don't understand the additional multiplications at the end.

Kyle_the_hacker
  • 1
    Make sure `rsqrtss` is precise enough for you. I have used it in the past and its results were [too coarse in my case](https://stackoverflow.com/questions/17110227/flt-epsilon-for-a-nth-root-finder-with-sse-avx). You may need to do one iteration of Newton's method with it. If it's good enough, maybe you should consider using single-precision floating point in general. – Z boson May 07 '14 at 08:36
  • Yes, I know it uses a lookup table and that it is very approximate. I will check whether a Newton step makes significant changes to the results, but it should be OK. The other problem is that `A` and `B` come from another part of the program and I don't have much control over them; I have to check whether converting them on the fly to a single-precision AoSoA layout will improve speed (see the sketch below). – Kyle_the_hacker May 07 '14 at 10:25
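
A hypothetical sketch of that on-the-fly conversion (`repack`, the `Foo` layout, and the block size of 4 are assumptions; 4 floats fill one SSE register):

// Repack from an AoS array of double-precision points into the float
// AoSoA layout (xxxx yyyy zzzz per block of 4) used in the SSE answer.
// Assumes n is a multiple of 4 and dst holds 3*n floats.
void repack(const Foo *src, float *dst, int n)
{
    for (int base = 0; base < n; base += 4)
        for (int k = 0; k < 4; k++) {
            dst[3*base + 0 + k] = float(src[base + k].x);
            dst[3*base + 4 + k] = float(src[base + k].y);
            dst[3*base + 8 + k] = float(src[base + k].z);
        }
}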