
I am trying to optimize some C code, specifically a critical loop which takes almost 99.99% of total execution time. Here is the loop:

#pragma omp parallel shared(NTOT,i) num_threads(4)
{
  # pragma omp for private(dx,dy,d,j,V,E,F,G) reduction(+:dU) nowait
  for(j = 1; j <= NTOT; j++){
    if(j == i) continue;
    dx = (X[j][0]-X[i][0])*a;
    dy = (X[j][1]-X[i][1])*a;
    d = sqrt(dx*dx+dy*dy);
    V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);
    E = dS[0]*dx+dS[1]*dy;
    F = spin[2*j-2]*dx+spin[2*j-1]*dy;
    G = -3*(D/(d*d*d*d*d))*E*F;
    dU += (V+G);
  }
}

All variables are local. The loop takes 0.7 seconds for NTOT=3600, which is a large amount of time, especially since I have to do this 500,000 times in the whole program, resulting in 97 hours spent in this loop. My question is: are there other things that could be optimized in this loop?

My computer's processor is an Intel Core i5 with 4 CPUs (4×1600 MHz) and 3072 KB of L3 cache.

Celeo
Rami Zouari
  • As posted, your code has an unbalanced `{`. You have `if(j == i) continue;`. Did you mean `if(j == i) {`? Please edit your post as it's not possible to tell if you want the body of the loop to be executed on `j == i` or `j != i` – Craig Estey Nov 17 '15 at 20:27
  • I want the body of the loop not to be executed for i == j, because d would equal 0 in that case and I would get a floating point exception, and for physical reasons also. – Rami Zouari Nov 17 '15 at 20:31
  • One way you could optimize this, provided dx and dy are always whole numbers, is to build a sqrt lookup table; as long as dx and dy are not huge numbers, it'll take up a little bit of memory but will provide a lot of speed. – Knight0fDragon Nov 17 '15 at 20:34
  • Yes, it could be, but it would be a huge tab[NTOT][NTOT] which cannot fit in cache. – Rami Zouari Nov 17 '15 at 20:40
  • Can you post more of your code? (e.g. definitions and usage of all variables). I presume this is wrapped in `for (i = 1; i <= NTOT; ++i)`? Can you prescale `X` by `a` to create an `Xa` array? If you post enough code, I can download it, and I have a custom benchmark suite that can provide detailed analysis of each line of your code. I can already see some possible optimizations irrespective of OpenMP. Also, how about some test data or a method to generate it? The sqrt looks like a hotspot, but you may be memory bound (e.g. fetching the X and spin arrays takes more time). Measure to know. – Craig Estey Nov 17 '15 at 22:07
  • Optimization will depend heavily on what compiler you're using, so posting that along with the optimization options will help. One obvious optimization would be using fixed-point math instead of floating point though. There could be others depending on the optimizations the compiler already does. – Jason Nov 17 '15 at 22:10
  • Can you afford approximate versions of the mathematical functions at the cost of precision? What data types are V/E/F/G/D, vectors? How about all the arrays involved? – Nov 18 '15 at 05:46
  • There is no way 3600 elements takes nearly a second on your processor. You must have an outer loop over `i`. You can probably exploit the fact that `dx_i,j = -dx_j,i` to cut your total iterations in half. Since `spin` only depends on `j`, you can operate on strips of `spin` that fit in the cache rather than rereading after each row. You can use SIMD by storing the points in xxxxyyyy format and then calculating the sqrt for four elements at once. You can simplify your power functions as well. – Z boson Nov 18 '15 at 21:05

4 Answers


Optimize for hardware or software?

Soft:

Getting rid of time-consuming exceptions such as divide-by-zero:

d = sqrt(dx*dx + dy*dy + 0.001f);   /* small epsilon keeps d away from zero */
V = (D/(d*d*d))*(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1]);

You could also try John Carmack, Terje Mathisen, and Gary Tarolli's "fast inverse square root" for the

   D/(d*d*d)

part. You get rid of the division too.

  float qrsqrt=q_rsqrt(dx*dx+dy*dy + easing);
  qrsqrt=qrsqrt*qrsqrt*qrsqrt * D;

at the cost of some precision.

There is another division to get rid of as well:

(D/(d*d*d*d*d))

such as

 qrsqrt_to_the_power2 * qrsqrt_to_the_power3 * D   /* = D/(d*d*d*d*d) */

Here is the fast inverse sqrt:

float Q_rsqrt( float number )
{
    long i;
    float x2, y;
    const float threehalfs = 1.5F;

    x2 = number * 0.5F;
    y  = number;
    i  = * ( long * ) &y;                       // evil floating point bit level hacking
    i  = 0x5f3759df - ( i >> 1 );               // what ?
    y  = * ( float * ) &i;
    y  = y * ( threehalfs - ( x2 * y * y ) );   // 1st iteration
//  y  = y * ( threehalfs - ( x2 * y * y ) );   // 2nd iteration, this can be removed

    return y;
}
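
Putting these pieces together, a single Q_rsqrt call can feed both the D/d^3 and the D/d^5 terms. A sketch (`invd`, `invd2`, and `invd3` are names introduced here, not from the question):

    float invd  = Q_rsqrt(dx*dx + dy*dy + easing);  /* approx 1/d  */
    float invd2 = invd * invd;                      /* 1/d^2       */
    float invd3 = invd2 * invd;                     /* 1/d^3       */
    V = D * invd3 * (dS[0]*spin[2*j-2] + dS[1]*spin[2*j-1]);
    G = -3 * D * invd3 * invd2 * E * F;             /* D/d^5, no division */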

To overcome big arrays' poor caching behaviour, you can do the computation in smaller patches/groups, especially when it is a many-to-many O(N*N) algorithm, such as:

get 256 particles;
compute the 256 × 256 interactions;
save the 256 results into variables;

select another 256 particles as targets (keeping the first group of 256 in place);
do the same calculations, but this time group 1 vs group 2;

save the first 256 results again;

move on to the 3rd group;

repeat;

do the same until all particles have been paired against the first 256 particles.

Now take the second group of 256.

Iterate until all groups of 256 are complete.

Your CPU has a big L3 cache, so you could try 32k particles versus 32k particles directly, but L1 is not that big, so I would stick with 512 vs 512 (or 500 vs 500 to avoid cache-line conflicts; this is going to depend on the architecture) if I were you.
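
A minimal sketch of this blocking scheme, assuming a hypothetical `pair_term(i, j)` helper that returns the V+G contribution of one (i, j) pair and a `dU` array holding one accumulator per particle:

    #define TILE 512   /* tune so one tile of X and spin stays in L1/L2 */

    extern double pair_term(int i, int j);   /* assumed: V+G for one pair */

    void accumulate_blocked(double *dU, int NTOT)
    {
        for (int jb = 1; jb <= NTOT; jb += TILE) {
            int jend = (jb + TILE - 1 < NTOT) ? (jb + TILE - 1) : NTOT;
            /* the j-tile stays hot in cache while every i sweeps over it */
            for (int i = 1; i <= NTOT; i++)
                for (int j = jb; j <= jend; j++)
                    if (j != i)
                        dU[i] += pair_term(i, j);
        }
    }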

Hard:

SSE, AVX, GPGPU, FPGA, ...

As @harold commented, SSE should be the starting point for comparison: you should vectorize, or at least parallelize through 4-wide packed vector instructions, which have the advantage of optimal memory fetching and pipelining. When you need 3x-10x more performance (on top of an SSE version using all cores), you will need an OpenCL/CUDA-capable GPU (priced about the same as an i5) and the OpenCL (or CUDA) API; you could also learn OpenGL, but it seems harder (maybe DirectX is easier).

Trying SSE is easiest and should be about 3x faster than the fast inverse square root I mentioned above. An equally priced GPU should give at least another 3x over SSE for thousands of particles. At or over 100k particles, a whole GPU can achieve 80x the performance of a single CPU core for this type of algorithm, once you optimize it enough (making it less dependent on main memory). OpenCL gives you the ability to place your arrays in cache-like local memory, so you can use terabytes per second of bandwidth.
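
For the SSE route, a hedged sketch of the 4-wide reciprocal square root, using the `_mm_rsqrt_ps` intrinsic (the Newton-Raphson step is optional and recovers most of the lost precision):

    #include <xmmintrin.h>   /* SSE intrinsics */

    /* Approximate 1/d for four squared distances at once.
       _mm_rsqrt_ps gives a ~12-bit estimate; one Newton-Raphson step,
       r = 0.5*r*(3 - d2*r*r), refines it to near single precision. */
    static inline __m128 rsqrt4(__m128 d2)
    {
        __m128 r  = _mm_rsqrt_ps(d2);
        __m128 rr = _mm_mul_ps(r, r);
        r = _mm_mul_ps(_mm_mul_ps(_mm_set1_ps(0.5f), r),
                       _mm_sub_ps(_mm_set1_ps(3.0f), _mm_mul_ps(d2, rr)));
        return r;
    }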

huseyin tugrul buyukisik
  • Why not `_mm_rsqrt_ss` then? This is 2015 after all, SSE is a thing. – harold Nov 17 '15 at 22:52
  • Then why not OpenCL, and send it to the GPU, since it is C99? If this i5 has an Intel HD 4000 or better, then it will be faster. – huseyin tugrul buyukisik Nov 17 '15 at 22:53
  • There could be reasons, and I don't see how C99 is related. But cool as that hack is, `_mm_rsqrt_ss` is both more accurate and faster. – harold Nov 17 '15 at 22:59
  • Thank you all for your responses, I will look into all these concepts; I am new to optimization. But I think OpenCL and CUDA work only on Nvidia cards, is that right? I have an AMD Radeon. Anyway, I am satisfied with the results that I got with OpenMP, and if I can access a parallel computer I will combine it with MPI, which will be nice. I just want to do some physics; I am not specialized in computer science. But I will try my best to implement your ideas. – Rami Zouari Nov 17 '15 at 23:16
  • What's your GPU? Does that i5 have an iGPU? OpenCL works on all of them (including Intel). CUDA works on Nvidia. AMD has plans to convert CUDA to AMD-IL and back. – huseyin tugrul buyukisik Nov 17 '15 at 23:17
  • Really, I don't know. I have two cards, Intel HD and AMD Radeon. I didn't find an appropriate driver for the AMD card on Ubuntu 12.04. Anyway, thank you all for your help; I will try the optimizations you gave me, Huseyin, and then maybe I will think about graphical solutions. – Rami Zouari Nov 17 '15 at 23:36
  • You can check at compubench dot com whether your devices support OpenCL on that operating system. – huseyin tugrul buyukisik Nov 17 '15 at 23:39

I would always do random pausing to pin down exactly which lines are most costly. Then, after fixing something, I would do it again, to find another fix, and so on.

That said, some things look suspicious. People will say the compiler's optimizer should fix these, but I never rely on that if I can help it.

  • `X[i]`, `X[j]`, and `spin[2*j-2]`/`spin[2*j-1]` look like candidates for pointers. There is no need to do this index arithmetic on every iteration and then hope the optimizer can remove it (see the sketch after this list).

  • You could define a variable `d2 = dx*dx+dy*dy` and then say `d = sqrt(d2)`. Then wherever you have `d*d` you can instead write `d2`.

  • I suspect a lot of samples will land in the `sqrt` function, so I would try to figure out a way around using it.

  • I do wonder if some of these quantities, like `(dS[0]*spin[2*j-2]+dS[1]*spin[2*j-1])`, could be calculated in a separate unrolled loop outside this loop. In some cases two loops can be faster than one if the compiler can save some registers.
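
A minimal sketch of the first two points combined, assuming the same types and array layout as in the question (`Xi`, `Xj`, `sj`, and `d2` are names introduced here):

    const double *Xi = X[i];                /* hoist the i-row once          */
    for (j = 1; j <= NTOT; j++) {
        if (j == i) continue;
        const double *Xj = X[j];            /* one row lookup per iteration  */
        const double *sj = &spin[2*j - 2];  /* sj[0], sj[1] replace the 2*j math */
        double dx = (Xj[0] - Xi[0]) * a;
        double dy = (Xj[1] - Xi[1]) * a;
        double d2 = dx*dx + dy*dy;          /* reuse d2 wherever d*d appeared */
        double d  = sqrt(d2);
        double V  = (D / (d2 * d)) * (dS[0]*sj[0] + dS[1]*sj[1]);  /* d^3 = d2*d   */
        double E  = dS[0]*dx + dS[1]*dy;
        double F  = sj[0]*dx + sj[1]*dy;
        double G  = -3.0 * (D / (d2 * d2 * d)) * E * F;            /* d^5 = d2*d2*d */
        dU += V + G;
    }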

Mike Dunlavey
  • Pausing a loop like this one in optimized code would not help because of out-of-order execution. A modern CPU can fit the whole loop body into the reorder buffer, I guess even across several subsequent loop iterations, so it is not so clear where the CPU will stop most of the time. – stgatilov Nov 18 '15 at 04:24
  • @stgatilov: That's why I don't do it in optimized code. First I do *my* optimizing. Then I let the compiler do *its* optimizing. It won't be able to do mine, and I won't be able to do its. – Mike Dunlavey Nov 18 '15 at 12:44

I cannot believe that 3600 iterations of an O(1) loop body can take 0.7 seconds. Perhaps you meant the double loop with 3600 * 3600 iterations? Otherwise I can suggest checking whether optimization is enabled and how long thread spawning takes.

General

Your inner loop is very simple and contains only a few operations. Note that divisions and square roots are roughly 15-30 times slower than additions, subtractions, and multiplications. You are doing three such operations, so most of the time is eaten by them.

First of all, you can compute the reciprocal square root in one operation instead of computing the square root and then taking its reciprocal. Second, you should save the result and reuse it where necessary (right now you divide by d twice). This would result in one problematic operation per iteration instead of three.

invD = rsqrt(dx*dx+dy*dy);    /* rsqrt = reciprocal square root, approx 1/d */
V = (D * (invD*invD*invD))*(...);                /* D/d^3 with no division  */
...
G = -3*(D * (invD*invD*invD*invD*invD))*E*F;     /* D/d^5 with no division  */
dU += (V+G);

In order to further reduce the time taken by rsqrt, I advise vectorizing it. I mean: compute rsqrt for two or four input values at once with SSE. Depending on the size of your arguments and the desired precision of the result, you can take one of the routines from this question. Note that it contains a link to a small GitHub project with all the implementations.

Indeed, you can go further and vectorize the whole loop with SSE (or even AVX); that is not hard. A sketch follows.
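
A hedged sketch of the loop structure, assuming the data is rearranged into float structure-of-arrays form (`Xx` and `Xy` are hypothetical arrays holding the positions already prescaled by `a`, as one comment under the question suggests; masking the j == i lane and handling the loop remainder are omitted):

    #include <xmmintrin.h>

    __m128 xi = _mm_set1_ps(Xx[i]);     /* broadcast the i-th position      */
    __m128 yi = _mm_set1_ps(Xy[i]);
    for (int j = 0; j + 4 <= NTOT; j += 4) {
        __m128 dx = _mm_sub_ps(_mm_loadu_ps(&Xx[j]), xi);
        __m128 dy = _mm_sub_ps(_mm_loadu_ps(&Xy[j]), yi);
        __m128 d2 = _mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy));
        __m128 invd = _mm_rsqrt_ps(d2); /* approx 1/d for four j at once    */
        /* ... build V and G for four pairs from powers of invd,
           and accumulate into a 4-lane partial sum of dU ... */
    }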

OpenCL

If you are ready to use a big framework, then I suggest using OpenCL. Your loop is very simple, so you won't have any problems porting it (apart from the initial adaptation to OpenCL itself).

Then you can use CPU implementations of OpenCL, e.g. from Intel or AMD. Both of them would automatically use multithreading. Also, they are likely to automatically vectorize your loop (e.g. see this article). Finally, there is a chance that they would find a good implementation of rsqrt automatically, if you use the native_rsqrt function or something like that.

Also, you would be able to run your code on a GPU. If you use single precision, it may result in a significant speedup. If you use double precision, then it is not so clear: modern consumer GPUs are often slow with double precision, because they lack the necessary hardware.
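
A hedged sketch of what such a kernel might look like in OpenCL C (the buffer names, the float2 layout, and the prescaling of positions by `a` are assumptions; the host would sum the per-j partial results):

    /* One work-item per j; the j == i self term is masked out. */
    __kernel void pair_terms(__global const float2 *Xa,  /* positions, prescaled by a */
                             __global const float2 *s,   /* spins as (x, y) pairs     */
                             const float2 dS, const float D,
                             const int i, __global float *out)
    {
        int j = get_global_id(0);
        float2 dr = Xa[j] - Xa[i];
        float d2 = dot(dr, dr);
        float invd  = native_rsqrt(d2);            /* fast approximate 1/d */
        float invd3 = invd * invd * invd;
        float V = D * invd3 * dot(dS, s[j]);
        float G = -3.0f * D * invd3 * invd * invd * dot(dS, dr) * dot(s[j], dr);
        out[j] = (j == i) ? 0.0f : (V + G);        /* mask the self term   */
    }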

stgatilov
  • Thank you for your response. There are new concepts here that I didn't know before, so I have to do some research to understand them, and then I will decide what to do. Thank you all for your responses; they were so helpful for me. – Rami Zouari Nov 18 '15 at 12:46

Minor optimisations:

  • `d*d*d` is effectively calculated twice (once for d^3 and again inside d^5). Store `d*d` and reuse it for d^3 and d^5 (see the sketch below).
  • Replace `2*x` with `x<<1`.
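
A minimal sketch of the first point (`d2`, `d3`, and `d5` are names introduced here):

    double d2 = dx*dx + dy*dy;   /* computed once            */
    double d  = sqrt(d2);
    double d3 = d2 * d;          /* d^3 without extra powers */
    double d5 = d3 * d2;         /* d^5 reuses d3            */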
thepace
  • I think replacing `2*x` with `x<<1` would make the code harder to read, but **not** faster. Compilers are not that stupid these days. – stgatilov Nov 18 '15 at 10:17