4

In actual fact, it's the derivative of the Lennard Jones potential. The reason is that I am writing a Molecular Dynamics program, and at least 80% of the time is spent in the following function, even with the most aggressive compiler options (gcc -O3).

double ljd(double r) /* Derivative of Lennard Jones Potential for Argon with 
                        respect to distance (r) */ 
{  
    double temp;  
    temp = Si/r;  
    temp = temp*temp;
    temp = temp*temp*temp;  
    return ( (24*Ep/r)*(temp-(2 * pow(temp,2))) );  
}  

This code is in a file "functs.h", which I #include into my main file. I thought that using temporary variables in this way would make the function faster, but I am worried that creating them is too wasteful. Should I declare them static? Also, the code is parallelised with OpenMP, so I can't really declare temp as a global variable.

The constants Ep and Si are defined with #define. I have only been using C for about a month. I tried to look at the assembly code generated by gcc, but I was completely lost.

sn6uv
  • 599
  • 5
  • 19
  • You should also specify the processor you're using as different processors handle floating point in different ways. It's probably an Intel one you're using though. – Skizz Feb 07 '11 at 15:15
  • 2
    introduction of useless temporary variables will most-likely result in no performance gain and only hurt readability - a decent compiler should detect repeated use of values on its own... – Christoph Feb 07 '11 at 15:18
  • I am using an Intel core 2 duo processor – sn6uv Feb 07 '11 at 15:21
  • @Christoph because I need to calculate (Si/r)^7 and (Si/r)^13, isn't it better to minimise the number of multiplications? – sn6uv Feb 07 '11 at 15:25
  • Hmmm... Premature optimization is the source of...? – rubenvb Feb 07 '11 at 15:36
  • @rubenvb: At least Angus has made some measurements! – Skizz Feb 07 '11 at 15:53
  • @rubenvb I don't think that this is premature, as the program does what I originally intended it to, however it runs quite slowly, and according to gprof, this section of code is where the program spends most of its time. – sn6uv Feb 07 '11 at 15:54
  • @Skizz: true, must've missed that `:s`. @angus: my apologies. – rubenvb Feb 07 '11 at 15:55
  • 1
    I hope you're caching the results of the function intelligently, no need to do all that many times for the same value of r! – Skizz Feb 07 '11 at 15:57
  • @Angus Remember that sometimes the _truncated_ potential may save you _hours_ – Dr. belisarius Feb 07 '11 at 15:57
  • @belisarius yes, you're right; however, I have already implemented it elsewhere in the code. I probably should have mentioned that, but I wanted to keep the question clear/simple. – sn6uv Feb 07 '11 at 16:43
  • Others have gone over the main points (ie, use `-ffast-math`), but to clarify a misconception - static/global variables are almost always _slower_ than stack variables. This is because they are visible to other threads and/or functions (if global) and this may force the compiler to spill them to memory when not needed. On most "normal" platforms, making a local variable static/global/heap will _never_ give the compiler more freedom to optimize for CPU. – bdonlan Feb 07 '11 at 16:56

6 Answers

7

I would get rid of the call to pow() for a start:

double ljd(double r) /* Derivative of Lennard Jones Potential for Argon with 
                        respect to distance (r) */ 
{  
    double temp;  
    temp = Si / r;  
    temp = temp * temp;
    temp = temp * temp * temp;  
    return ( (24.0 * Ep / r) * (temp - (2.0 * temp * temp)) );  
}  
Paul R
  • 208,748
  • 37
  • 389
  • 560
  • I'd really be interested in how much speedup you get with just removing the call to `pow`. My guess is that it should be pretty impressive, but I don't like guessing performance aspects. – Alexandre C. Feb 07 '11 at 16:10
  • I saw about a 20% increase when removing the call to pow() – sn6uv Feb 07 '11 at 16:24
  • I suspect the speedup is mainly due to the fact that `pow()` has to do error-checking, which can be disabled... – Christoph Feb 07 '11 at 16:31
  • could it be that pow() treats 2 as a float rather than an int? – sn6uv Feb 07 '11 at 16:35
  • @Angus: that may well be the case; the underlying problem, however, is the one I hinted at above: if you don't enable `-ffast-math`, the compiler will generate a call to the generic `pow()` implementation (which provides all the fancy guarantees made by the C standard) instead of inlining an optimized version; generic `pow()` might or might not special-case integer powers, but if the former, it would have to do so at runtime! – Christoph Feb 07 '11 at 16:43
3

On my architecture (Intel Centrino Duo, MinGW-gcc 4.5.2 on Windows XP), non-optimized code using pow()

static inline double ljd(double r)
{
    return 24 * Ep / Si * (pow(Si / r, 7) - 2 * pow(Si / r, 13));
}

actually outperforms your version if -ffast-math is provided.

The generated assembly (using some arbitrary values for Ep and Si) looks like this:

fldl    LC0
fdivl   8(%ebp)
fld     %st(0)
fmul    %st(1), %st
fmul    %st, %st(1)
fld     %st(0)
fmul    %st(1), %st
fmul    %st(2), %st
fxch    %st(1)
fmul    %st(2), %st
fmul    %st(0), %st
fmulp   %st, %st(2)
fxch    %st(1)
fadd    %st(0), %st
fsubrp  %st, %st(1)
fmull   LC1
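
If you want to reproduce timings like the ones discussed in the comments, a rough harness along these lines can be used (this is my sketch, not part of the original answer; it assumes ljd() and the Si/Ep defines from above are visible, and the iteration count and r values are arbitrary). Compile once with and once without -ffast-math and compare.

#include <stdio.h>
#include <time.h>

/* assumes ljd() and the Si/Ep defines from above are visible here */
int main(void)
{
    double sum = 0.0;
    clock_t t0 = clock();
    for (long i = 0; i < 100000000L; i++)
        sum += ljd(3.0e-10 + (i % 1000) * 1.0e-13);  /* varying r keeps the call live */
    clock_t t1 = clock();
    printf("%.2f s (checksum %g)\n", (double)(t1 - t0) / CLOCKS_PER_SEC, sum);
    return 0;
}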
Christoph
  • 164,997
  • 36
  • 182
  • 240
  • I hadn't heard of that before. It is provided with my version of gcc 4.4.5 (Linux), and it does seem to provide a fair speed increase. I haven't tried measuring it properly yet. Thanks. – sn6uv Feb 07 '11 at 16:15
  • I ran a simple test: I got 12s using the version I originally posted; it sped up to 9s when passing -ffast-math to gcc; and using the version above with -ffast-math gives a time of 8s. – sn6uv Feb 07 '11 at 16:20
  • @Angus: most math functions are compiler-intrinsic and can be efficiently optimized, but won't be by default because of error-handling and other floating-point related guarantees specified by the C standard; be sure to compare the results with your reference implementation to make sure that using 'unsafe' operations won't impact program correctness... – Christoph Feb 07 '11 at 16:23
  • @Angus, @Christoph: although -ffast-math is impressive, you must be careful with it as it may produce incorrect code since it ignores various C language guides. For example: (Si/r)^7 would produce a different result to Si^7/r^7. For more info see http://msdn.microsoft.com/en-us/library/aa289157.aspx (although this is MS specific, there will be similar arguments for the gcc compiler) – Skizz Feb 10 '11 at 11:07
1

Well, as I've said before, compilers suck at optimising floating point code for many reasons. So, here's an Intel assembly version that should be faster (compiled using DevStudio 2005):

const double Si6 = /*whatever pow(Si,6) is*/;
const double Si_value = /*whatever Si is*/; /* need _value as Si is a register name! */
const double Ep24 = /*whatever 24.Ep is*/;

double ljd (double r)
{
  double result;
  __asm
  {
    fld qword ptr [r]
    fld st(0)
    fmul st(0),st(0)
    fld st(0)
    fmul st(0),st(0)
    fmulp st(1),st(0)
    fld qword ptr [Si6]
    fdivrp st(1),st(0)
    fld st(0)
    fld1
    fsub st(0),st(1)
    fsubrp st(1),st(0)
    fmulp st(1),st(0)
    fld qword ptr [Ep24]
    fmulp st(1),st(0)
    fdivrp st(1),st(0)
    fstp qword ptr [result]
  }

  return result;
}

This version will produce slightly different results to the version posted. The compiler will probably be writing intermediate results to RAM in the original code, which loses precision since the (Intel) FPU operates at 80 bits internally whereas the double type is only 64 bits. The above assembler does not lose precision in the intermediate results; everything is done at 80 bits, and only the final result is rounded to 64 bits.
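
For reference, here is a plain C sketch (mine, not part of the original answer) of the same arithmetic the assembly performs, precomputing Si^6 and 24*Ep once; it assumes Si and Ep expand to floating-point literals so the initialisers are constant expressions.

static const double Si6  = Si * Si * Si * Si * Si * Si;  /* Si^6 */
static const double Ep24 = 24.0 * Ep;

static inline double ljd_c(double r)
{
    double r2 = r * r;
    double r6 = r2 * r2 * r2;
    double t  = Si6 / r6;                 /* equals (Si/r)^6 */
    return Ep24 * (t - 2.0 * t * t) / r;
}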

Skizz
  • 69,698
  • 10
  • 71
  • 108
  • I've not used gcc's inline assembler - if anyone can append a version that will work with gcc compiler that would be much appreciated. – Skizz Feb 07 '11 at 16:00
  • That's very helpful, thank you Skizz. I have one question though. The value of Si is 3.405E-10, and r will be at most about 7E-10. Does the use of Si^6 (line 1) raise any major precision issues? – sn6uv Feb 07 '11 at 16:03
  • Hands up - the -ffast-math option is pretty cool. Obviously, compilers have come on a bit since I last looked into it in detail. – Skizz Feb 07 '11 at 16:51
1

Ah, that brings back some memories... I did MD with the Lennard Jones potential years ago.

In my scenario (not huge systems) it was enough to replace the pow() with several multiplications, as suggested by another answer. I also restricted the range of neighbours, effectively truncating the potential at about r ~ 3.5 and applying some standard thermodynamic correction afterwards.

But if all this is not enough for you, I suggest precomputing the function for closely spaced values of r and simply interpolating (linear or quadratic, I'd say).
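
A minimal sketch of such a lookup table with linear interpolation could look like the following (my illustration, not leonbloy's code; R_MIN, R_MAX and TABLE_SIZE are hypothetical, and it assumes r always lies in [R_MIN, R_MAX) and that ljd() is the function from the question):

#define R_MIN      2.0e-10                 /* hypothetical range of r */
#define R_MAX      8.0e-10
#define TABLE_SIZE 4096

static double ljd_table[TABLE_SIZE];
static double table_dr;

static void ljd_table_init(void)           /* call once at start-up */
{
    table_dr = (R_MAX - R_MIN) / (TABLE_SIZE - 1);
    for (int i = 0; i < TABLE_SIZE; i++)
        ljd_table[i] = ljd(R_MIN + i * table_dr);
}

static double ljd_interp(double r)         /* linear interpolation */
{
    double x = (r - R_MIN) / table_dr;
    int    i = (int)x;
    return ljd_table[i] + (x - i) * (ljd_table[i + 1] - ljd_table[i]);
}

Whether the table beats direct evaluation depends on cache behaviour, so it is worth measuring both.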

leonbloy
  • 73,180
  • 20
  • 142
  • 190
1

Is your application structured in such a way that you could profitably vectorise this function, calculating several independent values in parallel? This would allow you to utilise hardware vector units, such as SSE.

It also seems like you would be better off keeping 1/r values around, rather than r itself.


This is an example explicitly using SSE2 instructions to implement the function. ljd() calculates two values at once.

#include <emmintrin.h>  /* SSE2 intrinsics */

static __m128d ljd(__m128d r)
{
    static const __m128d two = { 2.0, 2.0 };
    static const __m128d si = { Si, Si };
    static const __m128d ep24 = { 24 * Ep, 24 * Ep };

    __m128d temp2, temp3;
    __m128d temp = _mm_div_pd(si, r);
    __m128d ep24_r = _mm_div_pd(ep24, r);

    temp = _mm_mul_pd(temp, temp);
    temp2 = _mm_mul_pd(temp, temp);
    temp2 = _mm_mul_pd(temp2, temp);
    temp3 = _mm_mul_pd(temp2, temp2);
    temp3 = _mm_mul_pd(temp3, two);

    return _mm_mul_pd(ep24_r, _mm_sub_pd(temp2, temp3));
}

/* Requires `out` and `in` to be 16-byte aligned */
void ljd_array(double out[], const double in[], int n)
{
    int i;

    for (i = 0; i < n; i += 2)
    {
        _mm_store_pd(out + i, ljd(_mm_load_pd(in + i)));
    }
}

However, it is important to note that recent versions of GCC are often able to vectorise functions like this automatically, as long as you're targeting the right architecture and have optimisation enabled. If you're targeting 32-bit x86, try compiling with -msse2 -O3, and adjust things so that the input and output arrays are 16-byte aligned.

Alignment for static and automatic arrays can be achieved under gcc with the type attribute __attribute__ ((aligned (16))), and for dynamic arrays using the posix_memalign() function.
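
As a concrete illustration of those two options (a sketch with illustrative names, not part of the original answer):

#include <stdlib.h>

#define N 1024

/* static/automatic arrays: align with a type attribute */
static double forces[N] __attribute__ ((aligned (16)));

/* dynamic arrays: align with posix_memalign() */
double *alloc_aligned_doubles(size_t n)
{
    void *p = NULL;
    if (posix_memalign(&p, 16, n * sizeof(double)) != 0)
        return NULL;
    return p;
}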

caf
  • 233,326
  • 40
  • 323
  • 462
  • This function occurs in a highly nested part of my program. It works as follows: at every timestep, for every atom, calculate the force between it and every other atom. When looking at the other atoms, multiple threads are used to accelerate this process. In short, I'm not sure whether this means I can "vectorise the function". – sn6uv Feb 08 '11 at 02:57
  • 1
    @Angus: Then it sounds like you can, because all the calculations at time step `t` are independent of each other. Eg. You could use SSE2 intrinsics like `_mm_mul_pd()` to calculate the force for two `r` values simultaneously. – caf Feb 08 '11 at 03:13
1

The local variable is just fine. It doesn't cost anything. Leave it alone.

As others said, get rid of the pow call. It can't be any faster than simply squaring the number, and it could be a lot slower.

That said, just because the function is active 80+% of the time does not mean it's a problem. It only means that if there is something you can optimize, it's either in that function, in something it calls (like pow), or in something that calls it.

If you try random pausing, which is a method of stack-sampling, you will see that routine on 80+% of samples, plus the lines within it that are responsible for the time, plus its callers that are responsible for the time, and their callers, and so on. All the lines of code on the stack are jointly responsible for the time.

Optimality is not when nothing takes a large percent of time; it is when nothing you can fix takes a large percent of time.

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135