92

I am developing some engineering simulations. This involves implementing some long equations, such as this one, which calculates stress in a rubber-like material:

T = (
    mu * (
            pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
            * (
                pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
                - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
            ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l1
            - pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1
            - pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l1 / 0.3e1
        ) / a
    + K * (l1 * l2 * l3 - 0.1e1) * l2 * l3
) * N1 / l2 / l3

+ (
    mu * (
        - pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1
        + pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
        * (
            pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
            - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
        ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l2
        - pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l2 / 0.3e1
    ) / a
    + K * (l1 * l2 * l3 - 0.1e1) * l1 * l3
) * N2 / l1 / l3

+ (
    mu * (
        - pow(l1 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1
        - pow(l2 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a / l3 / 0.3e1
        + pow(l3 * pow(l1 * l2 * l3, -0.1e1 / 0.3e1), a) * a
        * (
            pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
            - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1
        ) * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) / l3
    ) / a
+ K * (l1 * l2 * l3 - 0.1e1) * l1 * l2
) * N3 / l1 / l2;

I use Maple to generate the C++ code to avoid mistakes (and save time with tedious algebra). As this code is executed thousands (if not millions) of times, the performance is a concern. Unfortunately the math only simplifies so far; the long equations are unavoidable.

What approach can I take to optimize this implementation? I'm looking for high-level strategies that I should be applying when implementing such equations, not necessarily specific optimizations for the example shown above.

I'm compiling using g++ with --enable-optimize=-O3.

Update:

I know there are a lot of repeated expressions. I am assuming that the compiler will handle these; my tests so far suggest it does.

l1, l2, l3, mu, a, K are all positive real numbers (not zero).

I have replaced l1*l2*l3 with an equivalent variable: J. This did help improve performance.

Replacing pow(x, 0.1e1/0.3e1) with cbrt(x) was a good suggestion.

This will be run on CPUs. In the near future this would likely run better on GPUs, but for now that option is not available.
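For reference, the two substitutions described above look like this (a minimal sketch; the variable names `J` and `cbrt_J` and the wrapping function are mine, not from the original code):

```cpp
#include <cmath>

// Sketch of the substitutions above: hoist the repeated product into J
// and compute the cube root once instead of calling pow(J, 1.0/3.0).
double j_to_the_third(double l1, double l2, double l3) {
    const double J = l1 * l2 * l3;       // replaces every occurrence of l1*l2*l3
    const double cbrt_J = std::cbrt(J);  // replaces pow(J, 0.1e1 / 0.3e1)
    // pow(J, -1.0/3.0) then becomes 1.0/cbrt_J, and so on.
    return cbrt_J;
}
```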

TylerH
  • 32
    Well the first thing that comes to mind (unless the compiler will optimize it itself) is to replace all those `pow(l1 * l2 * l3, -0.1e1 / 0.3e1)` with a variable... You need to benchmark your code to be sure whether it runs fast or slow, though. – SingerOfTheFall Oct 02 '15 at 12:36
  • Perhaps doing the calculation `l1 * l2 * l3` once and storing that as another variable might help – Ed Heal Oct 02 '15 at 12:36
  • 6
    Also format the code to make it more readable - may help in identifying possibilities for improvement. – Ed Heal Oct 02 '15 at 12:37
  • Division is expensive. If you do it by hand I would start by converting all the real divisions into a single number, e.g " -0.1e1 / 0.3e1" -> "-0.33333333". Then optimize the `pow` function if needed. – Thane Plummer Oct 02 '15 at 12:38
  • 4
    why the horrible `e1` notation? – Karoly Horvath Oct 02 '15 at 12:40
  • 1
    Above that horrible mess of code, add a note in the comment to the full algorithm, and the maple file that generated the code. That way, if an error is located in the algorithm, you can quickly correct it, and replace the code – user3791372 Oct 02 '15 at 12:48
  • @ThanePlummer - The compile will do those constant divisions for you. – Ed Heal Oct 02 '15 at 12:54
  • fyi... `l1 * l2 * l3` is "calculated" 24 times. `pow(l1 * l2 * l3, -0.1e1 / 0.3e1)` 12 times. `pow(l1 * l2 * l3, -0.4e1 / 0.3e1)` 3 times. – franji1 Oct 02 '15 at 13:01
  • 1
    I have done it by hand, so no guarantee it is correct: http://pastebin.com/umKJnVAx – Simon Kraemer Oct 02 '15 at 13:03
  • 4
    The formatting and notation are generated by maple, which is why it looks so bad. My actual implementation is broken down to make it more readable, although this does nothing for the performance. I know there are a lot of repeated operations here, but I think this will be taken care of by the compiler optimisation. I couldn't get any performance gain from storing the results from the pow operations. –  Oct 02 '15 at 13:15
  • 1
    The most successful optimizations usually come from improving the control flow and data layout of your code. There often isn't much to be gained from optimizing a single mathematical expression. – David Brown Oct 02 '15 at 18:21
  • 4
    @DavidBrown - What you wrote oftentimes is not correct in scientific programming. One poorly written expression can eat your lunch. – David Hammen Oct 02 '15 at 19:30
  • 1
    @Lefti - What is the nature of `l1`, `l2`, and `l3`? If they are positive real numbers (e.g., `float` or `double`) that nasty expression reduces to something much, much simpler. It also reduces a bit (but not as much) if those variables are complex and you are using a complex version of `pow` that calculates the principal value. – David Hammen Oct 02 '15 at 21:19
  • 2
    Of all the hints and answers given here (which are "good", be it at least for increasing readability or bringing other insights), I'd be curious to know which one actually had **noticeable *effects* on performance** ... – Marco13 Oct 02 '15 at 23:20
  • what strange compiling option is that? to optimize you use `g++ -O3` – phuclv Oct 03 '15 at 06:33
  • @LưuVĩnhPhúc - That compilation option doesn't affect floating point expressions. Unless told to do things unsafely, the compiler cannot assume that arithmetic is associative, commutative, and distributive. It cannot assume that `a+b-a` or `a*b/a` are equal to `b`. (This is particularly so with `a*b/a`; what if `a` is zero?) It cannot rearrange expressions. You need to tell the compiler to use `-ffast-math` (gcc, clang) to enable floating point optimizations. – David Hammen Oct 03 '15 at 11:32
  • 1
    @Marco13 - I tested the original expression, the expression in Vlad Feinstein's answer, and the expression in my (final) answer, all using double precision. The first test: Do they agree with the original? The answer is yes, to within an ULP or two, for a wide range of positive values for `l1`, `l2`, and `l3`. In other words, both Vlad's answer and mine are correct. The next tests involved performance. I compiled all three as separate functions with clang 3.5, optimization level `-O2`, and with and without `-ffast-math`. My driver calls those functions 40 million times. (continued) – David Hammen Oct 03 '15 at 18:08
  • 1
    Without `-ffast-math`, Vlad's answer consumes 23% less CPU time than does the original. My answer consumes 50% less CPU time than does the original. Adding `-ffast-math` optimization barely registers with my answer. It makes Vlad's algorithm about 7% faster than without. This option has a significant effect on the original; it's about 20% faster with `-ffast-math` than without. With this optimization, Vlad's answer still consumes 11% less CPU time than does the original; mine, about 37% less CPU time. – David Hammen Oct 03 '15 at 18:11
  • 4
    It seems that your code currently works, and you are looking to improve it. Generally these questions are too opinionated for this site but you might find better luck at [CodeReview.SE](http://codereview.stackexchange.com/tour). Remember to read [their question requirements](http://codereview.stackexchange.com/help/on-topic) as they are more strict than this site. – Kyll Oct 03 '15 at 18:16
  • It's not that they're stricter, just that they're very different, and strongly enforced. If your code is working as intended, and you include the original code in the question, and you aren't asking us to change *what* your code does, it'll be fine. – Kaz Oct 03 '15 at 18:19
  • 27
    Why all the downvotes and votes to close? For those of you who don't like numerical or scientific programming, go look at other questions. This is a good question that is well suited to this site. The scicomp site is still beta; migration there is not a good option. The code review site doesn't get enough scicomp eyes. What the OP did happens quite often in scientific computing: Construct a problem in a symbolic math program, ask the program to generate code, and don't touch the result because the generated code is such a mess. – David Hammen Oct 03 '15 at 19:20
    What is the value of `a`? If it has the chance of ever being simple (integer, integer+1/2, integer+1/3, integer+1/4, etc.), significant optimisations in addition to those already proposed are possible. Similarly, if `l1==l2` or `l1==l3` or `l2==l3` could happen. – Walter Oct 03 '15 at 21:22
  • 5
    Couldn't more strongly agree with @DavidHammen. Numerical and scientific computing should not get short shrift on SO. – paisanco Oct 03 '15 at 21:49
  • 6
    @DavidHammen *the Code Review site doesn't get enough scicomp eyes* - sounds like a chicken-and-egg problem, and a mindset that isn't helping CR to get any more of such eyes. Same applies to the idea of turning down the scicomp beta site *because it's beta* - if everyone thought like that, the only site to grow would be Stack Overflow. – Mathieu Guindon Oct 03 '15 at 21:58
  • 13
    This question is being discussed on meta [here](http://meta.stackoverflow.com/questions/307327/should-this-specific-hot-question-about-high-level-approaches-to-improve-perform/307338#307338) – NathanOliver Oct 03 '15 at 22:44
  • 2
    You should read: "What Every Computer Scientist Should Know About Floating-Point Arithmetic" (http://docs.oracle.com/cd/E19957-01/806-3568/ncg_goldberg.html). – Micha Wiedenmann Oct 04 '15 at 08:03
  • 2
    If performance is really what you're after, is it possible for you to use the GPU to perform your mathematical calculations in a shader using GLSL? – code_dredd Oct 04 '15 at 11:10
  • 1
    When transferring it to GPU take care to employ double precision! AFAIK shaders widely support only half precision. – math Oct 09 '15 at 08:55

10 Answers

88

Edit summary

  • My original answer merely noted that the code contained a lot of replicated computations and that many of the powers involved factors of 1/3. For example, pow(x, 0.1e1/0.3e1) is the same as cbrt(x).
  • My second edit was just wrong, and my third extrapolated on this wrongness. This is what makes people afraid to change the oracle-like results from symbolic math programs that start with the letter 'M'. I've struck out those edits and pushed them to the bottom of the current revision of this answer. However, I did not delete them. I'm human. It's easy for us to make a mistake.
  • My fourth edit developed a very compact expression that correctly represents the convoluted expression in the question IF the parameters l1, l2, and l3 are positive real numbers and if a is a non-zero real number. (We have yet to hear from the OP regarding the specific nature of these coefficients. Given the nature of the problem, these are reasonable assumptions.)
  • This edit attempts to answer the generic problem of how to simplify these expressions.

First things first

I use Maple to generate the C++ code to avoid mistakes.

Maple and Mathematica sometimes miss the obvious. Even more importantly, the users of Maple and Mathematica sometimes make mistakes. Substituting "oftentimes", or maybe even "almost always", in lieu of "sometimes" is probably closer to the mark.

You could have helped Maple simplify that expression by telling it about the parameters in question. In the example at hand, I suspect that l1, l2, and l3 are positive real numbers and that a is a non-zero real number. If that's the case, tell it that. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions that are not valid in the complex numbers.


How to simplify those big messes from symbolic math programs (this edit)

Symbolic math programs typically provide the ability to supply information about the various parameters. Use that ability, particularly if your problem involves division or exponentiation. In the example at hand, you could have helped Maple simplify that expression by telling it that l1, l2, and l3 are positive real numbers and that a is a non-zero real number. Those symbolic math programs typically assume the quantities at hand are complex. Restricting the domain lets the program make assumptions such as a^x * b^x = (a*b)^x. This holds only if a and b are positive real numbers and x is real; it is not valid in the complex numbers.

Ultimately, those symbolic math programs follow algorithms. Help it along. Try playing with expanding, collecting, and simplifying before you generate code. In this case, you could have collected those terms involving a factor of mu and those involving a factor of K. Reducing an expression to its "simplest form" remains a bit of an art.

When you get an ugly mess of generated code, don't accept it as a truth that you must not touch. Try to simplify it yourself. Look at what the symbolic math program had before it generated code. Look at how I reduced your expression to something much simpler and much faster, and how Walter's answer took mine several steps further. There is no magic recipe. If there were one, Maple would have applied it and given the answer that Walter gave.


About the specific question

You are doing a lot of addition and subtraction in that calculation. You can get in deep trouble if you have terms that nearly cancel one another. You are wasting a lot of CPU if you have one term that dominates over the others.

Next, you are wasting a lot of CPU by performing repeated calculations. Unless you have enabled -ffast-math, which lets the compiler break some of the rules of IEEE floating point, the compiler will not (in fact, must not) simplify that expression for you. It will instead do exactly what you told it to do. At a minimum, you should calculate l1 * l2 * l3 prior to computing that mess.

Finally, you are making a lot of calls to pow, which is extremely slow. Note that several of those calls are of the form (l1*l2*l3)^(1/3). Many of those calls to pow could be performed with a single call to std::cbrt:

l123 = l1 * l2 * l3;
l123_pow_1_3 = std::cbrt(l123);
l123_pow_4_3 = l123 * l123_pow_1_3;

With this,

  • X * pow(l1 * l2 * l3, 0.1e1 / 0.3e1) becomes X * l123_pow_1_3.
  • X * pow(l1 * l2 * l3, -0.1e1 / 0.3e1) becomes X / l123_pow_1_3.
  • X * pow(l1 * l2 * l3, 0.4e1 / 0.3e1) becomes X * l123_pow_4_3.
  • X * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) becomes X / l123_pow_4_3.
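Put together, that precomputation might look like this (a sketch; the struct and function names are illustrative, not from the question):

```cpp
#include <cmath>

// Precompute the shared powers of J = l1*l2*l3 once, then reuse them
// everywhere the generated code calls pow(l1*l2*l3, ...).
struct JPowers {
    double J;      // l1*l2*l3
    double J_1_3;  // J^(1/3)
    double J_4_3;  // J^(4/3)
};

JPowers make_j_powers(double l1, double l2, double l3) {
    JPowers p;
    p.J     = l1 * l2 * l3;
    p.J_1_3 = std::cbrt(p.J);   // one cbrt replaces many pow() calls
    p.J_4_3 = p.J * p.J_1_3;    // J^(4/3) = J * J^(1/3), no pow needed
    return p;
}
```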


Maple did miss the obvious.
For example, there's a much easier way to write

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principal complex root) is to be extracted, the above reduces to

2.0/(3.0 * pow(l1 * l2 * l3, 1.0/3.0))

or

2.0/(3.0 * l123_pow_1_3)

Using cbrt_l123 instead of l123_pow_1_3, the nasty expression in the question reduces to

l123 = l1 * l2 * l3; 
cbrt_l123 = cbrt(l123);
T = 
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);

Always double check, but always simplify as well.


Here are some of my steps in arriving at the above:

// Step 0: Trim all whitespace.
T=(mu*(pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l1/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1+pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l2-pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l2/0.3e1)/a+K*(l1*l2*l3-0.1e1)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1-pow(l2*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a/l3/0.3e1+pow(l3*pow(l1*l2*l3,-0.1e1/0.3e1),a)*a*(pow(l1*l2*l3,-0.1e1/0.3e1)-l1*l2*l3*pow(l1*l2*l3,-0.4e1/0.3e1)/0.3e1)*pow(l1*l2*l3,0.1e1/0.3e1)/l3)/a+K*(l1*l2*l3-0.1e1)*l1*l2)*N3/l1/l2;

// Step 1:
//   l1*l2*l3 -> l123
//   0.1e1 -> 1.0
//   0.4e1 -> 4.0
//   0.3e1 -> 3
l123 = l1 * l2 * l3;
T=(mu*(pow(l1*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l1-pow(l2*pow(l123,-1.0/3),a)*a/l1/3-pow(l3*pow(l123,-1.0/3),a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l2/3+pow(l2*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l2-pow(l3*pow(l123,-1.0/3),a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1*pow(l123,-1.0/3),a)*a/l3/3-pow(l2*pow(l123,-1.0/3),a)*a/l3/3+pow(l3*pow(l123,-1.0/3),a)*a*(pow(l123,-1.0/3)-l123*pow(l123,-4.0/3)/3)*pow(l123,1.0/3)/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 2:
//   pow(l123,1.0/3) -> cbrt_l123
//   l123*pow(l123,-4.0/3) -> pow(l123,-1.0/3)
//   (pow(l123,-1.0/3)-pow(l123,-1.0/3)/3) -> 2.0/(3.0*cbrt_l123)
//   *pow(l123,-1.0/3) -> /cbrt_l123
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T=(mu*(pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1-pow(l2/cbrt_l123,a)*a/l1/3-pow(l3/cbrt_l123,a)*a/l1/3)/a+K*(l123-1.0)*l2*l3)*N1/l2/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l2/3+pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2-pow(l3/cbrt_l123,a)*a/l2/3)/a+K*(l123-1.0)*l1*l3)*N2/l1/l3+(mu*(-pow(l1/cbrt_l123,a)*a/l3/3-pow(l2/cbrt_l123,a)*a/l3/3+pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a+K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 3:
//   Whitespace is nice.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)*a/l1/3
       -pow(l3/cbrt_l123,a)*a/l1/3)/a
   +K*(l123-1.0)*l2*l3)*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l2/3
       +pow(l2/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)*a/l2/3)/a
   +K*(l123-1.0)*l1*l3)*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)*a/l3/3
       -pow(l2/cbrt_l123,a)*a/l3/3
       +pow(l3/cbrt_l123,a)*a*2.0/(3.0*cbrt_l123)*cbrt_l123/l3)/a
   +K*(l123-1.0)*l1*l2)*N3/l1/l2;

// Step 4:
//   Eliminate the 'a' in (term1*a + term2*a + term3*a)/a
//   Expand (mu_term + K_term)*something to mu_term*something + K_term*something
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +K*(l123-1.0)*l2*l3*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +K*(l123-1.0)*l1*l3*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/(3.0*cbrt_l123)*cbrt_l123/l3))*N3/l1/l2
 +K*(l123-1.0)*l1*l2*N3/l1/l2;

// Step 5:
//   Rearrange
//   Reduce l2*l3*N1/l2/l3 to N1 (and similar)
//   Reduce 2.0/(3.0*cbrt_l123)*cbrt_l123/l1 to 2.0/3.0/l1 (and similar)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  (mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1
       -pow(l2/cbrt_l123,a)/l1/3
       -pow(l3/cbrt_l123,a)/l1/3))*N1/l2/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l2/3
       +pow(l2/cbrt_l123,a)*2.0/3.0/l2
       -pow(l3/cbrt_l123,a)/l2/3))*N2/l1/l3
 +(mu*(-pow(l1/cbrt_l123,a)/l3/3
       -pow(l2/cbrt_l123,a)/l3/3
       +pow(l3/cbrt_l123,a)*2.0/3.0/l3))*N3/l1/l2
 +K*(l123-1.0)*N1
 +K*(l123-1.0)*N2
 +K*(l123-1.0)*N3;

// Step 6:
//   Factor out mu and K*(l123-1.0)
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*(  ( pow(l1/cbrt_l123,a)*2.0/3.0/l1
         -pow(l2/cbrt_l123,a)/l1/3
         -pow(l3/cbrt_l123,a)/l1/3)*N1/l2/l3
      + (-pow(l1/cbrt_l123,a)/l2/3
         +pow(l2/cbrt_l123,a)*2.0/3.0/l2
         -pow(l3/cbrt_l123,a)/l2/3)*N2/l1/l3
      + (-pow(l1/cbrt_l123,a)/l3/3
         -pow(l2/cbrt_l123,a)/l3/3
         +pow(l3/cbrt_l123,a)*2.0/3.0/l3)*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 7:
//   Expand
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu*( pow(l1/cbrt_l123,a)*2.0/3.0/l1*N1/l2/l3
      -pow(l2/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l3/cbrt_l123,a)/l1/3*N1/l2/l3
      -pow(l1/cbrt_l123,a)/l2/3*N2/l1/l3
      +pow(l2/cbrt_l123,a)*2.0/3.0/l2*N2/l1/l3
      -pow(l3/cbrt_l123,a)/l2/3*N2/l1/l3
      -pow(l1/cbrt_l123,a)/l3/3*N3/l1/l2
      -pow(l2/cbrt_l123,a)/l3/3*N3/l1/l2
      +pow(l3/cbrt_l123,a)*2.0/3.0/l3*N3/l1/l2)
 +K*(l123-1.0)*(N1+N2+N3);

// Step 8:
//   Simplify.
l123 = l1 * l2 * l3;
cbrt_l123 = cbrt(l123);
T =
  mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                 + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                 + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
 +K*(l123-1.0)*(N1+N2+N3);


Wrong answer, intentionally kept for humility

Note that this is stricken. It's wrong.

Update

Maple did miss the obvious. For example, there's a much easier way to write

(pow(l1 * l2 * l3, -0.1e1 / 0.3e1) - l1 * l2 * l3 * pow(l1 * l2 * l3, -0.4e1 / 0.3e1) / 0.3e1)

Assuming that l1, l2, and l3 are real rather than complex numbers, and that the real cube root (rather than the principal complex root) is to be extracted, the above reduces to zero. This calculation of zero is repeated many times over.

Second update

If I've done the math right (there is no guarantee that I've done the math right), the nasty expression in the question reduces to

l123 = l1 * l2 * l3; 
cbrt_l123_inv = 1.0 / cbrt(l123);
nasty_expression =
    K * (l123 - 1.0) * (N1 + N2 + N3) 
    - (  pow(l1 * cbrt_l123_inv, a) * (N2 + N3) 
       + pow(l2 * cbrt_l123_inv, a) * (N1 + N3) 
       + pow(l3 * cbrt_l123_inv, a) * (N1 + N2)) * mu / (3.0*l123);

The above assumes that l1, l2, and l3 are positive real numbers.

David Hammen
  • 2
    Well, CSE-elimination *should* work, independent of relaxed semantics (and OP indicated in comments it does). Though of course, if it matters (measured), that should be inspected (generated assembly). Your points about dominating terms, missed formula-simplifications, better specialized functions and the dangers of cancellation are very good. – Deduplicator Oct 02 '15 at 19:32
  • 3
    @Deduplicator - Not with floating point. Unless one enables unsafe math optimizations (e.g., by specifying `-ffast-math` with gcc or clang), the compiler cannot rely on `pow(x,-1.0/3.0)` being equal to `x*pow(x,-4.0/3.0)`. The latter might underflow while the first might not. To be compliant with the floating point standard, the compiler must not optimize that calculation to zero. – David Hammen Oct 02 '15 at 19:40
  • Well, those are quite a lot more ambitious than anything I meant. – Deduplicator Oct 02 '15 at 19:43
  • 1
    @Deduplicator: As I [commented on another answer](http://stackoverflow.com/questions/32907258/performance-when-implementing-long-equations-in-c/32908665?noredirect=1#comment53659284_32908665): You need `-fno-math-errno` for g++ to CSE identical `pow` calls. (Unless maybe it can prove that pow won't need to set errno?) – Peter Cordes Oct 02 '15 at 21:25
  • The flag is `-ffast-math`, not `-ffast_math`. In general, underscores are rare in command-line options. – user253751 Oct 03 '15 at 11:52
  • Why did you stop there? See my answer below for how to simplify further. – Walter Oct 03 '15 at 22:05
  • Thanks for the update on simplifying the expressions as much as possible in the CAS. I actually updated my response as well to suggest this, but yours is much more comprehensive. I usually find that doing those steps results in much better generated code, and it may not even be worth my time to reduce it further. – neocpp Oct 05 '15 at 13:45
  • This is just what I was looking for, thanks. I tried this out and it gave me a faster result on my crude benchmarks. I'll do more testing and apply this approach to my other long equations. –  Oct 05 '15 at 16:15
  • 1
    @Lefti - Do take a look at Walter's answer. It's a good deal faster. There is a potential issue with all of these answers, which is numerical cancellation. Assuming your `N1`, `N2`, and `N3` are non-negative, one of the `2*N_i-(N_j+N_k)` will be negative, one will be positive, and the other will be somewhere in between. This can easily result in numerical cancellation problems. – David Hammen Oct 07 '15 at 22:08
32

First thing to note is that pow is really expensive, so you should get rid of this as much as possible. Scanning through the expression I see many repetitions of pow(l1 * l2 * l3, -0.1e1 / 0.3e1) and pow(l1 * l2 * l3, -0.4e1 / 0.3e1). So I would expect a big gain from pre-computing those:

const double c1 = pow(l1 * l2 * l3, -0.1e1 / 0.3e1);
const double c2 = boost::math::pow<4>(c1);

where I am using the boost pow function.

Furthermore, you have some more pow calls with exponent a. If a is an integer and known at compile time, you can also replace those with boost::math::pow<a>(...) to gain further performance. I would also suggest replacing terms like a / l1 / 0.3e1 with a / (l1 * 0.3e1), as multiplication is faster than division.
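boost::math::pow<N> effectively unrolls a compile-time integer exponent into plain multiplications; a hand-rolled equivalent (a sketch, requires C++17) looks like:

```cpp
// Compile-time integer power by squaring: the recursion is resolved at
// compile time, so the call unrolls into a handful of multiplications
// instead of a general-purpose (and slow) std::pow call.
template <int N>
double ipow(double x) {
    static_assert(N >= 0, "this sketch handles non-negative exponents only");
    if constexpr (N == 0) {
        return 1.0;
    } else if constexpr (N % 2 == 0) {
        const double half = ipow<N / 2>(x);
        return half * half;
    } else {
        return x * ipow<N - 1>(x);
    }
}
```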

Finally, if you use g++ you can use the -ffast-math flag, which allows the optimizer to be more aggressive in transforming equations. Read up on what this flag actually does first, though, as it has side effects.

mariomulansky
  • 5
    In our code, using `-ffast-math` leads the code to go unstable or give flat out wrong answers. We have a similar issue with Intel compilers and have to use the `-fp-model precise` option, otherwise the code either blows up or gives the wrong answers. So `-ffast-math` could speed it up, but I would recommend proceeding very cautiously with that option, in addition to the side-effects listed in your linked question. – tpg2114 Oct 02 '15 at 17:47
  • 2
    @tpg2114: [According to my testing, you only need `-fno-math-errno`](http://stackoverflow.com/questions/2940367/what-is-more-efficient-using-pow-to-square-or-just-multiply-it-with-itself/2940800#comment53658268_2940800) for g++ to be able to hoist identical calls to `pow` out of a loop. That's the least "dangerous" part of -ffast-math, for most code. – Peter Cordes Oct 02 '15 at 20:48
  • 1
    @PeterCordes Those are interesting results! We've also had issues with `pow` [being extremely slow](http://stackoverflow.com/questions/9272155/replacing-extrordinarily-slow-pow-function) and ended up using the `dlsym` hack mentioned in the comments to get considerable performance increases when we could actually do with a little less precision. – tpg2114 Oct 02 '15 at 21:08
  • Wouldn't GCC understand that pow is a pure function? That's probably built-in knowledge. – usr Oct 02 '15 at 22:13
  • 6
    @usr: That's just the point, I think. `pow` is *not* a pure function, according to the standard, because it is supposed to set `errno` in some circumstances. Setting flags such as `-fno-math-errno` cause it to *not* set `errno` (thus violating the standard), but then it is a pure function and can be optimized as such. – Nate Eldredge Oct 03 '15 at 04:30
  • @NateEldredge: I don't understand why writing to `errno` should affect the whether `pow` can be factored out. The program never reads `errno` in between the calls, so in both cases, all of those are writes to `errno` except the last one are dead... and the last write would be the same as what you would get with a single write. I really don't think it would change the semantics if the compiler factored it out; I think the compiler is just too stupid to do it. – user541686 Oct 04 '15 at 19:05
  • @Mehrdad I'm just speculating here, but maybe it's because `errno` needs to be set on the compacted result, namely the 32 bits `float` or the 64 bits `double`, rather than on the value in the machine precision floating point registers. – MicroVirus Oct 05 '15 at 09:39
  • @MicroVirus: I still don't understand. As long as the resulting `errno` is a function of the inputs, it doesn't matter what it is. Setting it multiple times will be the same as setting it the last time. – user541686 Oct 05 '15 at 09:59
  • @Mehrdad True, but it looks like nobody wrote optimizer code to detect this situation and enable `-fno-math-errno` automatically :( – Navin Oct 30 '15 at 06:15
20

Woah, what a hell of an expression. Creating the expression with Maple actually was a suboptimal choice here. The result is simply unreadable.

  1. Choose speaking variable names (not l1, l2, l3, but e.g. height, width, depth, if that is what they mean). Then it is easier for you to understand your own code.
  2. Calculate subterms that you use multiple times up front and store the results in variables with speaking names.
  3. You mention that the expression is evaluated very many times. I guess only a few parameters vary in the innermost loop. Calculate all invariant subterms before that loop. Repeat for the second inner loop and so on until all invariants are outside the loop.

Theoretically the compiler should be able to do all of that for you, but sometimes it can't - e.g. when the loop nesting spreads over multiple functions in different compilation units. Anyway, that will give you much more readable, understandable, and maintainable code.
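A minimal illustration of point 3 (the function, names, and loop structure are mine, not from the question):

```cpp
#include <cmath>
#include <vector>

// Anything that does not depend on the loop variable is computed once,
// before the loop, instead of being recomputed on every iteration.
double total_stress(const std::vector<double>& strains, double mu, double a) {
    const double invariant = mu * std::pow(3.0, a);  // loop-invariant: hoisted
    double sum = 0.0;
    for (double s : strains)
        sum += invariant * s;  // only the varying work remains inside
    return sum;
}
```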

cdonat
  • 8
    "the compiler should do it, but sometimes it doesn't", is key here. besides readability, of course. – Javier Oct 02 '15 at 13:52
  • 3
    If the compiler is not required to do something, then assuming that is almost always wrong. – edmz Oct 02 '15 at 17:17
  • 4
    Re *Choose speaking variable names* -- A lot of times that nice rule does not apply when you're doing math. When looking at code that is supposed to implement an algorithm in a scientific journal, I'd much rather see the symbols in the code be exactly those used in the journal. Typically, that means extremely short names, possibly with a subscript. – David Hammen Oct 02 '15 at 19:07
  • 8
    "The result is simply unreadable" - why is that an issue? You wouldn't care that the high level language output from a lexer- or parser-generator was "unreadable" (by humans). What matters here is that the *input to the code-generator* (Maple) is readable and checkable. The thing *not* to do is edit the generated code by hand, if you want to be confident it is error-free. – alephzero Oct 02 '15 at 19:21
  • 3
    @DavidHammen: Well, in that case, the one-letter ones *are* the "speaking names". E.g., when doing geometry in a 2-dimensional cartesian coordinate system, `x` and `y` are *not* meaningless single-letter variables, they are whole *words* with a precise definition and a well and widely understood meaning. – Jörg W Mittag Oct 02 '15 at 19:36
  • 1
    @JörgWMittag - Exactly. I've had people complain about my names in code that implements a math paper. The names in the code are the names in the paper. When I'm doing computer science code, my names can be quite long. But when I'm implementing a paper, my names are exactly those in the paper. It makes it easier to inspect the code. – David Hammen Oct 02 '15 at 19:52
  • It's worse than that for common-subexpression-elimination of `pow` calls. [According to my testing, you need `-fno-math-errno`](http://stackoverflow.com/questions/2940367/what-is-more-efficient-using-pow-to-square-or-just-multiply-it-with-itself/2940800#comment53658268_2940800) for g++ to be able to hoist identical calls to `pow` out of a loop. Given how little compilers generally do without most of the other `-ffast-math` options, hoisting common subexpressions yourself sounds like a really good idea. – Peter Cordes Oct 02 '15 at 20:54
  • @PeterCordes - Looking for expressions that are a very complicated way of writing zero is also a good idea. My experience is that an overly long expression generated by Macsyma, Maple, Matlab, Mathematica, etc. is almost always reducible to something much, much simpler. Reducing a complicated mathematical expression to its simplest form is still an art. – David Hammen Oct 02 '15 at 21:09
  • @PeterCordes - If my math is correct (see my answer), there is no way a computer will be able to reduce the expression in the original expression to my simple expression that involves but three calls to `pow` (plus one call to `cbrt`) before I retire. That said, my retirement is at most ten years away. – David Hammen Oct 02 '15 at 21:13
  • Also note that the common subexpression in `x * a / b` and `y * a / b` cannot be eliminated by the compiler, because there is none ;) Only the programmer can know if it is ok to replace `x * a / b` by `x * (a / b)`. – Carsten S Oct 03 '15 at 18:10
18

David Hammen's answer is good, but still far from optimal. Let's continue with his last expression (at the time of writing this)

auto l123 = l1 * l2 * l3;
auto cbrt_l123 = cbrt(l123);
T = mu/(3.0*l123)*(  pow(l1/cbrt_l123,a)*(2.0*N1-N2-N3)
                   + pow(l2/cbrt_l123,a)*(2.0*N2-N3-N1)
                   + pow(l3/cbrt_l123,a)*(2.0*N3-N1-N2))
  + K*(l123-1.0)*(N1+N2+N3);

which can be optimised further. In particular, we can avoid the call to cbrt() and one of the calls to pow() by exploiting some mathematical identities. Let's do this again step by step.

// step 1 eliminate cbrt() by taking the exponent into pow()
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a; // avoid division
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)*pow(l1*l1/(l2*l3),athird)
                   + (N2+N2-N3-N1)*pow(l2*l2/(l1*l3),athird)
                   + (N3+N3-N1-N2)*pow(l3*l3/(l1*l2),athird))
  + K*(l123-1.0)*(N1+N2+N3);

Note that I have also optimised 2.0*N1 to N1+N1 etc. Next, we can make do with only two calls to pow().

// step 2  eliminate one call to pow
auto l123 = l1 * l2 * l3;
auto athird = 0.33333333333333333 * a;
auto pow_l1l2_athird = pow(l1/l2,athird);
auto pow_l1l3_athird = pow(l1/l3,athird);
auto pow_l2l3_athird = pow_l1l3_athird/pow_l1l2_athird;
T = mu/(3.0*l123)*(  (N1+N1-N2-N3)* pow_l1l2_athird*pow_l1l3_athird
                   + (N2+N2-N3-N1)* pow_l2l3_athird/pow_l1l2_athird
                   + (N3+N3-N1-N2)/(pow_l1l3_athird*pow_l2l3_athird))
  + K*(l123-1.0)*(N1+N2+N3);

Since the calls to pow() are by far the most costly operations here, it is worth reducing them as far as possible (the next most costly operation was the call to cbrt(), which we eliminated).

If a happens to be an integer, the calls to pow() can be replaced by calls to cbrt() (plus integer powers); if athird is a half-integer, we can use sqrt() (plus integer powers). Furthermore, if l1==l2, l1==l3, or l2==l3, one or both calls to pow() can be eliminated. So it's worth considering these as special cases if they can realistically occur.
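
If the integer case applies, integer powers are cheap via exponentiation by squaring. Here is a minimal sketch (ipow and pow_athird are names of my choosing; whether this actually beats the library pow() on a given platform should be measured):

```cpp
#include <cmath>

// Exponentiation by squaring: O(log n) multiplications, no call to pow().
double ipow(double base, unsigned n) {
    double result = 1.0;
    for (; n; n >>= 1) {
        if (n & 1) result *= base;
        base *= base;
    }
    return result;
}

// pow(x, a/3) for integer a: one cbrt() plus an integer power.
double pow_athird(double x, unsigned a) {
    return ipow(std::cbrt(x), a);
}
```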

Walter
  • @gnat I appreciate your editing (I thought of doing that myself), but would have found it fairer if David's answer also linked on to this one. Why don't you also edit David's answer in a similar way? – Walter Oct 04 '15 at 15:01
  • I only edited because I saw you explicitly mentioning it; I re-read David's answer and couldn't find a reference to your answer over there. I try to avoid edits where it is not 100% clear that stuff I add matches the author's intentions – gnat Oct 04 '15 at 15:33
  • @Walter - My answer now links to yours. – David Hammen Oct 05 '15 at 14:36
  • could the downvoter please tell me what I can improve? – Walter Oct 07 '15 at 17:44
  • It certainly wasn't me. I upvoted your answer a few days ago. I also received a random flyby downvote on my answer. Stuff just happens sometimes. – David Hammen Oct 07 '15 at 22:04
  • You and I have received a paltry downvote each. Look at all the downvotes on the question! As of now, the question has received 16 downvotes. It has also received 80 upvotes that more than offset all those downvoters. – David Hammen Oct 07 '15 at 22:11
12
  1. How many is "many many"?
  2. How long does it take?
  3. Do ALL parameters change between recalculation of this formula? Or can you cache some pre-calculated values?
  4. I've attempted a manual simplification of that formula; I'd like to know if it saves anything:

    C1 = -0.1e1 / 0.3e1;
    C2 =  0.1e1 / 0.3e1;
    C3 = -0.4e1 / 0.3e1;
    
    X0 = l1 * l2 * l3;
    X1 = pow(X0, C1);
    X2 = pow(X0, C2);
    X3 = pow(X0, C3);
    X4 = pow(l1 * X1, a);
    X5 = pow(l2 * X1, a);
    X6 = pow(l3 * X1, a);
    X7 = a / 0.3e1;
    X8 = X3 / 0.3e1;
    X9 = mu / a;
    XA = X0 - 0.1e1;
    XB = K * XA;
    XC = X1 - X0 * X8;
    XD = a * XC * X2;
    
    XE = X4 * X7;
    XF = X5 * X7;
    XG = X6 * X7;
    
    T = (X9 * ( X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
      + (X9 * (-XE + X5 * XD - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
      + (X9 * (-XE - XF + X6 * XD) / l3 + XB * l1 * l2) * N3 / l1 / l2;
    

[ADDED] I've worked some more on the last three-line formula and got it down to this beauty:

T = X9 / X0 * (
      (X4 * XD - XF - XG) * N1 + 
      (X5 * XD - XE - XG) * N2 + 
      (X6 * XD - XE - XF) * N3)
  + XB * (N1 + N2 + N3)

Let me show my work, step by step:

T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3 
  + (X9 * (X6 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / l1 / l2;


T = (X9 * (X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / (l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) / l2 + XB * l1 * l3) * N2 / (l1 * l3) 
  + (X9 * (X6 * XD - XE - XF) / l3 + XB * l1 * l2) * N3 / (l1 * l2);

T = (X9 * (X4 * XD - XF - XG) + XB * l1 * l2 * l3) * N1 / (l1 * l2 * l3) 
  + (X9 * (X5 * XD - XE - XG) + XB * l1 * l2 * l3) * N2 / (l1 * l2 * l3) 
  + (X9 * (X6 * XD - XE - XF) + XB * l1 * l2 * l3) * N3 / (l1 * l2 * l3);

T = (X9 * (X4 * XD - XF - XG) + XB * X0) * N1 / X0 
  + (X9 * (X5 * XD - XE - XG) + XB * X0) * N2 / X0 
  + (X9 * (X6 * XD - XE - XF) + XB * X0) * N3 / X0;

T = X9 * (X4 * XD - XF - XG) * N1 / X0 + XB * N1 
  + X9 * (X5 * XD - XE - XG) * N2 / X0 + XB * N2
  + X9 * (X6 * XD - XE - XF) * N3 / X0 + XB * N3;


T = X9 * (X4 * XD - XF - XG) * N1 / X0 
  + X9 * (X5 * XD - XE - XG) * N2 / X0
  + X9 * (X6 * XD - XE - XF) * N3 / X0
  + XB * (N1 + N2 + N3)
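
As a sanity check, the simplification can be verified numerically against the three-line X-form it started from. This is a sketch (simplification_error is a name of my choosing; note that the N3 term takes X6, the power involving l3):

```cpp
#include <cmath>

// Returns |T_ref - T_simpl|: the three-line X-form vs. the final
// simplified formula, evaluated for one set of inputs.
double simplification_error(double l1, double l2, double l3,
                            double a, double mu, double K,
                            double N1, double N2, double N3) {
    const double X0 = l1 * l2 * l3;
    const double X1 = std::pow(X0, -1.0 / 3.0);
    const double X2 = std::pow(X0,  1.0 / 3.0);
    const double X3 = std::pow(X0, -4.0 / 3.0);
    const double X4 = std::pow(l1 * X1, a);
    const double X5 = std::pow(l2 * X1, a);
    const double X6 = std::pow(l3 * X1, a);
    const double X7 = a / 3.0;
    const double X9 = mu / a;
    const double XB = K * (X0 - 1.0);
    const double XC = X1 - X0 * (X3 / 3.0);
    const double XD = a * XC * X2;
    const double XE = X4 * X7, XF = X5 * X7, XG = X6 * X7;

    const double T_ref =                       // three-line X-form
        (X9 * ( X4 * XD - XF - XG) / l1 + XB * l2 * l3) * N1 / l2 / l3
      + (X9 * (-XE + X5 * XD - XG) / l2 + XB * l1 * l3) * N2 / l1 / l3
      + (X9 * (-XE - XF + X6 * XD) / l3 + XB * l1 * l2) * N3 / l1 / l2;

    const double T_simpl =                     // final simplified form
        X9 / X0 * (  (X4 * XD - XF - XG) * N1
                   + (X5 * XD - XE - XG) * N2
                   + (X6 * XD - XE - XF) * N3)
      + XB * (N1 + N2 + N3);

    return std::fabs(T_ref - T_simpl);
}
```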
Vlad Feinstein
  • That noticeable, huh? :) FORTRAN, IIRC, was designed for efficient formula calculations ("FOR" is for formula). – Vlad Feinstein Oct 02 '15 at 18:48
  • Most F77 codes I've seen looked like that (e.g., BLAS & NR). Very glad Fortran 90->2008 exists :) – Kyle Kanos Oct 02 '15 at 19:24
  • Yes. If you're translating a formula, what better way than FORmulaTRANslation? –  Oct 03 '15 at 20:19
  • Your 'optimisation' attacks the wrong place. The costly bits are the calls to `std::pow()`, of which you still have 6, 3 times more than necessary. In other words, your code is 3 times slower than possible. – Walter Oct 07 '15 at 07:44
7

This may be a little terse, but I've actually found good speedup for polynomials (interpolation of energy functions) by using Horner Form, which basically rewrites ax^3 + bx^2 + cx + d as d + x(c + x(b + x(a))). This will avoid a lot of repeated calls to pow() and stops you from doing silly things like separately calling pow(x,6) and pow(x,7) instead of just doing x*pow(x,6).
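
As an illustration, a minimal Horner evaluator might look like this (a sketch; horner is a name of my choosing, with coefficients given highest power first):

```cpp
#include <vector>

// Evaluate a polynomial with Horner's scheme:
// {a, b, c, d} -> ((a*x + b)*x + c)*x + d
// n multiplications and n additions, no calls to pow().
double horner(const std::vector<double>& coeffs, double x) {
    double result = 0.0;
    for (double c : coeffs)
        result = result * x + c;
    return result;
}
```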

This is not directly applicable to your current problem, but if you have high order polynomials with integer powers it can help. You might have to watch out for numerical stability and overflow issues since the order of operations is important for that (although in general I actually think Horner Form helps for this, since x^20 and x are usually many orders of magnitude apart).

Also as a practical tip, if you haven't done so already, try to simplify the expression in maple first. You can probably get it to do most of the common subexpression elimination for you. I don't know how much it affects the code generator in that program in particular, but I know in Mathematica doing a FullSimplify before generating the code can result in a huge difference.

neocpp
  • The Horner form is pretty standard for coding polynomials and this has no relevance to the question at all. – Walter Oct 05 '15 at 08:00
  • This may be true given his example, but you'll notice he said "equations of this type." I thought the answer would be useful if the poster had any polynomials in his system. I've particularly noticed that code generators for the CAS programs such as Mathematica and Maple tend to NOT give you Horner form unless you specifically ask for it; they default to the way you would typically write a polynomial as a human. – neocpp Oct 05 '15 at 13:29
3

It looks like you have a lot of repeated operations going on.

pow(l1 * l2 * l3, -0.1e1 / 0.3e1)
pow(l1 * l2 * l3, -0.4e1 / 0.3e1)

You could pre-calculate those so you are not repeatedly calling the pow function which can be expensive.

You could also pre-calculate

l1 * l2 * l3

as you use that term repeatedly.
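
A sketch of that hoisting (J, J_m13, and J_m43 are names of my choosing; note that the -4/3 power can even be derived from the -1/3 power without a second pow() call):

```cpp
#include <cmath>

// Hoist the repeated subexpressions so each pow() runs once per evaluation.
struct Precomputed {
    double J;      // l1 * l2 * l3
    double J_m13;  // pow(l1*l2*l3, -1.0/3.0)
    double J_m43;  // pow(l1*l2*l3, -4.0/3.0), derived without a second pow()
};

Precomputed precompute(double l1, double l2, double l3) {
    Precomputed p;
    p.J     = l1 * l2 * l3;
    p.J_m13 = std::pow(p.J, -1.0 / 3.0);
    p.J_m43 = p.J_m13 / p.J;   // x^(-1/3) / x == x^(-4/3)
    return p;
}
```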

NathanOliver
  • I bet the optimizer already does this for you... though it at least makes the code more readable. – Karoly Horvath Oct 02 '15 at 12:40
  • I did this, but it didn't speed things up at all. I figured this was because the compiler optimisation was already taking care of it. –  Oct 02 '15 at 13:10
  • storing l1 * l2 * l3 does speed things up though, not sure why with the compiler optimisation –  Oct 02 '15 at 13:12
  • because the compiler sometimes just can't do some optimizations, or find them in conflict with other options. – Javier Oct 02 '15 at 13:54
  • In fact, the compiler must not do those optimizations unless `-ffast-math` is enabled, and as noted in a comment by @tpg2114, that optimization can create extremely unstable results. – David Hammen Oct 02 '15 at 19:33
0

If you have an NVIDIA CUDA graphics card, you could consider offloading the calculations to the graphics card, which is better suited to computationally complicated calculations.

https://developer.nvidia.com/how-to-cuda-c-cpp

If not, you may want to consider multiple threads for calculations.
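
Since each evaluation of the formula is independent of the others, a large batch of them parallelizes trivially. A sketch with std::thread (compute_T here is a placeholder standing in for the real expression, and evaluate_batch is a name of my choosing):

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <thread>
#include <vector>

// Placeholder for the real stress formula.
double compute_T(double l1, double l2, double l3) {
    return std::pow(l1 * l2 * l3, -1.0 / 3.0);
}

// Split a batch of independent evaluations across hardware threads.
void evaluate_batch(const std::vector<double>& l1s,
                    const std::vector<double>& l2s,
                    const std::vector<double>& l3s,
                    std::vector<double>& out) {
    const std::size_t n = l1s.size();
    const unsigned nthreads = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t)
        workers.emplace_back([&, t] {
            for (std::size_t i = t; i < n; i += nthreads)  // strided partition
                out[i] = compute_T(l1s[i], l2s[i], l3s[i]);
        });
    for (auto& w : workers) w.join();
}
```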

user3791372
  • This answer is orthogonal to the question at hand. While GPUs do have lots and lots of processors, they are quite slow compared to the FPU embedded with the CPU. Performing a single serial calculation with a GPU is a big loss. The CPU has to fill the pipeline to the GPU, wait for the slow GPU to perform that single task, and then unload the result. While GPUs are absolutely fantastic when the problem at hand is massively parallelizable, they are absolutely atrocious when it comes to performing serial tasks. – David Hammen Oct 03 '15 at 18:30
  • In the original question: "As this code is executed many many times the performance is a concern." That's one more than "many". The OP can send the calculations in a threaded manner. – user3791372 Oct 04 '15 at 12:38
0

By any chance, could you supply the calculation symbolically? If there are vector operations, you might really want to investigate using BLAS or LAPACK, which in some cases can run operations in parallel.

It is conceivable (at the risk of being off-topic?) that you might be able to use python with numpy and/or scipy. To the extent that it was possible, your calculations might be more readable.

Fred Mitchell
  • 2,145
  • 2
  • 21
  • 29
0

As you explicitly asked about high-level optimizations, it might be worth trying different C++ compilers. Nowadays, compilers are very complex optimization beasts, and CPU vendors' compilers may implement very powerful, target-specific optimizations. But please note that some of them are not free (though there might be a free academic program).

  • GNU compiler collection is free, flexible, and available on many architectures
  • Intel compilers are very fast, very expensive, and may also produce good results for AMD architectures (I believe there is an academic program)
  • Clang compilers are fast, free, and might produce results similar to GCC's (some people say they are faster or better, but this can differ per application; I suggest running your own experiments)
  • PGI (Portland Group) compilers are not free, like the Intel compilers.
  • PathScale compilers might produce good results on AMD architectures

I've seen code snippets differ in execution speed by a factor of 2, only by changing the compiler (with full optimizations, of course). But be sure to check that the output stays identical: too aggressive optimization can change the results, which is something you definitely want to avoid.

Good Luck!

math