My first post here. A great site and resource.
I did search a bit and looked at the questions with similar titles, but couldn't find something specifically about this.
I'm trying to remove any redundancy and bloat from a C astronomical calculation library that my C++ program uses. I ran a simple profiler (VerySleepy).
Here is the code that the profiler showed as using the most time (aside from C library functions sprintf, etc.):
double swi_echeb(const double x, const double* const coef, const int ncf)
{
int j = ncf - 1;
double x2, br, brp2, brpp;
x2 = x * 2.;
br = 0.;
brp2 = 0.; /* dummy assign to silence gcc warning */
brpp = 0.;
for (; j >= 0; --j) { // <-- 0.39s
brp2 = brpp; // <-- 0.01s
brpp = br; // <-- 0.32s
br = x2 * brpp - brp2 + coef[j]; // <-- 3.49s ***
} // <-- 0.14s
return (br - brp2) * .5; // <-- 0.06s
} // <-- 0.05s
This particular function is deeply embedded within others and the main "kick-off" function that my program calls is called thousands of times.
You can see the standout statement with 3.49s as much higher than all the other statement times. I know there are ways to speed C arithmetic with using multiplication over division when possible. But I don't know much more than that.
Like:
Would it be better to split this statement up into smaller pieces?:
br = x2 * brpp;
br -= brp2;
br += coef[j];
Any other ideas or critiques. I did not write this code, though I did add the const to the function parameters as I love const-correctness.
I've never tried using registers or other fancy tricks to speed things up before. Anyone think something like that can work here?
I know people will say, "Try it!" So I will, and will update what I get if it helps anyone with similar arithmetic questions.
EDIT: Posting Results I've Tested From Suggestions
In order from fastest to slowest, here are what I've found so far. Profiler is VerySleepy. Compiler is Visual Studio 2008 Pro Ed. Compile options for both library and my application are:
Debug, C7 format, /O2 /Ob2 /Oi /Ot /Oy /GT /GL /GF /FD /MTd /GS- /Gy /fp:fast /FAs
The following is Andrew's suggestion about doing "4 iterations per loop". It was the fastest so far.
TOTAL TIME spent in function (times from the other statements in the function are not shown here) = 2.08 seconds
for (; index >= 3; index -= 4) { // 0.02s
brp2 = brpp;
brpp = br; // 0.02s
br = x2 * brpp - brp2 + coef[index]; // 0.25s
brp2 = brpp;
brpp = br; // 0.13s
br = x2 * brpp - brp2 + coef[index - 1]; // 0.33s
brp2 = brpp;
brpp = br; // 0.13s
br = x2 * brpp - brp2 + coef[index - 2]; // 0.34s
brp2 = brpp;
brpp = br; // 0.14s
br = x2 * brpp - brp2 + coef[index - 3]; // 0.42s
}
for (; index >= 0; --index) { // 0.03s
brp2 = brpp; // 0.03s
brpp = br;
br = x2 * brpp - brp2 + coef[index]; // 0.11s
}
The next fastest was the original unaltered code, with a total time of 2.39 seconds inside the function, again including the statements outside the loop. Note that this is less than my original post. My original post was unoptimized code, but since everyone suggested it, all of my tests were subsequently as optimized as I could get in VS08:
for (j = ncf - 1; j >= 0; j--) { // 0.02s
brp2 = brpp; // 0.03s
brpp = br; // 0.07s
br = x2 * brpp - brp2 + coef[j]; // 2.14s
}
After this original code, the next fastest was Drew's idea of setting the pointer in advance and using that. Total time spent inside function was 2.49 seconds, including times from statements outside loop:
for (; index >= coef; --index) { // 0.01s
brp2 = brpp;
brpp = br; // 0.06s
br = x2 * brpp - brp2 + *index; // 2.24s
}
I also tried a mix of both Andrew's loop unrolling and Drew's pointer usage, but that took 2.39 seconds, the same as the unaltered code.
Based on the results, the loop-unrolling is the way to go so far for my usage.