
My first post here. A great site and resource.

I did search a bit and looked at the questions with similar titles, but couldn't find something specifically about this.

I'm trying to remove any redundancy and bloat from a C astronomical calculation library that my C++ program uses. I ran a simple profiler (VerySleepy).

Here is the code that the profiler showed as using the most time (aside from C library functions sprintf, etc.):

double swi_echeb(const double x, const double* const coef, const int ncf)
{
    int j = ncf - 1;
    double x2, br, brp2, brpp;
    x2 = x * 2.;
    br = 0.;
    brp2 = 0.;  /* dummy assign to silence gcc warning */
    brpp = 0.;

    for (; j >= 0; --j) {                 // <-- 0.39s
        brp2 = brpp;                      // <-- 0.01s
        brpp = br;                        // <-- 0.32s
        br = x2 * brpp - brp2 + coef[j];  // <-- 3.49s ***
    }                                     // <-- 0.14s

    return (br - brp2) * .5;              // <-- 0.06s
}                                         // <-- 0.05s

This particular function is deeply embedded within others and the main "kick-off" function that my program calls is called thousands of times.

You can see the standout statement at 3.49s, much higher than all the other statement times. I know there are ways to speed up C arithmetic, such as using multiplication instead of division when possible (a quick sketch of that trick follows my question below). But I don't know much more than that.

Like:

  1. Would it be better to split this statement up into smaller pieces?:

    br = x2 * brpp;

    br -= brp2;

    br += coef[j];
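(The multiplication-over-division trick mentioned above, as a quick untested sketch; scale_by_third is a name I made up, and the result can differ in the last bit from a true division:)

/* Multiply by a precomputed reciprocal instead of dividing. */
void scale_by_third(double *a, int n)
{
    const double third = 1.0 / 3.0;  /* reciprocal folded at compile time */
    int i;
    for (i = 0; i < n; ++i)
        a[i] *= third;  /* one multiply per element instead of a[i] /= 3.0 */
}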

Any other ideas or critiques? I did not write this code, though I did add the const to the function parameters, as I love const-correctness.

I've never tried using registers or other fancy tricks to speed things up before. Anyone think something like that can work here?

I know people will say, "Try it!" So I will, and will update what I get if it helps anyone with similar arithmetic questions.

EDIT: Posting Results I've Tested From Suggestions

In order from fastest to slowest, here is what I've found so far. Profiler is VerySleepy. Compiler is Visual Studio 2008 Pro Ed. Compile options for both the library and my application are:

Debug, C7 format, /O2 /Ob2 /Oi /Ot /Oy /GT /GL /GF /FD /MTd /GS- /Gy /fp:fast /FAs

The following is Andrew's suggestion about doing "4 iterations per loop". It was the fastest so far.

TOTAL TIME spent in function (times from the other statements in the function are not shown here) = 2.08 seconds

for (; index >= 3; index -= 4) {                    // 0.02s
    brp2    = brpp;
    brpp    = br;                                   // 0.02s
    br      = x2 * brpp - brp2 + coef[index];       // 0.25s
    brp2    = brpp;
    brpp    = br;                                   // 0.13s
    br      = x2 * brpp - brp2 + coef[index - 1];   // 0.33s
    brp2    = brpp;
    brpp    = br;                                   // 0.13s
    br      = x2 * brpp - brp2 + coef[index - 2];   // 0.34s
    brp2    = brpp;
    brpp    = br;                                   // 0.14s
    br      = x2 * brpp - brp2 + coef[index - 3];   // 0.42s
}

for (; index >= 0; --index) {                 // 0.03s
    brp2    = brpp;                           // 0.03s
    brpp    = br;
    br      = x2 * brpp - brp2 + coef[index]; // 0.11s
}

The next fastest was the original, unaltered code, with a total time of 2.39 seconds inside the function, again including the statements outside the loop. Note that this is lower than the time in my original post: that measurement was from an unoptimized build, but since everyone suggested it, all of my subsequent tests were built as optimized as I could get in VS08:

for (j = ncf - 1; j >= 0; j--) {      // 0.02s
    brp2 = brpp;                      // 0.03s
    brpp = br;                        // 0.07s
    br = x2 * brpp - brp2 + coef[j];  // 2.14s
}

After this original code, the next fastest was Drew's idea of setting the pointer in advance and using that. Total time spent inside function was 2.49 seconds, including times from statements outside loop:

for (; index >= coef; --index) {         // 0.01s
    brp2    = brpp;
    brpp    = br;                        // 0.06s
    br      = x2 * brpp - brp2 + *index; // 2.24s
}

I also tried a mix of both Andrew's loop unrolling and Drew's pointer usage, but that took 2.39 seconds, the same as the unaltered code.

Based on the results, the loop-unrolling is the way to go so far for my usage.

Ned
  • ...and you've tried the optimizer at O3? – Mark Elliot Dec 31 '11 at 19:36
  • What is the range of sizes of ncf? – Chris A. Dec 31 '11 at 19:37
  • Looks like you have some form of [Horner's Method](http://en.wikipedia.org/wiki/Horner_scheme) going on? – Mysticial Dec 31 '11 at 19:45
  • The prototype should be `swi_echeb(double x, const double *coef, int ncf)`. The additional `const` qualifiers do NOT improve the const-correctness of your code, they will merely serve to startle and annoy people who read your code. – Dietrich Epp Dec 31 '11 at 19:45
  • Lots of good suggestions here so far. There's unrolling the loop, forward-incrementing the index, using the `-O3` optimization flag, etc. Could you test each of these independently for performance and report back with how much impact each had? That kind of information would be very useful for posterity. – chrisaycock Dec 31 '11 at 19:57
  • @MarkElliot: Yes optimized, though I'm using Visual Studio 08 now, with /O2, /Ot, /Oy, etc. Though the first time I tried this, what I posted above, was not optimized. I'm new to profiling in general and so am learning about what to do and not to do. So I will optimize for profiling from now on in general. – Ned Dec 31 '11 at 19:57
  • @Mysticial: Could be Horner's Method. I'm not a math guy myself (aside from what I need for software dev), but the comments above this function do say: "Evaluates a given chebyshev series coef[0..ncf-1] with ncf terms at x in [-1,1]. Communications of the ACM, algorithm 446, April 1973 (vol. 16 no.4) by Dr. Roger Broucke.". – Ned Dec 31 '11 at 19:59
  • @chrisaycock: That's a good idea. I will do that. – Ned Dec 31 '11 at 20:01
  • @DietrichEpp: A good idea. Sometimes I'm divided over this though. Even though altering primitive copies can't affect the caller's original values, I wonder if the const helps in case the coder accidentally tries to alter one of them; the resulting compiler error will let them know that they shouldn't be altering an input-only parameter. – Ned Dec 31 '11 at 20:04
  • @user11234607, take a look at this document: edp.org/work/Construction.pdf It's specific to FFT implementations on Altivec processors but has some very interesting performance tuning tips which are applicable across compilers and CPU architectures. – Dr. Andrew Burnett-Thompson Dec 31 '11 at 20:10
  • @DietrichEpp: Agreed that they should be in the prototype (forward declaration). But we see only the definition, and using `const` on local variables, including parameters, is a good practice. – Ben Voigt Dec 31 '11 at 20:11
  • See also: http://en.wikipedia.org/wiki/Clenshaw_algorithm – orlp Dec 31 '11 at 20:31
  • @ChrisA.: see the comment below under JSPerfUnkn0wn's answer. – Ned Dec 31 '11 at 20:32
  • If you don't have it enabled, can you set the arch:sse2 option in vs2008. This is available in the properties window as special instruction set sse2. In addition, did you try register preloading? – Dr. Andrew Burnett-Thompson Jan 01 '12 at 08:16
  • @user1124607, any update on this? I'm interested to know how your experiments have worked out. Regards, – Dr. Andrew Burnett-Thompson Jan 05 '12 at 13:23

10 Answers


First thing I would try would be to iterate in steps of 4, i.e. j += 4 (or in your case, j -= 4), and semi-unroll the loop. The reason for this is that it will help the compiler make SSE optimisations and batch memory access from main memory to cache. Just be aware that you will have to cater for the last few elements in case the loop count is not divisible by 4; a scalar cleanup loop at the end handles them. For example:

// Disclaimer: I have not tested this code!
for (; j >= 3; j -= 4) {              
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j]; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j-1]; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j-2]; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef[j-3]; 
}                          
// Clean up the 0-3 remaining elements (j is now between -1 and 2):
for (; j >= 0; --j) {
    brp2 = brpp;
    brpp = br;
    br = x2 * brpp - brp2 + coef[j];
}

Second thing I would try would be to preload coef[j] into a register immediately prior to the calculation. The reason for this is that floating point calculations are pipelined, meaning that a memory access in the wrong place can have adverse effects on performance. The calculation itself can be very fast, but it might take 14 instructions just to queue up the data from cache into the FPU. Add an access from main memory on top of that and it can get even worse. For instance, try this (could also be tried with and without the -= 4 unrolling):

// Disclaimer: I have not tested this code!
register double coef1, coef2, coef3, coef4;
for (; j >= 3; j -= 4) {           
    coef1 = coef[j];    // Preloads the 4 sequential coeffs from 
    coef2 = coef[j-1];  // main memory to cache (if available)
    coef3 = coef[j-2];  
    coef4 = coef[j-3];  
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef1; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef2; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef3; 
    brp2 = brpp;                      
    brpp = br;                        
    br = x2 * brpp - brp2 + coef4; 
} 

In this case the variables double x2, br, brp2, brpp, coef1, coef2, coef3, coef4 should be registers if at all possible.

Finally, using the above, can you apply SSE/SSE2 optimisation to it? Make sure this is enabled in the GCC compiler (I'm used to VS, so the equivalent would be Release mode on, debug symbols off, optimization on, SSE2 on) and benchmark your code without the debugger attached. This alone can have a dramatic effect on performance.

Let us know the results. Performance tuning is trial and error!

Best of luck,

Dr. Andrew Burnett-Thompson
  • Thanks. I'd love to try all of these suggestions and get it down on paper for future reference. I'm starving and going to pick up some food, but will do so this afternoon. – Ned Dec 31 '11 at 20:19
  • Hehe, performance tuning will make you go insane! As others are suggesting, algorithmic improvements will almost always beat implementation tuning. Is there something you can do to reduce the memory access or number of multiplies (or even the dataset being operated on)? From that pdf I posted in the comment to your Q, note he chooses implementations that fit entirely in the cache and unrolls them. The result is a huge improvement! – Dr. Andrew Burnett-Thompson Dec 31 '11 at 20:37
  • Careful... you'll want to replace `j >= 0` with `j >= 3` in both those cases or you'll have negative array indices. – Drew Dormann Dec 31 '11 at 20:51
  • Also, I think both uses of your `coef[j-4]` is a typo. Did you mean `coef[j-3]`? – Drew Dormann Dec 31 '11 at 20:53
  • Yes, a typo! The first (j >= 0) was just bad coding. I'll update my answer. Thanks :) – Dr. Andrew Burnett-Thompson Dec 31 '11 at 20:56
  • +1: one thing though, wouldn't loop unrolling be attempted during the optimization phase? Would it also be helpful to share the disassembly of the compiled code with maximum optimizations too? ... but good call also on preloading – Keldon Alleyne Dec 31 '11 at 22:35
  • Yes it should, but often I've seen good speedups doing the above (or similar). Perhaps the compiler is able to spot the optimisation in some cases but not others? Thinking about this problem some more, I'm wondering if the read from memory is the bottleneck here. The calculation is hardly complex. Perhaps reading a block of N (>32) elements into cache then operating on them at once would be better? How many registers are there on modern x86 CPUs? – Dr. Andrew Burnett-Thompson Dec 31 '11 at 23:27
  • My experience with modern compilers has been that they are quite bad at automatically generating vectorized SSE code (GCC likes to use the *scalar* SSE ops, but that's not the same thing). You may need to organize your data into lumps of four and use the SSE intrinsic functions to get the compiler to issue the right opcodes. – Crashworks Jan 01 '12 at 00:01
  • Loop unrolling won't help much here. The code can't be trivially vectorized, since there are cross loop dependencies. It's also not likely to be memory bound on recent desktop CPUs, since the main memory bandwidth is sufficient to feed a single scalar math pipe, especially one that is likely stalled most of the time, like in this situation. – Crowley9 Jan 01 '12 at 00:07
  • Gosh - just updating here that I've been testing various suggestions, including yours above. I've been running into a lot of walls and unexpected output; when I optimize code and try to keep debug symbols for fixing it, the values often look strange or are unavailable. But I'm getting there. – Ned Jan 01 '12 at 01:31
  • So far out of everything I've tried on this page, your loop idea yields the fastest result by a noticeable margin. The registers idea though, actually increased the time. When I get everything set I will post all results. – Ned Jan 01 '12 at 02:09
  • Told you you'd go insane ;) Trial and error and unexpected results are par for the course. I'm surprised about the registers. Do you have arch:sse2 option enabled? Turn off debug symbols in release mode. You can debug it in debug mode. Keep going. What happens if you unroll to 8? My experience agrees with Crashworks, that compilers are awful at generating sse code and if you want performance tuning, do it yourself! – Dr. Andrew Burnett-Thompson Jan 01 '12 at 08:21
  • So Ebo and myself pointed out the most likely performance limiter and I suggested a way around it - any chance you tried that? – Crowley9 Jan 01 '12 at 09:29

This appears to be a cache issue, not an arithmetic one.

for (; j >= 0; --j) {
    ...
    ... coef[j];
}

You're accessing an array here, and you are decrementing an index to do so. This action can really disrupt the cache-friendly locality inherent in a simple loop.

Is it possible to count forward? I.e.,

for (int i = 0; i <= j; i++) {
    ...
    ... coef[i];
}

Would your calculation be valid?
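As Ben Voigt notes in the comments below, plain forward iteration changes the result, but storing the coefficients reversed preserves it. A minimal untested sketch, where swi_echeb_rev and rev are hypothetical names (rev[i] == coef[ncf - 1 - i]):

double swi_echeb_rev(double x, const double *rev, int ncf)
{
    double x2 = x * 2., br = 0., brp2 = 0., brpp = 0.;
    int i;
    /* Walking rev forward visits the coefficients in the same order
       as the original backward loop over coef. */
    for (i = 0; i < ncf; ++i) {
        brp2 = brpp;
        brpp = br;
        br = x2 * brpp - brp2 + rev[i];
    }
    return (br - brp2) * .5;
}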

chrisaycock
  • _Storing_ the array backwards would be worth a try as well - this would allow for _reading_ the array forwards. – bdonlan Dec 31 '11 at 20:10
  • No, that transformation doesn't preserve the result, although one could reverse the `coef` array. But reverse-iteration has the same locality as forward iteration, and the same number of cache line reads. I suppose a really bad predictive caching algorithm might be confused and not prefetch optimally, but I doubt it. – Ben Voigt Dec 31 '11 at 20:14
  • @Ben What predictive algorithm does a modern i7 use? The last time I checked, the x86 archs' cache predictors could work going forwards, but not backwards. This was five years ago, though. – Crashworks Jan 01 '12 at 00:00
  • @Crashworks: I'm not sure. But in any case, prefetching only affects about 1/16 of iterations. Once you have good locality, you should probably be worrying more about things like cache coherency than prefetching. – Ben Voigt Jan 01 '12 at 00:19
  • @Crashworks: Even Core2 could prefetch multiple streams in either direction. – Stephen Canon Jan 01 '12 at 03:54

I've never tried using registers or other fancy tricks to speed things up before. Anyone think something like that can work here?

There's a very easy register trick that anyone can do. Build the project for a recent CPU. Does this code need to run on a computer from 1995? 2000? 2005? If the program can count on a newer CPU, it can count on having more registers at its disposal.

Also, the integer indexing is unnecessary. You could instead make j a pointer directly to the double of interest. This may make a difference if your optimizing compiler isn't already doing it.

double swi_echeb(const double x, const double* const coef, const int ncf)
{
    const double *j = &coef[ncf - 1];
    // (stuff...)

    while (true) { 
        // (stuff...)

        br = x2 * brpp - brp2 + *j;
        if ( j == coef )
            break;
        --j;
    }  
}
Drew Dormann
  • @Dietmar Kühl: I'm afraid I don't follow what you're saying. This example will iterate from `coef[ncf-1]` to `coef[0]`, inclusive. – Drew Dormann Jan 01 '12 at 02:33
  • Technically, Dietmar is right (a pointer one-beyond is acceptable, a pointer one-before not). Could be fixed by adding a `if (j==coef) break;` at the end of the loop. This may seem ugly, but the compiler will probably combine the two loop conditions anyway. – wildplasser Jan 01 '12 at 13:52
  • @wildplasser: So, in the C standard, the very existence of an invalid pointer is illegal? Even if it's not dereferenced/used? Isn't a NULL pointer an invalid pointer? Special case? – Drew Dormann Jan 01 '12 at 15:12
  • Yes. a *valid* pointer is either a NULL pointer, or it points to a valid object (such as an array element), or -in the case of the array element- one-beyond the array. A null pointer or a one-beyond pointer can not be dereferenced, only compared to other *valid* pointers, or to the NULL pointer constant. – wildplasser Jan 01 '12 at 17:09
  • I'm confused - isn't this going to finish with j == coef, regardless? Or has the code been changed since these comments were made? – Crowley9 Jan 01 '12 at 18:13
4

The main 'problem' with that code is that you have a critical path along br. You cannot begin to calculate the next iteration before you have completely finished the previous one. This also prohibits the use of vector instructions: there is nothing to vectorize.

I have the impression that the number of coefficients is always rather small (single digit?) and that the runtime stems from the number of calls to that function.

One way to mitigate that is to evaluate multiple polynomials at once. Of course this depends on a special layout of your data structures: the coefficients of a certain degree have to be in a linear array, so they can be loaded by a single vector instruction.
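A rough, untested sketch of that layout (NPOLY, swi_echeb_multi, and the transposed coefs array are illustrative names, not the library's): the degree-j coefficients of NPOLY series sit next to each other, so the inner loop carries NPOLY independent recurrences that a vectorizer can map onto SIMD lanes.

#define NPOLY 4

/* Evaluate NPOLY Chebyshev series at the same x; coefs[j][p] is the
   degree-j coefficient of series p. */
void swi_echeb_multi(double x, const double coefs[][NPOLY], int ncf,
                     double out[NPOLY])
{
    double x2 = x * 2.;
    double br[NPOLY] = {0}, brp2[NPOLY] = {0}, brpp[NPOLY] = {0};
    int j, p;
    for (j = ncf - 1; j >= 0; --j) {
        for (p = 0; p < NPOLY; ++p) {  /* independent across p */
            brp2[p] = brpp[p];
            brpp[p] = br[p];
            br[p] = x2 * brpp[p] - brp2[p] + coefs[j][p];
        }
    }
    for (p = 0; p < NPOLY; ++p)
        out[p] = (br[p] - brp2[p]) * .5;
}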

ebo

Well, unless there are some special issues -- like is your array of coefficients big enough that you could be swapping? -- you're probably pretty close.

  • Andrew's notion of loop unrolling should be tried.
  • Definitely make sure you have the optimization turned up with -O3

After that, you're going to need to look at the assembly language, or parallelism.

Charlie Martin

This operation is a slight variation of a prefix sum/scan. (It's a 3-op, 2-history scan). The key limiter to performance here is more than likely the serialization (of your math ops in the instruction pipe) caused by the cross loop dependencies, so serial loop unrolling is unlikely to help much here.

There are standard ways to parallelize prefix sums (see Wikipedia) that could be used to accelerate this. Even with one thread, you would be able to greatly improve your efficiency by subdividing the coefficient array into 4 sub-arrays and computing the prefix sum for each of them per loop iteration - the four streams of computation are independent and will be properly pipelined by your hardware. Furthermore, since they are independent, you (or your compiler if you are lucky, but I doubt it) can utilize SSE or AVX on an x86 to process the array in parallel.

Once you have your four accumulated results (the results will likely be pairs since you have a 2-history prefix sum), you can combine them in the mathematically appropriate way for your sequence.
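For concreteness, here is a hedged, untested sketch of one way to make that combination work (all names here are mine, not the library's). Each iteration is an affine update of the state (br, brpp, brp2), and affine maps compose, so disjoint coefficient segments can each be reduced to a single map independently and combined afterwards. The bookkeeping turns each step into a 3x3 matrix product, so it only pays off if the segment reductions genuinely run in parallel (threads or SIMD lanes), or if the structure of the step matrix is exploited:

typedef struct { double m[3][3], v[3]; } affine_t;

/* Reduce the steps for coef[lo..hi] (processed from hi down to lo)
   to a single affine map s -> m*s + v on the state (br, brpp, brp2). */
static affine_t segment_map(double x2, const double *coef, int hi, int lo)
{
    affine_t a = {{{1,0,0},{0,1,0},{0,0,1}}, {0,0,0}};  /* identity */
    affine_t s = {{{0,-1,0},{1,0,0},{0,1,0}}, {0,0,0}}; /* one step */
    affine_t r;
    int i, j, k;
    s.m[0][0] = x2;
    for (j = hi; j >= lo; --j) {
        s.v[0] = coef[j];
        for (i = 0; i < 3; ++i) {  /* r(x) = s(a(x)) */
            for (k = 0; k < 3; ++k)
                r.m[i][k] = s.m[i][0]*a.m[0][k] + s.m[i][1]*a.m[1][k]
                          + s.m[i][2]*a.m[2][k];
            r.v[i] = s.m[i][0]*a.v[0] + s.m[i][1]*a.v[1]
                   + s.m[i][2]*a.v[2] + s.v[i];
        }
        a = r;
    }
    return a;
}

/* Same result as swi_echeb, but via two independent segment maps;
   the two segment_map calls could run concurrently. */
double swi_echeb_split(double x, const double *coef, int ncf)
{
    double x2 = x * 2., t[3], u[3];
    int half = ncf / 2, i;
    affine_t hiseg = segment_map(x2, coef, ncf - 1, half);
    affine_t loseg = segment_map(x2, coef, half - 1, 0);
    /* The start state (br, brpp, brp2) is all zeros, so applying the
       high-index segment first just yields its translation vector. */
    for (i = 0; i < 3; ++i)
        t[i] = hiseg.v[i];
    for (i = 0; i < 3; ++i)  /* then apply the low-index segment */
        u[i] = loseg.m[i][0]*t[0] + loseg.m[i][1]*t[1]
             + loseg.m[i][2]*t[2] + loseg.v[i];
    return (u[0] - u[2]) * .5;
}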

Crowley9
  • +1. In my experience tuned implementations can give a 100% improvement, whereas algorithmic improvements can give 1000%. – Dr. Andrew Burnett-Thompson Jan 01 '12 at 09:33
  • That is often true; however, over here the latency serialization is likely costing 2-3x perf and the lack of vectorization is costing 4x, so you can likely get 1100%. Furthermore, when answering these questions I usually assume that the poster has already exhausted the algorithmic avenues (if not, they should be shot for wasting everyone's time :-)). – Crowley9 Jan 01 '12 at 09:56

What are typical values for ncf? The main reason I ask is that you are iterating coef backwards. Non-sequential access does not make good use of the cache.

Keldon Alleyne
  • NCF seems to vary greatly. Sometimes 1, sometimes -1555361579, sometimes 798114699. It looks all over the place. Would a forward traversal make sense if NCF is sometimes hugely negative? – Ned Dec 31 '11 at 20:07
  • This is sequential access. Same number of cache line misses. – Ben Voigt Dec 31 '11 at 20:15
  • Gah. Have to keep correcting myself. NCF is only negative when it's a dummy/uninitialized value and also when coef is also NULL. Makes sense. – Ned Dec 31 '11 at 20:17
  • @BenVoigt: Please do correct me if I'm wrong, but [Optimizing C++](http://en.wikibooks.org/wiki/Optimizing_C%2B%2B/Writing_efficient_code/Memory_access) does recommend accessing memory in increasing address order. I've not done x86 optimization for donkey's years mind you, I mostly code for embedded (where accessing in reverse fails prefetch), so please forgive me if my understanding is out of sync. Or perhaps I should have mentioned prefetch? – Keldon Alleyne Dec 31 '11 at 22:31
  • @JSPerfUnkn0wn: Modern processors (even many "embedded" processors) are equally good at streaming prefetch in either direction. – Stephen Canon Dec 31 '11 at 23:03
  • @StephenCanon: Not the one I'm working with - says so in their documentation, but I'll take note of what you said for when I have to get intimate with other platforms – Keldon Alleyne Dec 31 '11 at 23:07
  • @JSPerfUnkn0wn: right; I don't mean to imply that all processors can. The most limited CPUs can't do streaming prefetch in either direction, after all. – Stephen Canon Dec 31 '11 at 23:21

This is a case where the profiler finds what you would expect. Look at the code: you've got some loop setup ("oh, that runs once, it won't take up anything") and a loop. Inside the loop, you've got two assignments ("nope, those are cheap") and precisely one line of code that does a multiply, two additions, and an array reference.

You're not going to get this function to run a lot faster with micro-optimizations. The processor is actually spending its time doing the work that you want the function to do. Yeah, I know, shocking.

Your best bet is to go up a level or two. How can you reduce the number of times this function is called? Is it getting called with the same parameters multiple times, so that you can cache the results? Are there places where you can use fewer coefficients, reducing the number of times the loop runs?
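If the call pattern really does repeat arguments, a hedged sketch of the simplest possible cache (one entry, remembering only the previous call; swi_echeb_cached is a made-up name, and it assumes the coef array's contents don't change between calls):

double swi_echeb(double x, const double *coef, int ncf);  /* the existing function */

double swi_echeb_cached(double x, const double *coef, int ncf)
{
    /* Single-entry cache; not thread-safe. */
    static double last_x, last_result;
    static const double *last_coef;
    static int last_ncf, valid;

    if (valid && x == last_x && coef == last_coef && ncf == last_ncf)
        return last_result;  /* hit: skip the whole Clenshaw loop */
    last_x = x;
    last_coef = coef;
    last_ncf = ncf;
    last_result = swi_echeb(x, coef, ncf);
    valid = 1;
    return last_result;
}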

mjfgates

If you're going with Drew's suggestion of using a pointer rather than array indexing, it might be necessary to restrict the pointer to see any benefit:

...
double *restrict index = coef + ncf - 1;
...
for (; index >= coef; --index) {
    brp2    = brpp;
    brpp    = br;
    br      = x2 * brpp - brp2 + *index;
}

This might help cache optimization, because the compiler can be certain that nobody will change the value that index is pointing to.


Also, I posted a similar problem last year, which got a number of great answers. Be sure to have a look.

Thomas

Do you need the full precision of a double? Moving to float instead could save some time.

Although the optimizer should figure it out, adding an explicit register hint to your variable declarations couldn't hurt:

register double x2, br, brp2, brpp;

You could also try moving coef into a register:

register const double *rc;  /* const, since coef points to const doubles */
rc = coef;
. . .
br = x2 * brpp - brp2 + rc[j];

I don't know about this case in particular, but I have seen a surprising number of cases where the compiler doesn't correctly optimize compound expressions. You may be able to increase the odds of it doing the right thing by breaking it down to simple two-component expressions:

brp2 = brpp;     
brpp = br;
br = x2 * brpp;
br -= brp2;
br += rc[j];

You might have a look at the generated code, to see if there are any obvious inefficiencies.

RickNZ