
The code looks like this and the inner loop takes a huge amount of time:

#define _table_derive                  ((double*)(Buffer_temp + offset))
#define Table_derive(m,nbol,pos)        _table_derive[(m) + 5*((pos) + _interval_derive_dIdQ * (nbol))]
char *Buffer_temp=malloc(...);

for (n_bol=0; n_bol<1400; n_bol++) {  // long loop here
    [lots of code here, hundreds of lines with computations on doubles, other loops, etc]

    double ddI=0, ddQ=0;

    // This is the original code
    for(k=0; k< 100; k++ ) {
            ddI += Table_derive(2,n_bol,k);
            ddQ += Table_derive(3,n_bol,k);
    }
    ddI /= _interval_derive_dIdQ;
    ddQ /= _interval_derive_dIdQ;
    [more code here]
}

oprofile tells me that most of the runtime is spent here (2nd column is % of time):

129304  7.6913 :for(k=0; k< 100; k++) {
275831 16.4070 :ddI += Table_derive(2,n_bol,k);
764965 45.5018 :ddQ += Table_derive(3,n_bol,k);

My first question is: can I rely on oprofile to indicate the proper place where the code is slow? (I tried with -Og and -Ofast and it's basically the same.)

My second question is: how come this very simple loop is slower than the sqrt, atan2 and the many hundreds of lines of computations that come before it? I know I'm not showing all the code, but there's a lot of it and it doesn't make sense to me.

I've tried various optimizer tricks to either vectorize (doesn't work) or unroll (works, but for little gain), for instance:

    typedef double aligned_double __attribute__((aligned(8)));
    typedef const aligned_double* SSE_PTR;
    SSE_PTR TD=(SSE_PTR)&Table_derive(2,n_bol,0);   // We KNOW the alignment is correct because offset is a multiple of 8

    for(k=0; k< 100; k++, TD+=5) {
        #pragma Loop_Optimize Unroll No_Vector
        ddI += TD[0];
        ddQ += TD[1];
    }

I've checked the output of the optimizer: "-Ofast -g -march=native -fopt-info-all=missed.info -funroll-loops" and in this case I get "loop unrolled 9 times", but if I try to vectorize, I get (in short): "can't force alignment of ref", "vector alignment may not be reachable", "Vectorizing an unaligned access", "Unknown alignment for access: *(prephitmp_3784 + ((sizetype) _1328 + (long unsigned int) (n_bol_1173 * 500) * 2) * 4)"

Any way to speed this up?

ADDENDUM: Thanks all for the comments, I'll try to answer here:

  • yes, I know the code is ugly (it's not mine), and you haven't seen the actual original (that's a huge simplification)
  • I'm stuck with this array as the C code is in a library, and the large array, once processed and modified by the C, gets passed on to the caller (either IDL, Python or C).
  • I know it would be better to use some structs instead of casting char* to complicated multidimensional double*, but see above. Structs may not have been part of the C spec when this program was first written (just kidding... maybe)
  • I know that for the vectorizer it's better to have structs of arrays than arrays of structs, but, sigh... see above.
  • there's an actual outer loop (in the calling program), so the total size of this monolithic array is around 2 GB
  • as is, it takes about 15 minutes to run with no optimization, and one minute after I rewrote some code (faster atan2, some manual alignments inside the array...) and used -Ofast and -march=native
  • due to changes in the hardware constraints, I'm trying to go faster to keep up with the data flow
  • I tried with Clang and the gains were slight (a few seconds), but I do not see an option to get an optimization report such as -fopt-info. Do I have to look at the assembly as the only option to know what's going on?
  • the system is a beastly 64-core with 500 GB of RAM, but I haven't been able to insert any OpenMP pragmas to parallelize the above code (I've tried): it reads a file, decompresses it entirely into memory (2 GB), analyses it in sequence (things like '+='), and spits out some results to the calling IDL/Python. All on a single core (but the other cores are quite busy with the actual acquisition and post-processing). :(
  • @Useless, thanks for the excellent suggestion: removing ddQ += ... seems to transfer the % of time to the previous line: 376280 39.4835 :ddI += ...
  • which brings us to even better: removing both (hence the entire loop) saves... nothing at all!!! So I guess, as Peter said, I can't trust the profiler. If I profile the loopless program, the timings are spread out more evenly (previously only 3 lines were above 1s, now about 10, all in nonsensical places like simple variable assignments).

I guess that inner loop was a red herring from the start; I'll restart my optimization using manual timings. Thanks.
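For reference, by "manual timings" I mean nothing fancier than bracketing the suspect blocks with a monotonic clock, roughly like this (a sketch, not the actual instrumentation):

    #include <time.h>

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    /* ... block being measured ... */
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double elapsed = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) * 1e-9;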

dargaud
    If the `ddQ` line really takes 45% of your time, just commenting it out should give you a massive speedup. Does it? – Useless Apr 27 '16 at 15:08
  • The vectorizer's output probably relates to the `offset` component in the definition of macro `_table_derive`. If there is a small enough upper bound on the size needed for this array, then it might make sense to use an ordinary local array (of `double`s), or maybe a local VLA. Otherwise, if you can take out the offset, so that `_table_derive` starts at the beginning of `Buffer_temp`, then that should also resolve any uncertainty about alignment. – John Bollinger Apr 27 '16 at 15:14
    Overall, the whole `Buffer_temp` / offset / casting thing has bad code smell. In fact, most casting has bad code smell all by itself. Making it as easy as possible for the compiler to understand what's going on in your code tends to improve the compiler's ability to optimize. – John Bollinger Apr 27 '16 at 15:20
  • `for(k=0; k< 100; k++ ) { ... }` could be re-written to `x = foo(); for(k=100; --k; ) { ddI +=_table_derive[x]; ddQ +=_table_derive[x+1]; x += _interval_derive_dIdQ; }` or something like that. – chux - Reinstate Monica Apr 27 '16 at 15:30
  • The access pattern to `_table_derive` isn't very cache friendly? Is it a giant table that's being paged in on demand? Even if it's all in memory, you're likely suffering lots of cache misses. The omitted code may be more cache friendly and thus is relatively fast. Is the overall code actually slow? Or is it just that the optimizer is pointing to the inner loop and saying most of the time spent is here? – Adrian McCarthy Apr 27 '16 at 18:36
  • If this can auto-vectorize, try `aligned(16)` or 32. (And allocate with `aligned_alloc`) – Peter Cordes Apr 27 '16 at 22:27
  • Thanks. See my additions on the original post. – dargaud Apr 28 '16 at 14:10
  • I share your skepticism of the results. If *oprofile* is like other profilers, you might just get a second opinion. As a sanity check, you might try [*this*](http://stackoverflow.com/a/378024/23771), for which there's a video [*here*](https://www.youtube.com/watch?v=xPg3sRpdW1U). And do it *without compiler optimization*. First tune its guts out manually, then let the optimizer try to shave cycles by scrambling the code. That code-scrambling messes up any tool that deals with lines of source code. – Mike Dunlavey Apr 28 '16 at 16:42

1 Answer


My first question is: can I rely on oprofile to indicate the proper place where the code is slow

Not precisely. As I understand it, cycles often get charged to the instruction that's waiting for inputs (or some other execution resource), not the instruction that's slow to produce inputs or free up whatever other execution resource.

However, in your oprofile output, it's probable that it's actually that final loop. Are there other inner loops inside this outer loop?

Did you profile cache misses? There are counters for many interesting things besides cycles.

Also note that to really understand the performance, you need to look at profile annotations on the asm, not the C. e.g. it's weird that one add accounts for more of the time than the other, but that's probably just an issue of mapping insns to source lines.


re: perf results from commenting out the loop:

So the program doesn't run any faster at all without that inner loop? If the outer loop already touched that memory, maybe you're just bottlenecked on cache misses, and the inner loop was just touching that memory again? Try perf record -e L1-dcache-load-misses ./a.out then perf report. Or the oprofile equivalent.

Maybe the inner-loop uops were stuck waiting to issue until slow stuff in the outer loop retired. The ReOrder Buffer (ROB) size in modern Intel CPUs is around 200 uops, and most insns decode to a single uop, so the out-of-order window is around 200 instructions.

Commenting out that inner loop also means that any loop-carried dependency chains in the outer loop don't have time to complete while the inner loop is running. Removing that inner loop could produce a qualitative change in the bottleneck for the outer loop, from throughput to latency.


re: 15x faster with -Ofast -march=native: OK, that's good. Un-optimized code is horrible and shouldn't be considered any kind of "baseline" for performance. If you want to compare against something, compare against -O2 (which doesn't include auto-vectorization, -ffast-math, or -march=native).

Try using -fprofile-generate / -fprofile-use. -fprofile-use enables -funroll-loops, so I assume unrolling works best when there is profiling data available.

re: auto-parallelization:

You have to enable that specifically, either with OpenMP pragmas or with gcc options like -floop-parallelize-all -ftree-parallelize-loops=4. Auto-parallelization may not be possible if there are non-trivial loop-carried dependencies. That wiki page is old, too, and might not reflect the state-of-the-art in auto-parallelization. I think OpenMP hints about which loops to parallelize are a more sane way to go than having the compiler guess, esp. without -fprofile-use.
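For illustration (my sketch, not from the original answer): an explicit OpenMP hint on the outer loop would look roughly like the following. It needs -fopenmp, and it is only valid if the per-n_bol iterations are actually independent, which the question's description of sequential '+=' processing suggests may not be the case.

    /* Sketch only: assumes each n_bol iteration is independent of the others. */
    #pragma omp parallel for schedule(static)
    for (n_bol = 0; n_bol < 1400; n_bol++) {
        /* ... per-iteration work; ddI/ddQ are locals inside the body,
           so no reduction clause is needed for them ... */
    }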


I tried with Clang and the gains were slight (a few seconds), but I do not see an option to get an optimization report such as -fopt-info. Do I have to look at the assembly as the only option to know what's going on?

The clang manual says you can use clang -Rpass=inline for a report on inlining. The llvm docs say that the name for the vectorization pass is loop-vectorize, so you can use -Rpass-missed=loop-vectorize, or -Rpass-analysis=loop-vectorize to tell you which statement caused vectorization to fail.

Looking at the asm is the only way to know whether it auto-vectorized badly or not, but to really judge the compiler's work you have to know how to write efficient asm yourself (so you know approximately what it could have done.) See http://agner.org/optimize/, and other links in the tag wiki.

I didn't try putting your code on http://gcc.godbolt.org/ to try it with different compilers, but you could post a link if your example makes asm that's representative of what you see from the full source.


Auto-vectorization

for(k=0; k< 100; k++ ) {
        ddI += Table_derive(2,n_bol,k);
        ddQ += Table_derive(3,n_bol,k);
}

This should auto-vectorize, since 2 and 3 are consecutive elements. You would get better cache locality (for this part) if you split up the table into multiple tables. e.g. keep elements 2 and 3 of each group of 5 in one array. Group other elements that are used together into tables. (If there's overlap, e.g. another loop needs elements 1 and 3, then maybe split up the one that can't auto-vectorize anyway?)
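As a sketch of that split (the names and sizes here are made up for illustration; they're not from the original code):

    #include <stdlib.h>   /* aligned_alloc, C11 */

    /* Hypothetical table holding only elements 2 and 3 of each group of 5,
       so every byte of the cache lines this loop touches is actually used. */
    typedef struct { double dI, dQ; } dIdQ_t;
    dIdQ_t *deriv_IQ = aligned_alloc(64, 1400 * 100 * sizeof(dIdQ_t));

    /* ... fill deriv_IQ wherever the original code fills elements 2 and 3 ... */

    for (k = 0; k < 100; k++) {
        ddI += deriv_IQ[n_bol*100 + k].dI;
        ddQ += deriv_IQ[n_bol*100 + k].dQ;
    }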

re: question update: You don't need a struct-of-arrays for this to auto-vectorize with SSE. A 16B vector holds exactly two doubles, so the compiler can accumulate a vector of [ ddI ddQ ] with addpd. With AVX 256b vectors, it would have to do a vmovupd / vinsertf128 to gather two of those pairs from adjacent groups of 5, instead of a single 256b load, but that's not a big deal. Memory locality is an issue, though; you're only using 2 out of every 5 doubles in the cache lines you touch.
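For concreteness, here is a minimal intrinsics sketch (my illustration, assuming the Table_derive macro from the question) of the [ ddI ddQ ] accumulation the compiler can generate on its own:

    #include <immintrin.h>

    /* Elements 2 and 3 are adjacent, so one (possibly unaligned) 16B load grabs
       both, and a single packed add accumulates both sums at once. */
    __m128d acc = _mm_setzero_pd();
    for (k = 0; k < 100; k++)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&Table_derive(2, n_bol, k)));
    ddI = _mm_cvtsd_f64(acc);                        /* low half  = sum of element 2 */
    ddQ = _mm_cvtsd_f64(_mm_unpackhi_pd(acc, acc));  /* high half = sum of element 3 */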


It should probably auto-vectorize even without -ffast-math, as long as you're targeting a CPU with double-precision vectors. (e.g. x86-64, or 32bit with -msse2).

gcc likes to make big prologues for potentially-unaligned data, using scalar until it reaches an aligned address. This leads to bloated code, esp. with 256b vectors and small elements. It shouldn't be too bad with double, though. Still, give clang 3.7 or clang 3.8 a try. clang auto-vectorizes potentially-unaligned accesses with unaligned loads, which have no extra cost when the data is aligned at runtime. (gcc optimizes for the hopefully-rare case where the data isn't aligned, because unaligned loads/store instructions were slower on old CPUs (e.g. Intel pre-Nehalem) even when used on aligned data.)


Your char array may be defeating the auto-vectorizer if it can't prove that each double is even 8B-aligned. As @JohnBollinger commented, that's really ugly. If you have an array of structs of 5 doubles, declare it that way!

How to write it as an array of structs:

Keep the "manual" multidimensional indexing, but make the base 1D array an array of double, or better of a struct type, so the compiler will assume every double is 8B-aligned.

Your original version also referenced the global Buffer_temp for every access to the array. (Or was it a local?) Any store that might alias it would require re-loading the base pointer. (C's aliasing rules allow char* to alias anything, but I think your cast to a double* before dereferencing saves you from that. You're not storing to the array inside the inner loop anyway, but I assume you are in the outer loop.)
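A minimal illustration of that point (hypothetical local name; the full array-of-structs rewrite follows below): convert and hoist the base pointer once, outside the loops, so the compiler never has to re-load Buffer_temp after a store.

    /* Hoisted once before the n_bol loop; deriv_base is a made-up name. */
    double *const deriv_base = (double *)(Buffer_temp + offset);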

typedef struct table_derive_entry {
    double a,b,c,d,e;
} derive_t;

void foo(void)
{
    // I wasn't clear on whether table is static/global, or per-call scratch space.
    derive_t *table = aligned_alloc(64, foo*bar*sizeof(derive_t));    // note arg order: aligned_alloc(alignment, size). Or just malloc, or a C99 variable-length array.

    // table += offset/sizeof(table[0]);   // if table is global and offset is fixed within one call...

// maybe make offset a macro arg, too?
#define Table_derive(nbol, pos)     table[offset/sizeof(derive_t) + (pos) + _interval_derive_dIdQ / sizeof(derive_t) * (nbol)]


    // ...        
    for(k=0; k< 100; k++ ) {
         ddI += Table_derive(n_bol, k).b;
         ddQ += Table_derive(n_bol, k).c;
    }
    // ...
}
#undef Table_derive

If _interval_derive_dIdQ and offset aren't always multiples of 5 * 8B, then you may need to declare double *double_table = ...; and modify your Table_derive to something like

#define Table_derive(nbol, pos)   ( ((derive_t *)(double_table + offset/sizeof(double) + _interval_derive_dIdQ / sizeof(double) * (nbol)))[pos] )

FP division:

ddI /= _interval_derive_dIdQ;
ddQ /= _interval_derive_dIdQ;

Can you hoist double inv_interval_derive_dIdQ = 1.0 / _interval_derive_dIdQ; out of the loop? Multiply is significantly cheaper than divide, esp. if latency matters or the div unit is also needed for sqrt.
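i.e. something along these lines (a sketch using the names from the question):

    double inv_interval_derive_dIdQ = 1.0 / _interval_derive_dIdQ;   /* computed once, outside the n_bol loop */
    /* ... inside the loop: */
    ddI *= inv_interval_derive_dIdQ;
    ddQ *= inv_interval_derive_dIdQ;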

Peter Cordes
  • Thanks. See my additions on the original post. – dargaud Apr 28 '16 at 14:11
  • @dargaud: updated my answer with an example of how to write it with an array of structs, and several other updates. – Peter Cordes Apr 28 '16 at 15:31
  • Woah, thanks for all the details (particularly about Clang reports). It's going to be at least a month before I work on this again, but will post results then. – dargaud Apr 29 '16 at 13:40