The code looks like this and the inner loop takes a huge amount of time:
#define _table_derive ((double*)(Buffer_temp + offset))
#define Table_derive(m,nbol,pos) _table_derive[(m) + 5*((pos) + _interval_derive_dIdQ * (nbol))]

char *Buffer_temp = malloc(...);

for (n_bol = 0; n_bol < 1400; n_bol++) {        // long loop here
    [lots of code here, hundreds of lines with computations on doubles, other loops, etc]

    double ddI = 0, ddQ = 0;

    // This is the original code
    for (k = 0; k < 100; k++) {
        ddI += Table_derive(2, n_bol, k);
        ddQ += Table_derive(3, n_bol, k);
    }
    ddI /= _interval_derive_dIdQ;
    ddQ /= _interval_derive_dIdQ;

    [more code here]
}
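For context, my reading of the macros (same names as above): consecutive values of k are 5 doubles apart, so the two sums read elements 2 and 3 of each 5-double group, i.e. 16 bytes out of every 40, with a 40-byte stride:

double *base = (double *)(Buffer_temp + offset);   // what _table_derive expands to
// Table_derive(m, n_bol, k) is base[(m) + 5*((k) + _interval_derive_dIdQ * (n_bol))]
// => ddI reads base[...+2], ddQ reads base[...+3]; each k advances the index by 5 doubles (40 bytes)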
oprofile tells me that most of the runtime is spent here (2nd column is % of time):
129304 7.6913 :for(k=0; k< 100; k++) {
275831 16.4070 :ddI += Table_derive(2,n_bol,k);
764965 45.5018 :ddQ += Table_derive(3,n_bol,k);
My first question is: can I rely on oprofile to point me at the right place where the code is slow? (I tried with -Og and -Ofast and the result is basically the same.)
My second question is: how come this very simple loop is slower than the sqrt, atan2 and the many hundreds of lines of computation that come before it? I know I'm not showing all the code, but there's a lot of it and it doesn't make sense to me.
I've tried various optimizer tricks to either vectorize (doesn't work) or unroll (works), but for little gain; for instance:
typedef double aligned_double __attribute__((aligned(8)));
typedef const aligned_double* SSE_PTR;

SSE_PTR TD = (SSE_PTR)&Table_derive(2, n_bol, 0); // We KNOW the alignment is correct because offset is a multiple of 8

for (k = 0; k < 100; k++, TD += 5) {
    #pragma Loop_Optimize Unroll No_Vector
    ddI += TD[0];
    ddQ += TD[1];
}
I've checked the optimizer report (compiled with "-Ofast -g -march=native -fopt-info-all=missed.info -funroll-loops"): in this case I get "loop unrolled 9 times", but if I try to vectorize, I get (in short): "can't force alignment of ref", "vector alignment may not be reachable", "Vectorizing an unaligned access", "Unknown alignment for access: *(prephitmp_3784 + ((sizetype) _1328 + (long unsigned int) (n_bol_1173 * 500) * 2) * 4)"
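For what it's worth, here's the kind of manual SSE2 I could try next. This is only an untested sketch: it relies on the fact that the two summands TD[0] and TD[1] are adjacent in memory, and it uses unaligned loads since the compiler can't prove alignment either:

#include <emmintrin.h>   // SSE2 intrinsics

__m128d acc = _mm_setzero_pd();
const double *TD = &Table_derive(2, n_bol, 0);
for (k = 0; k < 100; k++, TD += 5)
    acc = _mm_add_pd(acc, _mm_loadu_pd(TD));   // loads {TD[0], TD[1]} = {dI_k, dQ_k}
double pair[2];
_mm_storeu_pd(pair, acc);
ddI += pair[0];
ddQ += pair[1];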
Any way to speed this up?
ADDENDUM: Thanks all for the comments, I'll try to answer here:
- yes, I know the code is ugly (it's not mine), and you haven't seen the actual original (that's a huge simplification)
- I'm stuck with this array: the C code is in a library, and the large array, once processed and modified by the C, gets passed on to the caller (either IDL, Python or C).
- I know it would be better to use some structs instead of casting char* to a complicated multidimensional double*, but see above. Structs may not have been part of the C spec when this program was first written (just kidding... maybe).
- I know that for the vectorizer it's better to have structs of arrays than arrays of structs, but, sigh... see above (there's a sketch of what that would look like after this list).
- there's an actual outer loop (in the calling program), so the total size of this monolithic array is around 2 GB
- as is, it takes about 15 minutes to run with no optimization, and one minute after I rewrote some code (faster atan2, some manual alignment inside the array...) and used -Ofast and -march=native
- due to changes in the hardware constraints, I'm trying to go faster to keep up with the dataflow
- I tried with Clang and the gains were slight (a few seconds), but I do not see an option to get an optimization report such as -fopt-info. Is looking at the assembly the only way to know what's going on?
- the system is a beastly 64-core machine with 500 GB of RAM, but I haven't been able to insert any OpenMP pragmas to parallelize the above code (I've tried): it reads a file, decompresses it entirely in memory (2 GB), analyses it in sequence (things like '+=') and spits out some results to the calling IDL/Python. All on a single core (but the other cores are quite busy with the actual acquisition and post-processing). :(
- Useless, thanks for the excellent suggestion: removing ddQ += ... seems to transfer the % of time to the previous line: 376280 39.4835:ddI+=...
- which brings us to even better: removing both (hence the entire loop) saves... nothing at all! So I guess, as Peter said, I can't trust the profiler. If I profile the loop-less program, I get timings that are more evenly spread out (previously only 3 lines were above 1s, now about 10, all of them nonsensical, like simple variable assignments).
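For illustration only (I can't change the layout, and all the names and sizes below are made up for the example), a struct-of-arrays version of the same data, which the vectorizer would like much better, would look something like this:

// Hypothetical layout (dimensions hard-coded just for the example): one contiguous
// array per derivative component, so the inner sum becomes a unit-stride loop.
typedef struct {
    double dI[1400][100];   // what Table_derive(2, n_bol, k) holds today
    double dQ[1400][100];   // what Table_derive(3, n_bol, k) holds today
    // ... the other three components ...
} DeriveSoA;

// The inner loop would then be trivially vectorizable:
for (k = 0; k < 100; k++) {
    ddI += soa->dI[n_bol][k];
    ddQ += soa->dQ[n_bol][k];
}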
I guess that inner loop was a red herring from the start; I'll restart my optimization using manual timings. Thanks.
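For those manual timings I'll simply bracket the suspect blocks with something like this (clock_gettime on CLOCK_MONOTONIC, accumulated across the n_bol iterations so the numbers are big enough to mean something):

#include <time.h>

static double now_sec(void)                 // monotonic wall-clock time in seconds
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

/* around each suspect block: */
double t0 = now_sec();
/* ... block to measure ... */
time_in_block += now_sec() - t0;            // accumulated over all n_bol iterations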