
What is a good set of compiler options to turn on/off in order to increase the accuracy of my profiling experiment?

I'm most interested in these compilers: gcc/g++/icc and these profiling tools: Intel Vtune, Linux Perf and Oprofile. Linux OS.

It's known that enabling optimizations (function inlining, loop transformations, etc.) may change the order of the instructions, which may cause confusing (if not outright incorrect) information to be shown in a profiler/debugger. However, if I disable these optimizations I'll be profiling (and later optimizing) code that was "under-optimized"... so, what are the best practices when compiling for profiling?

JohnTortugo
  • You could simply compile with `gcc -pg` and use `gprof` (a minimal sketch follows after these comments). But profiling always disturbs the profiled code a lot. Read about [heisenbugs](http://en.wikipedia.org/wiki/Heisenbug). Also, what kind of application are you trying to optimize, and why? – Basile Starynkevitch Feb 03 '14 at 20:04
  • is "-pg" the only think I need to do to get accurate profiling info with the tools I mentioned? I'm not restricted to a specific kind of program. – JohnTortugo Feb 03 '14 at 20:29
  • IMHO *accurate profiling* is an [oxymoron](http://en.wikipedia.org/wiki/Oxymoron). Profiling is always disturbing. You need to accept that. – Basile Starynkevitch Feb 03 '14 at 20:31
  • The idea that the purpose of profiling is to get accurate measurements comes out of thin air. Suppose you get them. What do you do with them even if they are "accurate"? Will they tell you how to make it faster? If the goal is to make the code faster, then you need something that tells you what to fix, not that gives you 3-digit precision of something-or-other. – Mike Dunlavey Feb 03 '14 at 21:56
  • Well I didn't say that I want numeric precision. I'm pretty sure you guys have already debugged some optimized code and noticed that the order of your instructions was changed... many profilers (for instance VTune) are misled by such transformations. Looking for precise code annotations isn't something out of this world. – JohnTortugo Feb 03 '14 at 22:27
  • @John: And I can't figure out why the common wisdom is that you should only profile compiler optimized code. All it does is make problems harder to find, and if the code calls system functions much, it doesn't even speed up the code enough to care! I'm a contrarian, but I and many other people use [*this method*](http://stackoverflow.com/a/378024/23771), and we get *real results*, not airy wishes. – Mike Dunlavey Feb 04 '14 at 00:57
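
A minimal sketch of the `gcc -pg` / `gprof` workflow mentioned in the first comment above (the source and program names are hypothetical, and note that -pg instrumentation itself perturbs timings):

$ gcc -O2 -pg myprog.c -o myprog        # -pg inserts gprof instrumentation
$ ./myprog                              # running the program writes gmon.out
$ gprof myprog gmon.out > profile.txt   # gprof combines the binary and gmon.out into a report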

1 Answer


All of these profiling tools rely on the debug information generated by the compiler during the build. As long as the debug information captures the optimizations (especially inlining), the profiling tool will be able to map samples to the right source location. For ICC, when you build your code with optimization turned on, use the compiler option "-debug inline-debug-info". If your function is inlined, this makes the compiler record the optimization at both the call site and the callee site (where the function is defined). Below is a simple example that illustrates this:

#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <cstdlib>
using namespace std;
using namespace tbb;
long len = 0;
float *__restrict__ a;
float *__restrict__ b;
float *__restrict__ c;
class Test {
public:
    void operator()( const blocked_range<size_t>& x ) const {
        for (size_t i = x.begin(); i != x.end(); ++i) {   // line 14 of testdebug.cc, referenced in the vectorization report below
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};
int main(int argc, char* argv[]) {
    cout << atol(argv[1]) << endl;
    len = atol(argv[1]);
    a = new float[len];
    b = new float[len];
    c = new float[len];
    parallel_for(blocked_range<size_t>(0, len, 100), Test());   // Test::operator() gets inlined into parallel_for.h here
    return 0;
}

Building the above code with the following compiler options emits a vectorization report, but the report does not map the remarks to the right source lines:

$ icpc testdebug.cc -c -vec-report2 -O3
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: unsupported loop structure
tbb/parallel_for.h(127): (col. 22) remark: LOOP WAS VECTORIZED
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: unsupported loop structure
tbb/parallel_for.h(127): (col. 22) remark: LOOP WAS VECTORIZED
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/partitioner.h(164): (col. 9) remark: loop was not vectorized: existence of vector dependence

From the above report we see two "LOOP WAS VECTORIZED" messages, but they map to the TBB parallel_for.h header. There is no remark corresponding to the functor defined in our program: since the functor is invoked within the parallel_for call, its definition is inlined into parallel_for.h.

In order to capture that information, add the -debug inline-debug-info compiler option to the build, and the generated vectorization report will be as shown below:

$ icpc testdebug.cc -c -vec-report2 -O3 -debug inline-debug-info
tbb/partitioner.h(171): (col. 9) remark: loop was not vectorized: unsupported loop structure
testdebug.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
tbb/partitioner.h(164): (col. 9) remark: loop was not vectorized: unsupported loop structure
testdebug.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
tbb/partitioner.h(245): (col. 33) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/partitioner.h(265): (col. 52) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/partitioner.h(164): (col. 9) remark: loop was not vectorized: existence of vector dependence

From the above report it is now clear that the "LOOP WAS VECTORIZED" remark refers to testdebug.cc(14), i.e. the loop inside our functor.
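
The answer above covers ICC; for the gcc/perf combination the question also asks about, the usual advice is similar: keep optimizations on but preserve debug information and frame pointers so the profiler can map samples back through inlined code. A minimal sketch assuming the same testdebug.cc source (the exact flag set is an assumption, not part of the original answer):

$ g++ -O2 -g -fno-omit-frame-pointer testdebug.cc -o testdebug -ltbb   # optimized build that keeps DWARF debug info
$ perf record -g ./testdebug 1000000                                   # sample the run and collect call graphs
$ perf report                                                          # inspect where the samples landed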

Anoop - Intel