All profiling tools rely on the debug information that the compiler generates during the build. As long as that debug information captures the optimizations applied (especially inlining), a profiling tool can map the generated code back to the right source location. With ICC, when you build your code with optimization turned on, add the compiler option "-debug inline-debug-info". If a function is inlined, this option makes sure the optimization is reported at the callee site (where the function is defined) as well as at the call site. The simple example below illustrates this:
#include <iostream>
#include <tbb/tbb.h>
#include <tbb/parallel_for.h>
#include <cstdlib>
using namespace std;
using namespace tbb;
long len = 0;
float *__restrict__ a;
float *__restrict__ b;
float *__restrict__ c;
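class Test {  // functor that parallel_for applies to each sub-range
public:
    void operator()( const blocked_range<size_t>& x ) const {
        for (long i=x.begin(); i!=x.end(); ++i ) {  // loop reported at testdebug.cc(14)
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};
int main(int argc, char* argv[]) {
    if (argc < 2) {  // the array length is a required argument
        cerr << "usage: " << argv[0] << " <length>" << endl;
        return 1;
    }
    len = atol(argv[1]);
    cout << len << endl;
    a = new float[len];
    b = new float[len];
    c = new float[len];
    // Inputs are left uninitialized: only the compiler's reports matter here.
    parallel_for(blocked_range<size_t>(0,len, 100), Test() );
    delete[] a;
    delete[] b;
    delete[] c;
    return 0;
}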
Building the above code with the following compiler options emits a vectorization report that does not map the vectorized loops to the right source lines:
$ icpc testdebug.cc -c -vec-report2 -O3
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: unsupported loop structure
tbb/parallel_for.h(127): (col. 22) remark: LOOP WAS VECTORIZED
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: unsupported loop structure
tbb/parallel_for.h(127): (col. 22) remark: LOOP WAS VECTORIZED
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/parallel_for.h(127): (col. 22) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/partitioner.h(164): (col. 9) remark: loop was not vectorized: existence of vector dependence
The above report contains two "LOOP WAS VECTORIZED" messages, but both map to the TBB header parallel_for.h. There is no report corresponding to the functor in our program. Because the functor is invoked within parallel_for, its definition is inlined into parallel_for.h, and the report attributes the loop to that call site.
To capture that information, add the -debug inline-debug-info compiler option to the build. The vectorization report then looks as shown below:
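The call-site versus callee-site distinction can be reproduced without TBB. The sketch below is only an illustration: IndexRange, run_body, MulAdd and mul_add are hypothetical names standing in for blocked_range, the library template that invokes the body, and the Test functor. They are not part of the TBB API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for tbb::blocked_range<size_t>: a half-open range.
struct IndexRange {
    std::size_t first;
    std::size_t last;
    std::size_t begin() const { return first; }
    std::size_t end() const { return last; }
};

// Hypothetical stand-in for the library template that invokes the body.
// In the TBB example this call site lives inside a library header.
template <typename Body>
inline void run_body(const IndexRange& range, const Body& body) {
    assert(range.begin() <= range.end());
    body(range);  // call site: after inlining, the functor's loop is emitted here
}

// Callee site: same arithmetic as the Test functor, on caller-owned arrays.
struct MulAdd {
    const float* a;
    const float* b;
    float* c;
    void operator()(const IndexRange& r) const {
        for (std::size_t i = r.begin(); i != r.end(); ++i) {
            c[i] = (a[i] * b[i]) + b[i];
        }
    }
};

// Convenience wrapper (illustration only): computes c = a * b + b.
inline std::vector<float> mul_add(const std::vector<float>& a,
                                  const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    run_body(IndexRange{0, a.size()}, MulAdd{a.data(), b.data(), c.data()});
    return c;
}
```

Once the compiler inlines MulAdd::operator() into run_body, the loop exists only inside run_body's generated code. Without extra debug information, a report can therefore only name the line of the body(range) call, which is the situation seen above with parallel_for.h.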
$ icpc testdebug.cc -c -vec-report2 -O3 -debug inline-debug-info
tbb/partitioner.h(171): (col. 9) remark: loop was not vectorized: unsupported loop structure
testdebug.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
tbb/partitioner.h(164): (col. 9) remark: loop was not vectorized: unsupported loop structure
testdebug.cc(14): (col. 37) remark: LOOP WAS VECTORIZED
tbb/partitioner.h(245): (col. 33) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/partitioner.h(265): (col. 52) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate
tbb/partitioner.h(164): (col. 9) remark: loop was not vectorized: existence of vector dependence
From the above report it is clear that the "LOOP WAS VECTORIZED" messages now map to testdebug.cc(14), the loop inside the functor.
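On a larger build the report can be long, so it may help to pick out the remarks that point at your own source file. The following is a minimal sketch, assuming every remark follows the "file(line): (col. N) remark: message" layout shown in the reports above. Remark, parse_remark and is_vectorized_in are hypothetical helper names, not part of any Intel tool.

```cpp
#include <cstddef>
#include <exception>
#include <string>

// One parsed remark from a vectorization report (hypothetical helper type).
struct Remark {
    std::string file;
    long line = 0;
    long col = 0;
    std::string message;
};

// Parses "file(line): (col. N) remark: message".
// Returns false if the text does not match that layout.
bool parse_remark(const std::string& text, Remark& out) {
    const std::string col_tag = "): (col. ";
    const std::string msg_tag = ") remark: ";
    const std::size_t open = text.find('(');
    const std::size_t close = text.find(col_tag);
    if (open == std::string::npos || close == std::string::npos || close < open) {
        return false;
    }
    const std::size_t col_begin = close + col_tag.size();
    const std::size_t col_end = text.find(msg_tag, col_begin);
    if (col_end == std::string::npos) {
        return false;
    }
    try {
        out.line = std::stol(text.substr(open + 1, close - open - 1));
        out.col = std::stol(text.substr(col_begin, col_end - col_begin));
    } catch (const std::exception&) {
        return false;  // line or column was not a number
    }
    out.file = text.substr(0, open);
    out.message = text.substr(col_end + msg_tag.size());
    return true;
}

// True if the remark reports a vectorized loop in the given source file.
bool is_vectorized_in(const Remark& r, const std::string& file) {
    return r.file == file && r.message == "LOOP WAS VECTORIZED";
}
```

Feeding each report line to parse_remark and keeping those for which is_vectorized_in(remark, "testdebug.cc") holds leaves only the two testdebug.cc(14) entries from the second report, and none from the first.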