How to profile an optimized code and speed up loops

Question

I know that it doesn't make a lot of sense to profile a code that has been compiled without optimizations but when I try to compile it with -Ofast on and profile it with gprof for example, I get useless data like for example many functions with the same time% and without calls# information:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  Ts/call  Ts/call  name    
 81.35      0.74     0.74                             void cv::Mat::forEach_impl<cv::Vec<unsigned char, 3>, A_estimation(cv::Mat&, std::vector<cv::Mat, std::allocator<cv::Mat> >, int, int)::{lambda(cv::Vec<unsigned char, 3>&, int const*)#1}>(A_estimation(cv::Mat&, std::vector<cv::Mat, std::allocator<cv::Mat> >, int, int)::{lambda(cv::Vec<unsigned char, 3>&, int const*)#1} const&)::PixelOperationWrapper::operator()(cv::Range const&) const
 10.99      0.84     0.10                             void cv::Mat::forEach_impl<cv::Vec<float, 3>, Parallel_process::operator()(cv::Range const&) const::{lambda(cv::Vec<float, 3>&, int const*)#1}>(Parallel_process::operator()(cv::Range const&) const::{lambda(cv::Vec<float, 3>&, int const*)#1} const&)::PixelOperationWrapper::operator()(cv::Range const&) const

When I profile the code which is not optimized I get an easy and useful information from gprof:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 53.86      0.07     0.07    42236     0.00     0.00  A_estimation(cv::Mat&, std::vector<cv::Mat, std::allocator<cv::Mat> >, int, int)::{lambda(cv::Vec<unsigned char, 3>&, int const*)#1}::operator()(cv::Vec<unsigned char, 3>&, int const*) const
 23.08      0.10     0.03  8259774     0.00     0.00  float const& cv::Mat::at<float>(int, int) const
  7.69      0.11     0.01  2812992     0.00     0.00  float& cv::Mat::at<float>(int, int)

That's an example of the code I want to find where it is hot. I found out that it takes 53.86% of the time in that part of the code in which it is called 46945 times:

I extracted that function from my code so as you can compile it:

#include <opencv2/highgui.hpp>
#include <iostream>

typedef std::vector<std::vector<int> > Matrix;

std::vector<int> A_estimation(cv::Mat& src_temp, std::vector<cv::Mat> rgb, int cols, int rows)
{
    //////////////////////////////
    //cv::Mat histSum = cv::Mat::zeros( 256, 1, CV_8UC3 );
    Matrix histSum(3, std::vector<int>(256,0));
    //cv::Mat src_temp = src.clone();
    //src_temp.convertTo(src_temp, CV_8UC3);    
    src_temp.forEach<cv::Vec3b>
    (
      [&histSum](cv::Vec3b &pixel, const int* po) -> void
      {
        ++histSum[0][pixel[0]];
        ++histSum[1][pixel[1]];
        ++histSum[2][pixel[2]];
      }
    );

    std::vector<int> A(3, 255);
    [&A, rows, cols, &histSum]{

        for (auto index=8*rows*cols/1000; index>histSum[0][A[0]]; --A[0])
             index -= histSum[0][A[0]];
        for (auto index=8*rows*cols/1000; index>histSum[1][A[1]]; --A[1])
             index -= histSum[1][A[1]];
        for (auto index=8*rows*cols/1000; index>histSum[2][A[2]]; --A[2])
             index -= histSum[2][A[2]];
        return A;
   }();
    return A;
        //auto AA=A_estim_lambda();
}

int main(int argc, char* argv[])
{
  cv::Mat src_temp = cv::imread(argv[1]);
  auto rows=src_temp.rows,
       cols=src_temp.cols;
  std::vector<cv::Mat> rgb;
  cv::split(src_temp, rgb);
  auto A = A_estimation(src_temp, rgb, cols, rows);

  //Do sth with A
}

Compilation:

g++ -std=c++1z -Wall -Weffc++ -Ofast test.cpp -o test -fopenmp `pkg-config --cflags --libs opencv`

Execution

./test frame.jpg

I've two questions:

Are those information correct because they're taken from non-optimized code and if not how can I compile optimized code? and any tips to how to speed up those loops?

Please post a [MCVE]. Modern compilers produce optimal code for this sort of thing. The best hope for speeding it up is if you can re-factor the data so that consecutive memory accesses tend to hit the same cache line. — Jive Dadson, Feb 23 '18 at 22:29
Profiling non-optimized code is fruitless. You need to profile an actual application, fully optimized. — Jive Dadson, Feb 23 '18 at 22:34
The way to speed up non-optimized code is to optimize it. :-) Otherwise you could become a victim of [The Streetlight Effect](https://en.wikipedia.org/wiki/Streetlight_effect) - searching where it is easy to see things. A guess is that the `Mat::at` accessors do range checking in debug mode and that's why they show up in that list. Making them faster will not help the production code. — Bo Persson, Feb 23 '18 at 22:47
@BoPersson yes that's my question how can I compile/profile it correctly! — Ja_cpp, Feb 23 '18 at 22:51
Most people will not be able to compile that because of the dependency on opencv. — Jive Dadson, Feb 23 '18 at 22:52
Forget gprof, compile with symbols (`-g`), run under `perf record`; you can then examine the results with `perf report` or the great [Hotspot GUI](https://github.com/KDAB/hotspot). — Matteo Italia, Feb 23 '18 at 23:01
Also, careful with `-Ofast`, it can break code that expects IEEE754 math to work according to spec. `-O3` is safer. — Matteo Italia, Feb 23 '18 at 23:04

Mike Dunlavey · Answer 1 · 2018-02-25T21:46:44.547

Are you up for a contrary viewpoint?

First, speedups fall into two categories: things the compiler can fix, and things only you can fix. The optimizer will not fix the things only you can fix. Furthermore, it will not make them stand out, because speedups don't add together, they multiply.

Second, think about numbers - percents or fractions of time. If there is a speedup you can fix (which there is, probably several), it takes a certain fraction of time. Like say 30%.
That means if you can completely remove it, the code will run in 70% of the time it did before. The speedup factor will be 100/70 or 1.43x.
That also means if you just manually pause the program, at random, the chance is 30% that it will be in the process of doing the wasteful thing.
So if you just manually pause it 20 times (not very many) the wasteful thing will appear about 6 of the times. If you see it doing something that could be improved on multiple pauses, no matter how you describe it, you've found a speedup.
Of course, it might have only been wasting 25%, or as much as 35%, but do you care? Hint: No you don't care, because you found the problem, and that's more important than knowing exactly what it costs.

Here's where the fact that you've got multiple speedups really pays off. Each speedup you fix doesn't just save some time. It multiplies two things: It multiplies the total speedup factor, and it multiplies the fraction of time each remaining problem takes, making them stand out more.

That's the reasoning behind this approach.

Once you've fixed your speedups, let the optimizer do its speedups.

How to profile an optimized code and speed up loops

1 Answers1