How to interpret the report of perf

Question

I'm learning how to use perf to profile my C++ project. Here is my code:

#include <iostream>
#include <thread>
#include <mutex>
#include <memory>
#include <vector>


std::mutex mtx;
long long_val = 0;

void do_something(long &val)
{
    std::unique_lock<std::mutex> lck(mtx);
    for(int j=0; j<1000; ++j)
        val++;
}


void thread_func()
{
    for(int i=0; i<1000000L; ++i)
    {
        do_something(long_val);
    }
}


int main(int argc, char* argv[])
{
    std::vector<std::unique_ptr<std::thread>> threads;
    for(int i=0; i<100; ++i)
    {
        threads.push_back(std::move(std::unique_ptr<std::thread>(new std::thread(thread_func))));
    }
    for(int i=0; i<100; ++i)
    {
        threads[i]->join();
    }
    threads.clear();
    std::cout << long_val << std::endl;
    return 0;
}

To compile it, I run g++ -std=c++11 main.cpp -lpthread -g and then I get the executable file named a.out.

Then I run perf record --call-graph dwarf -- ./a.out, wait for 10 seconds, and press Ctrl+C to interrupt ./a.out because it takes too long to finish.

Lastly, I run perf report -g graph --no-children and here is the output:

My goal is to find which part of the code is the heaviest. The output seems to tell me that do_something is the heaviest part (46.25%). But when I expand do_something, I cannot understand what I see: std::_Bind_simple, std::thread::_Impl, etc.

So how can I get more useful information from the output of perf report? Or is there nothing more to learn than the fact that do_something is the heaviest?

All symbols beginning with an underscore followed by an upper-case letter (e.g. `_Bind_simple`) are reserved in all scopes for the "implementation" (compiler and standard library). See [What are the rules about using an underscore in a C++ identifier?](https://stackoverflow.com/questions/228783/what-are-the-rules-about-using-an-underscore-in-a-c-identifier) for details. In your case, that means those symbols are internal and private to the "implementation", and are probably internal helper functions or classes of the standard library. — Some programmer dude, Jul 31 '19 at 06:34
You forgot to enable optimization at all when you compiled, so all the little functions that should normally inline away are actually getting called. Add `-O3` or at least `-O2` to your g++ command line. Optionally also profile-guided optimization if you really want gcc to do a good job on hot loops. — Peter Cordes, Jul 31 '19 at 06:36
Yes, the information you get when you expand `do_something` is the call stack. So what you can see here is that `do_something` was called by `thread_func` which was in turn called by `std::_Bind_simple<...>::_M_invoke<>` and so on. — Frodyne, Jul 31 '19 at 06:39
Not directly related, but why use `std::vector<std::unique_ptr<std::thread>> threads;` instead of a simple `std::vector<std::thread> threads;`? If the issue is calling the constructor, you can do it with `threads.emplace_back(thread_func);` — sklott, Jul 31 '19 at 06:41
@PeterCordes oh thanks a lot. It helps. I just tried `-O3` and now `perf report` shows me the hotspot is `futex_wake` and `futex_wait_setup`. This is useful. — Yves, Jul 31 '19 at 06:43
@Frodyne Ok, got it. So what I need is not the call stack of `do_something`... Anyway, in this case, `-O3` helps a lot. — Yves, Jul 31 '19 at 06:45

score 1 · Accepted Answer · answered Jul 31 '19 at 06:55

With the help of @Peter Cordes, I am posting this answer. If you have something more useful, please feel free to post your own answer.

You forgot to enable optimization at all when you compiled, so all the little functions that should normally inline away are actually getting called. Add -O3 or at least -O2 to your g++ command line. Optionally also profile-guided optimization if you really want gcc to do a good job on hot loops.

After adding -O3, the output of perf report becomes:

Now we get something useful: futex_wake and futex_wait_setup. The C++11 std::mutex on Linux is implemented on top of the futex system call, so the result tells us that the mutex is the hotspot in this code.
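To look at the mutex cost in isolation, here is a minimal sketch (not part of the original post) that does the same counting in two ways. The helper names `locked_sum` and `local_sum` and the scaled-down parameters are assumptions made for illustration. The first helper takes the shared mutex on every call, like `do_something` in the question; the second lets each thread count privately and takes the lock only once per thread.

```cpp
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: every call takes the shared mutex, as in the question.
long locked_sum(int num_threads, int calls_per_thread)
{
    std::mutex mtx;
    long total = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&mtx, &total, calls_per_thread] {
            for (int i = 0; i < calls_per_thread; ++i) {
                std::unique_lock<std::mutex> lck(mtx);  // contended on every call
                for (int j = 0; j < 1000; ++j)
                    ++total;
            }
        });
    }
    for (auto &th : threads)
        th.join();
    return total;
}

// Hypothetical helper: each thread counts privately and locks once at the end.
long local_sum(int num_threads, int calls_per_thread)
{
    std::mutex mtx;
    long total = 0;
    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t) {
        threads.emplace_back([&mtx, &total, calls_per_thread] {
            long local = 0;
            for (int i = 0; i < calls_per_thread; ++i)
                for (int j = 0; j < 1000; ++j)
                    ++local;
            std::lock_guard<std::mutex> lck(mtx);  // one acquisition per thread
            total += local;
        });
    }
    for (auto &th : threads)
        th.join();
    return total;
}
```

Both helpers return the same total, so calling each one from its own small `main` and profiling the two builds the same way should make the difference in lock traffic visible. How much the futex entries shrink in the second variant will depend on the machine, the compiler, and the optimization level.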

score 0 · Answer 2 · answered Jul 31 '19 at 06:41

The issue here is that your threads are all contending for the same mutex, which forces your program to hit the scheduler often.

You would get better performance if you used fewer threads.
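As a rough sketch of what "fewer threads" could look like (this is an illustration, not code from the answer), the total amount of work can be kept fixed and split across a thread count capped at what `std::thread::hardware_concurrency()` reports. The helper names `pick_thread_count` and `run_capped` are hypothetical, and the threads are stored directly in a `std::vector<std::thread>`.

```cpp
#include <algorithm>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical helper: cap the requested thread count at the hardware's
// reported concurrency, and never return less than 1.
unsigned pick_thread_count(unsigned requested)
{
    unsigned hw = std::thread::hardware_concurrency();  // may be 0 if unknown
    if (hw == 0)
        hw = 1;
    return std::max(1u, std::min(requested, hw));
}

// Hypothetical helper: perform total_calls locked calls in total, split as
// evenly as possible across the capped number of threads.
long run_capped(unsigned requested_threads, long total_calls)
{
    const long n = static_cast<long>(pick_thread_count(requested_threads));
    const long base = total_calls / n;
    const long extra = total_calls % n;  // the first `extra` threads do one more call

    std::mutex mtx;
    long total = 0;
    std::vector<std::thread> threads;
    for (long t = 0; t < n; ++t) {
        const long calls = base + (t < extra ? 1 : 0);
        threads.emplace_back([&mtx, &total, calls] {
            for (long i = 0; i < calls; ++i) {
                std::unique_lock<std::mutex> lck(mtx);
                for (int j = 0; j < 1000; ++j)
                    ++total;
            }
        });
    }
    for (auto &th : threads)
        th.join();
    return total;
}
```

For example, `run_capped(100, 1000)` performs the same 1000 locked calls regardless of how many threads end up running them, so the result is identical while the number of threads competing for the mutex is bounded. Whether this is actually faster on a given machine is something to measure with `perf` rather than assume.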

doron


Yeah, you are totally right. I think people can figure it out easily by reading the source code. The problem is getting `perf` to tell us this fact. – Yves Jul 31 '19 at 06:49
