3

Multi-thread program with heavy kernel work( lock unlock), try to find the hot spot at my code use profiling tools.

I wrote a minimum reproducible program. Compile it. Use profiling tool such as perf , gprof with callstack record. None of then can report the hot spot at my code, but report function at kernel or libstdc++.so.

#include <iostream>
#include <mutex>
#include <thread>
#include <memory>
#include <vector>

std::mutex mtx;
long long_val = 0;

void do_something(long &val)
{
    std::unique_lock<std::mutex> lck(mtx);
    for(int j=0; j<1000; ++j)
        val++;
}


void thread_func()
{
    for(int i=0; i<1000000L; ++i)
    {
        do_something(long_val);
    }
}


int main(int argc, char* argv[])
{
    std::vector<std::unique_ptr<std::thread>> threads;
    for(int i=0; i<100; ++i)
    {
        threads.push_back(std::move(std::unique_ptr<std::thread>(new std::thread(thread_func))));
    }
    for(int i=0; i<100; ++i)
    {
        threads[i]->join();
    }
    threads.clear();
    std::cout << long_val << std::endl;
    return 0;
}

As we can se, do_something() is my program's hot spot. because too much of unique_lock cost lots of kernel work, and 1000 times of inc operate at the function.

I want profiling tool tell me that, It is do_something() make a lot of lock, cause lots of kernel work. but from perf, I only can found it is lock slow down my code, but where made these lock operates at my code? it is not clear.

Compile:

g++ -o a.out -O3 -std=c++11 -fno-omit-frame-pointer -pthread -pg main.cpp

Run program./a.out, htop report kernel taken half of CPU resource.

Perf record:

perf record -g ./a.out

Perf report:

      Children      Self  Command  Shared Object        Symbol
    -   98.64%     0.00%  a.out    libstdc++.so.6.0.21  [.] 0x00000000000b8c80
       - 0xb8c80
          + 41.80% __lll_unlock_wake
          + 35.39% __lll_lock_wait
        12.77% pthread_mutex_lock
        7.69% pthread_mutex_unlock
        0.70% _Z11thread_funcv
    -   49.21%     0.88%  a.out    [kernel.kallsyms]    [k] entry_SYSCALL_64_after_hwframe
       + 48.33% entry_SYSCALL_64_after_hwframe
         0.88% 0xb8c80
    -   47.64%     1.06%  a.out    [kernel.kallsyms]    [k] do_syscall_64
       + 46.59% do_syscall_64
       + 1.06% 0xb8c80
    -   47.18%     1.44%  a.out    [kernel.kallsyms]    [k] sys_futex
       + 45.74% sys_futex
       + 1.44% 0xb8c80
    -   45.55%     1.65%  a.out    [kernel.kallsyms]    [k] do_futex
       + 43.90% do_futex
       + 1.65% 0xb8c80
    -   41.80%     1.30%  a.out    libpthread-2.23.so   [.] __lll_unlock_wake
       + 40.50% __lll_unlock_wake
       + 1.30% 0xb8c80
    -   35.39%     4.53%  a.out    libpthread-2.23.so   [.] __lll_lock_wait
       + 30.86% __lll_lock_wait
       + 4.53% 0xb8c80
    -   20.51%     5.40%  a.out    [kernel.kallsyms]    [k] futex_wake
       + 15.11% futex_wake
       + 5.40% 0xb8c80
    -   19.41%     0.88%  a.out    [kernel.kallsyms]    [k] futex_wait
       + 18.53% futex_wait
       + 0.88% 0xb8c80
    -   16.92%     6.25%  a.out    [kernel.kallsyms]    [k] _raw_spin_lock
       + 10.67% _raw_spin_lock
       + 6.25% 0xb8c80
    -   12.77%    12.73%  a.out    libpthread-2.23.so   [.] pthread_mutex_lock
         12.73% 0xb8c80
        pthread_mutex_lock
    +   11.80%     5.41%  a.out    [kernel.kallsyms]    [k] futex_wait_setup
    +   11.05%    11.05%  a.out    [kernel.kallsyms]    [k] syscall_return_via_sysret
    +   10.68%    10.67%  a.out    [kernel.kallsyms]    [k] native_queued_spin_lock_slowpath
    +    7.69%     7.65%  a.out    libpthread-2.23.so   [.] pthread_mutex_unlock
    +    7.36%     0.11%  a.out    [kernel.kallsyms]    [k] wake_up_q
    +    7.17%     1.27%  a.out    [kernel.kallsyms]    [k] try_to_wake_up
    +    4.42%     0.23%  a.out    [kernel.kallsyms]    [k] futex_wait_queue_me
    +    4.17%     0.08%  a.out    [kernel.kallsyms]    [k] schedule
    +    3.84%     0.75%  a.out    [kernel.kallsyms]    [k] __schedule
    +    2.70%     0.01%  a.out    [kernel.kallsyms]    [k] ttwu_do_activate
    +    2.49%     2.48%  a.out    [kernel.kallsyms]    [k] get_futex_value_locked
    +    2.38%     0.11%  a.out    [kernel.kallsyms]    [k] activate_task
    +    2.26%     0.15%  a.out    [kernel.kallsyms]    [k] enqueue_task_fair
    +    1.88%     1.88%  a.out    [unknown]            [k] 0xfffffe000000601b
    +    1.84%     0.15%  a.out    [kernel.kallsyms]    [k] __task_rq_lock
    +    1.78%     1.78%  a.out    [unknown]            [k] 0xfffffe000005e01b
    +    1.77%     1.77%  a.out    [unknown]            [k] 0xfffffe000003201b
    +    1.67%     0.02%  a.out    [kernel.kallsyms]    [k] deactivate_task
    +    1.66%     1.66%  a.out    [unknown]            [k] 0xfffffe000008a01b  

How can I use a profiling tool, to figure out this type hot spot at my really program. I don't want use "binary chop comment" to solve it every time. Thanks!

Bin Zhou
  • 71
  • 7
  • Have you tried a GUI on top of perf? Or VTune? – Matthieu Brucher Jan 21 '19 at 12:57
  • @MatthieuBrucher I profiling program at remote server without X via ssh. We have no VTune license, havn't tryed it. – Bin Zhou Jan 21 '19 at 13:04
  • 2
    Try using perf with [FlameGraph](https://github.com/brendangregg/FlameGraph) (yeah, it is exceptionally ugly perl gibberish, however it does produce nice visualizations). – user7860670 Jan 21 '19 at 13:17
  • @VTT but it seems that perf doesn't make kernel function call as userspace function's children. I tryed `hotspot` gui, doesn't help. – Bin Zhou Jan 21 '19 at 13:34
  • Doesn't their [example](http://www.brendangregg.com/FlameGraphs/cpu-bash-flamegraph.svg) show exactly this - stacks with kernel function calls from userspace functions? – user7860670 Jan 21 '19 at 13:40
  • It seems someone has found the solution to make gprof work with multithreaded applications: http://sam.zoy.org/writings/programming/gprof.html – Oliv Jan 21 '19 at 13:41
  • This solution is already given on so: https://stackoverflow.com/a/922478/5632316 – Oliv Jan 21 '19 at 13:44
  • Possible duplicate of [Using gprof with pthreads](https://stackoverflow.com/questions/922475/using-gprof-with-pthreads) – Oliv Jan 21 '19 at 13:44
  • @Oliv hook pthread function is not acceptable. And gprof is not my key point, but how to make `perf report` mark kernel function as userspace function's children is. – Bin Zhou Jan 21 '19 at 14:04
  • @VTT Thanks, I will have a try. – Bin Zhou Jan 21 '19 at 14:22
  • @BinZhou I have been able to get the information you need. You must change perf command line `perf record --call-graph dwarf ./a.out` (you may need to pass -g at the compiler or not to get debug info in the object file) then `perf report --call-graph --stdio -G`. – Oliv Jan 21 '19 at 14:28
  • @BinZhou any luck ? – hudac Dec 16 '19 at 12:38

1 Answers1

1

perf uses frame pointer unwinding by default, which is fast, but unreliable. Even if your application does not omit frame pointers, the libraries may.

Alternatively you can use --call-graph with either dwarf or lbr.

dwarf records stack chunks during sampling that are later evaluated. It can give more details, but creates huge traces and may have more perturbation.

lbr requires hardware support available at newer Intel processors.

It will show something like this:

99.25%     2.61%  a.out    a.out                [.] thread_func
        |          
        |--97.67%--thread_func
        |          |          
        |          |--45.33%--pthread_mutex_lock@plt
        |          |          
        |          |--28.65%--pthread_mutex_unlock@plt
        |          |          
        |          |--18.88%--__pthread_mutex_lock
        |          |          __lll_lock_wait
        |          |          
        |           --4.75%--__pthread_mutex_unlock_usercnt
        |                     __lll_unlock_wake
        |          
         --0.72%--__pthread_mutex_lock
Zulan
  • 21,896
  • 6
  • 49
  • 109
  • `g++ -o a.out -O3 -std=c++11 -pthread -g main.cpp` `perf record --call-graph lbr ./a.out` `perf report` top3 is `entry_SYSCALL_64_after_hwframe`%48.81 `do_syscall_64`%47.25 `sys_futex`%47.01 – Bin Zhou Jan 22 '19 at 06:38
  • But if compile with `g++ -o a.out -O2 -std=c++11 -pthread -g main.cpp`, then top3 of `Children` is `_Z11thread_funcv`%79.18 `pthread_mutex_unlock`%57.51 `entry_SYSCALL_64_after_hwframe`%49.56. Could you please help me understand why this happen. – Bin Zhou Jan 22 '19 at 06:42
  • result of `uname -a`: Linux PC 4.15.0-43-generic #46~16.04.1-Ubuntu SMP Fri Dec 7 13:31:08 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux ` – Bin Zhou Jan 22 '19 at 06:46