1

I used Perf to extract call graphs in an evince benchmark. The command used is as follows:

sudo perf record -d --call-graph dwarf -c 10000 -e mem_load_uops_retired.l3_miss:uppp /opt/evince-3.28.4/bin/evince

I extracted the backtraces using Perf script and found out that there are many corrupted backtrace instances. Some contained repeated redundant function calls, such as this:

EvJobScheduler 10021  8653.926478:        100 mem_load_uops_retired.l3_miss:uppp:     7fffd1062a00         5080022 N/A|SNP N/A|TLB N/A|LCK N/A
   7ffff4b07207 tcache_get+0x197 (inlined)
   7ffff4b07207 __GI___libc_malloc+0x197 (inlined)
   7fffd9872fb9 gmalloc+0x59 (inlined)
   7fffd9872fb9 gmallocn+0x59 (/usr/lib/x86_64-linux-gnu/libpoppler.so.73.0.0)
   7fffd9872fb9 gmallocn+0x59 (/usr/lib/x86_64-linux-gnu/libpoppler.so.73.0.0)
   7fffd9951e6f _ZN8TextLine8coalesceEP10UnicodeMap+0xff (/usr/lib/x86_64-linux-gnu/libpoppler.so.73.0.0)
   7fffd9952f82 _ZN9TextBlock8coalesceEP10UnicodeMapd+0x752 (/usr/lib/x86_64-linux-gnu/libpoppler.so.73.0.0)
   7fffd995bc37 _ZN8TextPage8coalesceEbdb+0x1507 (/usr/lib/x86_64-linux-gnu/libpoppler.so.73.0.0)
   7fffd995cb71 _ZN13TextOutputDev7endPageEv+0x31 (/usr/lib/x86_64-linux-gnu/libpoppler.so.73.0.0)
   7fffe803c6d2 _ZL26poppler_page_get_text_pageP12_PopplerPage+0x92 (/usr/lib/x86_64-linux-gnu/libpoppler-glib.so.8.9.0)
   7fffe803deb3 poppler_page_get_selection_region+0x63 (/usr/lib/x86_64-linux-gnu/libpoppler-glib.so.8.9.0)
   7fffe82ab650 [unknown] (/opt/evince-3.28.4/lib/evince/4/backends/libpdfdocument.so)
   7ffff795f165 ev_job_page_data_run+0x2f5 (/opt/evince-3.28.4/lib/libevview3.so.3.0.0)
   7ffff7961309 ev_job_thread+0xe9 (inlined)
   7ffff7961309 ev_job_thread_proxy+0xe9 (/opt/evince-3.28.4/lib/libevview3.so.3.0.0)
   7ffff5492194 g_thread_proxy+0x54 (/usr/lib/x86_64-linux-gnu/libglib-2.0.so.0.5600.4)
   7ffff4e686da start_thread+0xda (/lib/x86_64-linux-gnu/libpthread-2.27.so)
   7ffff4b9188e __GI___clone+0x3e (inlined)

There are two consecutive gmallocs() which is not correct. Some were function calls were callers and callees did not match, such as this:

evince 10015  8640.962182:        100 mem_load_uops_retired.l3_miss:uppp:     7fffdc005030         5080022 N/A|SNP N/A|TLB N/A|LCK N/A
   7ffff5a3275f g_action_get_enabled+0x3f (/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.5600.4)
   7ffff5a2ffcc g_simple_action_group_query_action+0x3c (/usr/lib/x86_64-linux-gnu/libgio-2.0.so.0.5600.4)
   7ffff7130d8c gtk_action_muxer_query_action+0xac (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff7130d4c gtk_action_muxer_query_action+0x6c (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff7130d4c gtk_action_muxer_query_action+0x6c (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff7130d4c gtk_action_muxer_query_action+0x6c (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff7130d4c gtk_action_muxer_query_action+0x6c (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff7130994 gtk_action_muxer_action_added+0x64 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff575805d signal_emit_unlocked_R+0xf4d (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff5760714 g_signal_emit_valist+0xa74 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff71309e2 gtk_action_muxer_action_added+0xb2 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff575805d signal_emit_unlocked_R+0xf4d (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff5760714 g_signal_emit_valist+0xa74 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff71309e2 gtk_action_muxer_action_added+0xb2 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff575805d signal_emit_unlocked_R+0xf4d (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff5760714 g_signal_emit_valist+0xa74 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff71309e2 gtk_action_muxer_action_added+0xb2 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff574510c g_closure_invoke+0x19c (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff575805d signal_emit_unlocked_R+0xf4d (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff5760714 g_signal_emit_valist+0xa74 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff576112e g_signal_emit+0x8e (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff71309e2 gtk_action_muxer_action_added+0xb2 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff7131482 gtk_action_muxer_set_parent+0x1d2 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff73c46f8 gtk_widget_set_parent+0x198 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   7ffff73d0228 gtk_window_set_titlebar+0xb8 (/usr/lib/x86_64-linux-gnu/libgtk-3.so.0.2200.30)
   55555558d391 ev_window_init+0x2e1 (/opt/evince-3.28.4/bin/evince)
   7ffff57699c4 g_type_create_instance+0x1e4 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff574a747 g_object_new_internal+0x2e7 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff574c5bf g_object_new_valist+0x3cf (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   7ffff574c938 g_object_new+0x98 (/usr/lib/x86_64-linux-gnu/libgobject-2.0.so.0.5600.4)
   55555558f192 ev_window_new+0x42 (/opt/evince-3.28.4/bin/evince)
   5555555778f4 ev_application_open_recent_view+0x14 (/opt/evince-3.28.4/bin/evince)
   555555573733 load_files+0x473 (inlined)
   555555573733 main+0x473 (/opt/evince-3.28.4/bin/evince)
   7ffff4a91b96 __libc_start_main+0xe6 (/lib/x86_64-linux-gnu/libc-2.27.so)
   555555573899 _start+0x29 (/opt/evince-3.28.4/bin/evince)

This is wrong because:

1) gtk_action_muxer_action_added() calls ‍‍‍g_action_group_query_action() at offset 0x64 (i.e., 100 in decimal):

0x00000000000f898f <+95>:   push   %rax
0x00000000000f8990 <+96>:   callq  0x84e20 <g_action_group_query_action@plt>
0x00000000000f8995 <+101>:  test   %eax,%eax

2) gtk_action_muxer_query_action() calls g_action_group_query_action() at offset 0x6c (i.e., 108 in decimal):

0x00000000000f8d45 <+101>:  mov    %rbp,%rsi
0x00000000000f8d48 <+104>:  callq  0x84e20 <g_action_group_query_action@plt>
0x00000000000f8d4d <+109>:  pop    %rdx

3) gtk_action_muxer_query_action() calls g_action_group_query_action() at offset 0xac (i.e., 172 in decimal):

0x00000000000f8d85 <+165>:  mov    %r12,%rdx
0x00000000000f8d88 <+168>:  callq  0x84e20 <g_action_group_query_action@plt>
0x00000000000f8d8d <+173>:  pop    %rsi

As it can be seen, none of these match the reported backtrace. Does the problem occur when Perf tries to match function names to addresses, or the addresses are wrong, by themselves? How can I revise the problem?

TheAhmad
  • 810
  • 1
  • 9
  • 21
  • 1
    relative `call` is a 5-byte instruction. `<+96>: callq 0x84e20 ` has a return address of `+101` or `+0x65`, not `+100`. i.e. the start of the `test` instruction in the disassembly. I don't have an answer for your actual question, and that matches the backtrace even less, but it looks like a mis-statement of that detail in the question unless I'm forgetting something about how backtraces normally match up with call addresses. – Peter Cordes Mar 19 '20 at 23:13
  • So, is it possible that `Perf` maps the address in a wrong binary? For example, because it didn't trace mappings (i.e., `mmap/unmap()`s, correctly. – TheAhmad Mar 20 '20 at 22:02
  • 1
    The mappings are correct. They match exactly with those of **ordinary** execution. Addresses match symbols (i.e., each **single** row is correct), but as shown above, some callers and callees do **not** match. Perhaps some functions in the middle are missed by `Perf`. Strangely, in all runs, the problem occurs at the **same** place. – TheAhmad Mar 22 '20 at 18:57
  • 1
    Optimizations like inlining are made at compile time; that would result in the runtime-visible part of the callchain definitely skipping some calls that appear in the source. It's not surprising that you get the same thing every execution, but it maybe does rule out some kinds of weirdness. Sorry IDK any more. – Peter Cordes Mar 22 '20 at 23:39
  • **Thanks!** But, e.g., `gtk_action_muxer_query_action+0x6c` can not be a call site of an **inline** function. Because that is a call into **PLT**. And inline functions are typically recognized in the `Perf` output. – TheAhmad Mar 23 '20 at 01:58
  • 1
    Optimized tailcalls can also affect the backtrace. `jmp func@plt` instead of `call func@plt; ret` makes the target look like it was called directly by *your* parent function. – Peter Cordes Mar 23 '20 at 02:19
  • Thanks, Peter Cordes, and I have one last question, not necessarily specific to this topic: **tail-call elimination** at **translation unit boundaries** seems to be a **link-time optimization**. AFAIK, link time optimizations are particularly rare. I'd like to know your opinion on this statement. – TheAhmad Mar 24 '20 at 00:36
  • 1
    I'm not talking about *elimination*; that sounds like laying out the functions head to tail so you can fall through from one to the other without even a `jmp`. Or just inlining. (And yes you'd have to do that at link time, like `gcc -flto`, which you should generally use). I'm just talking about normal *optimizing* of `call x`;`ret` into `jmp x`, which you can do regardless of where `x` is, whether it's a direct rel32 call to the function or to a PLT stub, or if it's a `call/jmp *printf@GOTPCREL(%rip)` (AT&T syntax for `-fno-plt` style of code-gen, using a function pointer from the GOT). – Peter Cordes Mar 24 '20 at 00:39
  • 1
    To be clear, when you want to trace / debug / profile, you might omit `-flto` to disable cross-file inlining. But if your codebase has some small helper functions in `.cpp` files (not `.h`), then letting them inline can open up important optimizations for loops that call them. – Peter Cordes Mar 24 '20 at 00:44
  • First of all, sorry for my lack of knowledge and this long discussion. So, you say that it is common. I do not understand the **first** part. If we are calling an external unknown library, how do we know that inlining will be **beneficial**? And the second part (using `jmp` instead of `call`) seems to be tail call elimiation, isn't it? – TheAhmad Mar 24 '20 at 01:21
  • 1
    Inlining is *not possible* for an external unknown library. (Unless it holds LLVM bytecode or GCC GIMPLE for LTO, instead of / as well as real machine code, in which case the compiler uses the same inlining heuristics as when optimizing within one TU). Re: elimination vs. optimization: yes I think that was a terminology mixup; I was trying to come up with something that would actually require LTO (such as inlining). Apparently optimizing call/ret into jmp is also called tail-call elimination, as a synonym for "optimization" ([wikipedia](https://en.wikipedia.org/wiki/Tail_call)). – Peter Cordes Mar 24 '20 at 01:29
  • And I promise that this is the final question! how can we perform tail-call elimination when the **callee** is an external library? And are you saying that we can directly use `GOT` contents without calling `PLT` entries? – TheAhmad Mar 24 '20 at 01:41
  • 1
    Yes, I said that explicitly in [a previous comment](https://stackoverflow.com/questions/60766026/wrong-perf-backtraces?noredirect=1#comment107613072_60766026) with examples. call/ret can *always* be replaced by jmp if there would be nothing between the call and the ret. Notice that PLT stubs themselves work by tailcalling the final function with jmp. e.g. single-step into one in GDB (with the `si` / `stepi` command. Use `layout reg` to see disassembly). – Peter Cordes Mar 24 '20 at 01:53
  • 2
    re: `gcc -fno-plt` style code-gen that inlines the indirect call instead of calling a PLT stub: see [How can I call printf normally in assembly without @PLT but just call printf with -l option in gcc with standard library,](https://stackoverflow.com/q/56404855) / [32-bit absolute addresses no longer allowed in x86-64 Linux?](https://stackoverflow.com/q/43367427) / https://web.archive.org/web/20171111043629/http://www.macieira.org/blog/2012/01/sorry-state-of-dynamic-libraries-on-linux/. This is orthogonal to backtraces or tailcall optimization. – Peter Cordes Mar 24 '20 at 01:54
  • 1
    Also [Can't call C standard library function on 64-bit Linux from assembly (yasm) code](https://stackoverflow.com/q/52126328) shows NASM syntax for it. – Peter Cordes Mar 24 '20 at 01:56
  • Peter Cordes, you answer was correct. The problems was caused by (not explicitly expressed by `Perf`) **inline functions** and (inter-library) **tail calls**. If you write an answer, I can accept it. Then I will edit your answer and add the details. Is it OK? – TheAhmad Apr 08 '20 at 20:56
  • I think my simplified example in my answer should be fine; probably doesn't need a specific example from your code beyond what's in the question. But yeah, thanks for confirming that was the answer. – Peter Cordes Apr 08 '20 at 21:15

1 Answers1

1

Inlining and optimized tail calls are 2 separate ways for calls in the C source to not appear in the runtime stack backtrace.

An inlined function will result in the calls to its children happening directly from the parent it inlined into, with this function not appearing at all ever.

Multiple levels of inlining are of course possible, and common with small functions.


A function that ends with an optimized jmp tailcall makes it look like the callee was called directly from its parent. Instead of creating a new stack frame below its own, it tears down its own stack frame and jumps to the target function, with the return address still there. So that new function's stack frame will use the same memory that was previously used by the function that made the tailcall.

To put it another way, any call func; ret sequence can be replaced by jmp func. This works even if the jmp target is a (plt stub for) a library function, or an indirect call/jmp *printf@GOTPCREL(%rip) target for gcc -fno-plt style codegen.

See also https://en.wikipedia.org/wiki/Tail_call

e.g.

void leaf();  // some other function
void foo() { stuff; ...;   leaf(); }    // ends with a tailcall
void bar() { stuff; foo(); stuff; }

An event that fires while foo is doing "stuff" will have a backtrace of bar->foo.

With no optimization, an event that fires in "leaf" would have a backtrace like bar->foo->leaf.

With an optimized tailcall, it would be bar->leaf, no foo because foo just jumped to leaf, forwarding its return address to leaf so when leaf eventually returns it will be directly to bar.

This works with args and return values, especially register args. Not always with stack args, for example it's impossible if leaf had more stack args than foo.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • 1
    Indeed, the `gmallocn()`s were inlines, but, unfortunately, `Perf` did not express it, directly (which is not consistent with `Perf` normal behavior). The second case (repeated `gtk_action_muxer_query_action()`s) was tail call elimination. In other words, the the last operation in the callee at `gtk_action_muxer_query_action+0x6c` is a jump to the beginning of `gtk_action_muxer_query_action()`. – TheAhmad Apr 10 '20 at 18:58
  • I have one more question. In the first case, `__GI___libc_malloc+0x197` is reported as inline. While its address (i.e., `0x7ffff4b07207`) does not match that of its parent function call point (i.e., `gmalloc+0x59` which is `0x7fffd9872fb9`). Is this logical? And, again, there are many such instances in the `Perf` output. – TheAhmad Apr 10 '20 at 19:06
  • @TheAhmad: IDK, it was a while since I looked at your question. But IIRC there were some things like that in the backtrace where the listed call site didn't seem to match the disassembly. IDK what's going on with those cases. – Peter Cordes Apr 10 '20 at 19:08
  • Sure. But based on your answer, they were logical. Because, in the first case they were inlines (in fact, this is consistent with the way that `GDB` prints inlines in the backtraces) and in the second case the missing functions were missed because of tail-call elimination (and this is also consistent with `GDB`). But this (**third**) case is not understandable (even if it is consistent with `GDB`, which I did not check). I mean inlines should have the **same** addresses. – TheAhmad Apr 10 '20 at 19:28
  • 1
    @TheAhmad: I already knew about tailcalls and inlining when looking at your original question and commenting. IIRC there seemed to be calls that apparently came from somewhere that isn't a call instruction, so you wouldn't expect a return address to be pointing there. Yeah, [my very first comment](https://stackoverflow.com/questions/60766026/wrong-perf-backtraces/61110066?noredirect=1#comment107512409_60766026) was about an off-by-one in a return address. – Peter Cordes Apr 10 '20 at 19:31
  • Well, I'm not sure if this is a problem. But my knowledge is not enough, here. AFAIK (which is not reliable at all), in all of my backtraces, the return address is a value strictly greater than the call address and strictly less than the next instruction (which seems to be the reasonable return point). – TheAhmad Apr 10 '20 at 19:43
  • 1
    @TheAhmad: `call` pushes the address of the next instruction as the return address. https://www.felixcloutier.com/x86/call. (RIP during execution of an instruction is the address of the next one, aka one-past-the-end of the call instruction itself.) `ret` is just a pop into RIP; it's an indirect jump to that address (normally one pushed by a `call`). Off-by-one is not normal. It should be exactly the next instruction because that's what should execute after a call returns. – Peter Cordes Apr 10 '20 at 19:48
  • I agree with you on this error. The problem is that this off-by-one is consistent in the whole backtrace. I don't want to take up your time. Thanks. – TheAhmad Apr 10 '20 at 20:00
  • 1
    I checked the **off-by-1** error in a newly installed Ubuntu 18.04` machine` (similar to my real machine) and the error is there, too. I checked this issue in a **16.04** guest OS with a **mainline linux kernel 4.x.x**. The output format is different and there is **no** off-by-1 error. Perhaps, there is a bug in newer versions of `Perf`. – TheAhmad Apr 11 '20 at 21:13
  • And I traced the code for the third case (i.e., reporting `__GI___libc_malloc+0x197` as `inlined`). Based on that, I think that there is a bug in `Perf` that incorrectly reports this called function as inlined. I mean the third case is a **False-Positive** bug while the first case (i.e., consecutive `gmallocn()`s) was a **False-Negative** one. – TheAhmad Apr 11 '20 at 21:22
  • 1
    @PeterCordes - FYI, perf can report inline functions in some modes, like DWARF unwinding, since the DWARF info can indicate something like "this IP corresponds to function a() inlined into b()", in which case perf can emit a() and b() in the stack, i.e., expand one physical stack frame into multiple displayed ones. – BeeOnRope Apr 19 '20 at 03:48