10

I am trying to do some Node.js profiling using Linux perf_events, as described by Brendan Gregg here.

The workflow is as follows:

  1. Run node (>= 0.11.13) with --perf-basic-prof, which creates a /tmp/perf-(PID).map file to which the JavaScript symbol mappings are written.
  2. Capture stacks using perf record -F 99 -p `pgrep -n node` -g -- sleep 30
  3. Fold the stacks using the stackcollapse-perf.pl script from this repository
  4. Generate the SVG flame graph using the flamegraph.pl script (the full command sequence is sketched below)
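
For reference, here is a minimal sketch of the whole sequence as a shell session (the ~/FlameGraph checkout location and the app.js entry point are assumptions):

```
# Sketch of the full workflow; ~/FlameGraph (a clone of Brendan Gregg's
# FlameGraph repository) and app.js are assumed names.

# 1. Run node with the symbol map enabled; writes /tmp/perf-<PID>.map
node --perf-basic-prof app.js &

# 2. Capture stacks at 99 Hz for 30 seconds
perf record -F 99 -p $(pgrep -n node) -g -- sleep 30

# 3. Dump and fold the stacks
perf script > out.perf
~/FlameGraph/stackcollapse-perf.pl out.perf > out.folded

# 4. Generate the SVG flame graph
~/FlameGraph/flamegraph.pl out.folded > flame.svg
```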

I get the following result (which looks really nice at first glance): [flame graph SVG]

The problem is that there are a lot of [unknown] elements, which I suppose should be my Node.js function calls. I assume the whole process fails somewhere at step 3, where the perf data should be folded using the mappings generated by node/v8 run with --perf-basic-prof. The /tmp/perf-PID.map file is created and some mappings are written to it during node execution.
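
A quick sanity check that the map file is actually being populated (a sketch; it assumes a single node process):

```
# Each line of the map file should look like "<hex address> <hex size> <symbol name>".
ls -l /tmp/perf-$(pgrep -n node).map
head /tmp/perf-$(pgrep -n node).map
```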

How can I solve this problem?

I am using CentOS 6.5 x64, and have already tried this with node 0.11.13 and 0.11.14 (both prebuilt and compiled from source) with no success.

Kamil Z

2 Answers

20

First of all, what "[unknown]" means is that the sampler couldn't figure out the name of the function, because it's a system or library function. If so, that's OK - you don't care, because you're looking for things responsible for time in your code, not system code.

Actually, I'm suggesting this is one of those XY questions. Even if you get a direct answer to what you asked, it is likely to be of little use. Here are the reasons why:

1. CPU profiling is of little use in an I/O-bound program

The two towers on the left in your flame graph are doing I/O, so they probably take a lot more wall-time than the big pile on the right. If this flame graph were derived from wall-time samples, rather than CPU-time samples, it could look more like the second graph below, which tells you where time actually goes:

[flame graph rendered from wall-time samples]

What was a big juicy-looking pile on the right has shrunk, so it is nowhere near as significant. On the other hand, the I/O towers are very wide. Any one of those wide orange stripes, if it's in your code, represents a chance to save a lot of time, if some of the I/O could be avoided.

2. Whether the program is CPU- or I/O-bound, speedup opportunities can easily hide from flame graphs

Suppose there is some function Foo that really is doing something wasteful which, if you knew about it, you could fix. Suppose in the flame graph it is a dark red color. Suppose it is called from numerous places in the code, so it's not all collected in one spot in the flame graph. Rather, it appears in multiple small places, shown here by black outlines:

[flame graph with the call sites of Foo outlined in black]

Notice that if all those rectangles were collected together, you could see that Foo accounts for 11% of the time, meaning it is worth looking at. If you could cut its time in half, you would save 5.5% overall. If what it's doing could actually be avoided entirely, you would save 11% overall. Each of those little rectangles would shrink down to nothing and pull the rest of the graph, to its right, along with it.

Now I'll show you the method I use. I take a moderate number of random stack samples and examine each one for routines that might be sped up. That corresponds to taking samples in the flame graph like so:

[flame graph with twenty slender vertical lines marking random-time samples]

The slender vertical lines represent twenty random-time stack samples. As you can see, three of them are marked with an X. Those are the ones that go through Foo. That's about the right number, because 11% times 20 is 2.2.

(Confused? OK, here's a little probability for you. If you flip a coin 20 times, and it has an 11% chance of coming up heads, how many heads would you get? Technically it's a binomial distribution. The most likely number you would get is 2, and the next most likely numbers are 1 and 3. (If you only get 1, you keep going until you get 2.) Here's the distribution:)

[chart of the binomial distribution]

(The average number of samples you have to take to see Foo twice is 2/0.11 = 18.2 samples.)
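
(Those numbers come straight from the binomial probability mass function; a quick check of the arithmetic, nothing more:)

```
P(k) = \binom{20}{k} (0.11)^k (0.89)^{20-k}
P(2) \approx 0.28, \quad P(1) \approx 0.24, \quad P(3) \approx 0.21, \quad \text{mean} = 20 \times 0.11 = 2.2
```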

Looking at those 20 samples might seem a bit daunting, because they run between 20 and 50 levels deep. However, you can basically ignore all the code that isn't yours. Just examine them for your code. You'll see precisely how you are spending time, and you'll have a very rough measurement of how much. Deep stacks are both bad news and good news - they mean the code may well have lots of room for speedups, and they show you what those are.
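
If you want to try this by hand on the node process, one way to grab a single random-time stack sample (a sketch, not necessarily the exact tooling assumed here; note that, as the comments below point out, gdb will not resolve the JIT-compiled JavaScript frame names either) is to attach gdb and dump all thread stacks:

```
# Grab one random-time stack sample of the running node process
# (the pgrep-based PID lookup is an assumption); repeat ~20 times
# at random moments and read each backtrace by hand.
gdb -p "$(pgrep -n node)" -batch -ex 'thread apply all bt'
```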

Anything you see that you could speed up, if you see it on more than one sample, will give you a healthy speedup, guaranteed. The reason you need to see it on more than one sample is that if you only see it on one sample, all you know is that its time isn't zero. If you see it on more than one sample, you still don't know how much time it takes, but you do know it's not small. Here are the statistics.

Mike Dunlavey
  • Kamil Z may try off-cpu profiling, if the task is I/O bound, Gregg just published the scripts - http://stackoverflow.com/a/28784580/196561 Also, taking "random stack samples" (for example with gdb, or even with `perf record -g`, then `perf script` to grab raw stack samples, then taking several random stacks for manual examination) may not help here because 1) it may be hard to attach to short-lived program (Tracing with LTTng may help there... ) 2) gdb (or perf record) still will not report symbol names, still the Y problem asked by Kamil. Internal node.js/v8 debugger may help to resolve them. – osgx Mar 03 '15 at 05:16
  • @osgx: With manual sampling, you don't have to care if it's I/O bound or not, it works the same. You don't have to care if a routine is short-lived - if it is called enough times to use a significant fraction of time, samples will hit it. (If the entire program finishes too fast to sample, what I do is add a temporary outer loop.) I am assuming one is in the position of a programmer, with source code and a debugger. For other situations I'm not offering advice. – Mike Dunlavey Mar 03 '15 at 13:52
  • @MikeDunlavey The first point is not clear to me. Doesn't this problem happen only when the thread being profiled is the same thread that is generating the samples? In other words, the reason that the first graph doesn't show the I/O wait time is that it's waiting and couldn't have recorded any samples. On the other hand, if another thread is generating the samples, it would have captured all the wait and processing time of the target thread and the second graph would be generated. – Hadi Brais Nov 24 '15 at 10:24
  • @Hadi: What you're saying makes perfect sense (although the sampling thread spends almost all of its time blocked until time to take the next sample.) However, profiler makers are still confused about this. They think they have to make a distinction between on-CPU and off-CPU time, so when it's time to take a sample, they only take the sample if the thread being sampled is on-CPU or off-CPU. I suspect this confusion arises because they see their primary mission as accurate measurement, rather than the finding of speedups. – Mike Dunlavey Nov 24 '15 at 13:54
  • @MikeDunlavey "what "[unknown]" means is the sampler couldn't figure out the name of the function, because it's a system or library function." There is another reason, which is that perf could not find the symbol because the program(or managed runtime) generated the code for it and the program didn't provide a perf map file, and possibly that perf could not see the map file. See my answer. – ceeaspb Sep 23 '16 at 14:44
  • @MikeDunlavey Apart from that, great answer. – ceeaspb Sep 23 '16 at 14:53
  • @ceeaspb: You're right, of course. I'm just accustomed to people asking [*XY questions*](http://xyproblem.info/). I.e. they really want to get maximum performance, but they don't say so - they ask about peripheral issues. – Mike Dunlavey Sep 23 '16 at 15:34
  • The `1. CPU Profiling is of little use in an I/O bound program` can be solved by off-cpu profiling, right? – skytree Mar 23 '19 at 22:17
  • @skytree: Then you have to try profiling two ways, on-cpu and off-cpu. If you profile by wall-clock time you don't have to distinguish between the two. – Mike Dunlavey Mar 24 '19 at 03:14
11

Generally speaking, it is a bad idea to disagree with a subject-matter expert, but (with the greatest respect) here we go!

SO urges answerers to do the following:

"Please be sure to answer the question. Provide details and share your research!"

So the question, at least as I interpret it, is: why are there [unknown] frames in the perf script output (and how do I turn these [unknown] frames into meaningful names)? The question could be about "how do I improve the performance of my system?", but I don't see it that way in this particular case. There is a genuine problem here with how the perf record data has been post-processed.

The answer to the question is that, although the prerequisite setup is correct (the correct node version, and the correct argument --perf-basic-prof present to generate the function names), the generated perf map file must be owned by root for perf script to produce the expected output.

That's it!
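
In shell terms the fix is just an ownership change on the map file before post-processing (a sketch; the pgrep-based PID lookup is an assumption):

```
# perf script, run as root, ignores a /tmp/perf-<PID>.map file that is
# not owned by root, so hand the map file over before post-processing.
sudo chown root:root /tmp/perf-$(pgrep -n node).map
sudo perf script > out.perf   # symbol names should now resolve
```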

Writing some new scripts today, I hit upon this problem, which directed me to this SO question.

Here are a couple of additional references:

https://yunong.io/2015/11/23/generating-node-js-flame-graphs/

https://github.com/jrudolph/perf-map-agent/blob/d8bb58676d3d15eeaaf3ab3f201067e321c77560/bin/create-java-perf-map.sh#L22

[ non-root files can sometimes be forced ] http://www.spinics.net/lists/linux-perf-users/msg02588.html

ceeaspb