Why does Java CPU profile (using visualvm) show so many hits on a method that does nothing?

Question

This is something I think I've seen before with other profiling tools in other environments, but it's particularly dramatic in this case.

I'm taking a CPU profile of a task that runs for about 12 minutes, and it's showing almost half the time spent in a method that literally does nothing: it's got an empty body. What can cause this? I don't believe that the method is being called a ridiculous number of times, certainly not to account for half the execution time.

For what it's worth, the method in question is called startContent() and it's used to notify a parsing event. The event is passed down a chain of filters (perhaps a dozen of them), and the startContent() method on each filter does almost nothing except to call startContent() on the next filter in the chain.

This is pure Java code, and I'm running it on a Mac.

Attached is a screen shot of the CPU sampler output:

CPU sampler screenshot

and here is a sample showing the call stack:

CPU sampler sample call stack

(After a delay due to vacation) Here are a couple of pictures showing the output from the profiler. These figures are much more what I would expect the profile to look like. The profiler output seems entirely meaningful, while the sampler output is spurious.

CPU profiler output 1 CPU profiler output 2

As some of you will have guessed, the job in question is a run of the Saxon XML schema validator (on a 9Gb input file). The profile shows about half the time being spent validating element content against simple types (which happens during endElement processing) and about half being spent testing key constraints for uniqueness; the two profiler views show highlight the activity involved in these two aspects of the task.

I'm not able to supply the data as it comes from a client.

Can you switch to sampling the call stack instead so the overhead in the empty method is much smaller? — Thorbjørn Ravn Andersen, May 24 '14 at 23:03
Just a thought... not sure if it's related, but are you sure the empty method in the class you are looking at is the correct version? Could polymorphism be over-riding the empty method with a more complex method? — Brandon, May 24 '14 at 23:05
Is that inclusive or exclusive time? Does the program spend any blocked time (IO, sleep, lock wait, ...)? As one old-timer to another, I hesitate to tell you about profiling, but you have enough rep to see [*this*](http://stackoverflow.com/a/1779343/23771), which may give some transparency into the subject. — Mike Dunlavey, May 25 '14 at 12:52
@Brandon - it's a highly polymorphic method, but the profiler output is clearly distinguishing different overrides of the method — Michael Kay, May 25 '14 at 22:44
@Mike Dunlavey - It's inclusive time, but since it's a method that doesn't invoke any others, there's not really any difference. The process does have some I/O time, but nowhere near this (alleged) hotspot. I don't think there is any synchronisation blocking. — Michael Kay, May 25 '14 at 22:46
@MichaelKay: You're probably aware of [*this technique*](http://stackoverflow.com/a/378024/23771) that many others and I use. There are so many questions on SO where somebody uses a profiler and can't figure out what it says. Of course it still gets a lot of skepticism (which is good), but some of the objections are answered [*here*.](http://scicomp.stackexchange.com/a/2719/1262) — Mike Dunlavey, May 25 '14 at 23:06
Can you, please, post somewhere your CPU snapshot? I would like to examine it. — Tomas Hurka, May 26 '14 at 09:06
@Mike Dunlavey - interesting, the only time I hit pause at random in the debugger is when I'm looking for infinite loops, it never really occurred to me to try and get any statistical information that way. I've just always assumed that using a profiler would always be much better. — Michael Kay, May 26 '14 at 09:13
The screenshot of the flat profiler data is not enough for me. Can I get that 'snapshot 10:44:00'? — Tomas Hurka, May 26 '14 at 10:47
One more thing - the screenshot shows that you are not using profiler, but sampler. So there is no instrumentation and everything mentioned about the instrumentation overhead in the answer section below is invalid. — Tomas Hurka, May 26 '14 at 10:56
@Tomas Hurka, thanks. Sorry if my terminology is incorrect, it suggests I don't fully understand what these different charts are measuring. I've attached another screen shot showing a snapshot from the sampler. — Michael Kay, May 26 '14 at 11:32
Thanks for the screenshot, it provides more information than a flat profile from the first screenshot. Currently it looks to me like Sink.startContent() is really called ridiculous number of times. Are you sure that this is not this case? Do you have the steps how to reproduce your results, so I can try it myself? — Tomas Hurka, May 26 '14 at 12:02

score 8 · Answer 1 · answered May 24 '14 at 22:59

I have not used VisualVM, but I suspect the problem is likely because of the instrumentation overhead on such an empty method. Here's the relevant passage in JProfiler's documentation (which I have used extensively):

If the method call recording type is set to Dynamic instrumentation, all methods of profiled classes are instrumented. This creates some overhead which is significant for methods that have very short execution times. If such methods are called very frequently, the measured time of those method will be far to high. Also, due to the instrumentation, the hot spot compiler might be prevented from optimizing them. In extreme cases, such methods become the dominant hot spots although this is not true for an uninstrumented run. An example is the method of an XML parser that reads the next character. This method returns very quickly, but may be invoked millions of times in a short time span.

Basically, a profiler adds it's own "time length detection code", essentially, but in an empty method the profiler will spend all it's time doing that rather than actually allowing the method to run.

I recommend, if it's possible, to tell VisualVM to stop instrumenting that thread, if it supports such a filtering.

It's all one thread, so not instrumenting that thread would be rather pointless. I suspect that you're on the right lines that the time is actually being spent executing instrumentation code, though it raises the question as to why this method should show this effect and not others. However, T R Anderson's suggestion of sampling the call stack seems helpful, this seems to give results that seem reasonable. — Michael Kay, May 25 '14 at 21:58
Sampling the call stack is good; in JProfiler you can remove instrumentation *per method* not per thread. Don't know if you can in VisualVM. But yeah call stack sampling may be your best bet if it can't filter methods. — durron597, May 26 '14 at 01:35

score 2 · Answer 2 · edited Apr 13 '17 at 12:53

It is generally assumed that using a profiler is much better (for finding performance problems, as opposed to measuring things) than - anything else, really - certainly than the bone-simple way of random pausing.

This assumption is only common wisdom - it has no basis in theory or practice. There are numerous scholarly peer-reviewed papers about profiling, but none that I've read even address the point, let alone substantiate it. It's a blind spot in academia, not a big one, but it's there.

Now to your question -

In the screenshot showing the call stack, that is what's known as the "hot path", accounting for roughly 60% of in-thread CPU time. Assuming the code with "saxon" in the name is what you're interested in, it is this:

net.sf.saxon.event.ReceivingContentHandler.startElement
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.StartTagBuffer.startContent
net.sf.saxon.event.ProxyReceiver.startContent
com.saxonica.ee.validate.ValidationStack.startContent
com.saxonica.ee.validate.AttributeValidator.startContent
net.sf.saxon.event.TeeOutputter.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.ProxyReceiver.startContent
net.sf.saxon.event.Sink.startContent

First, this looks to me like it has to be doing I/O, or at least waiting for some other process to give it content. If so, you should be looking at wall-clock time, not CPU time.

Second, the problem(s) could be at any of those call sites where a function calls the one below. If any such call is not truly necessary and could be skipped or done less often, it will reduce time by a significant fraction. My suspicion is drawn to StartTagBuffer and to validate, but you know best.

There are other points I could make, but these are the major ones.

ADDED after your edit to the question. I tend to assume you are looking for ways to optimize the code, not just ways to get numbers for their own sake.

It still looks like only CPU time, not wall-clock time, because there is no I/O in the hot paths. Maybe that's OK in your case, but what it means is, of your 12-minute wall clock time, 11 minutes could be spent in I/O wait, with 1 minute in CPU. If so, you could maybe cut out 30 seconds of fat in the CPU part, and only shorten the time by 30 seconds. That's why I prefer sampling on wall-clock time, so I have overall perspective.

By looking at hot paths alone, you're not getting a true picture. For example, if the hot path says function F is on the hot path for, say 40% of the time, that only means F costs no less than 40%. It could be much more, because it could be on other paths that aren't so hot. So you could have a juicy opportunity to speed things up by a lot, but it doesn't get much exposure in the specific path that the profiler chose to show you, so you don't give it much attention. In fact, a big time-taker might not show up at all because on any specific hot path there's always something else a little bigger, like new, or because it goes by multiple names, such as templated collection class constructors.

It's not showing you any line-resolution information. If you want to inspect a supposedly high-cost routine for the reason for the cost, you have to look at the lines within it. There's a tendency when looking at a routine to say "It's just doing what it's supposed to do.", but if you are looking at a specific costly line of code, which most often is a method call, you can ask "Is it really necessary to do this call? Maybe I already have the information." It's far more specific in suggesting what you could fix.

Can it actually show you some raw stack samples? In my experience these are far more informative than any summary, like a hot path, that the profiler can present. The thing to do is examine the sample and come to a full understanding of what the program was doing, and the reason why, at that point in time. Then repeat for several more samples. You will see things that don't need to be done, that you can fix to get substantial speedup. (Unless the code is already optimal, in which case it will be nice to know.) The point is, you're looking for problems, not measurements. Statistically, it's very rough, but good enough, and no problem will escape.

Chris · Answer 3 · 2014-05-26T16:36:39.190

My guess is that the method Sink.startContent actually is called a ridiculous number of times.

Your screenshot shows the Sampling tab, which usually results in realistic timings if user over a long enoung interval. If you use Profiler tab instead, you will also get the invocation count. (You'll also get less realistic timings and your program will get very very slow, but you only need to do this for a few seconds to get a good idea about the invocation counts).

It's hard to predict what optimizations and especially inlining HotSpot performs, and the sampling profiler can only attribute the time of inlined methods to the call sites. I suspect that some of the invocation code in saxon might for some reason be attributed to your empty callback function. In that case, you're just suffering the cost of XML, and switching to a different parser might be the only option.

score 0 · Answer 4 · answered May 31 '14 at 07:40

0

I've had a lot of useful information and guidance from this thread, for which many thanks. However, I don't think the core question has been answered: why is the CPU sampling in VisualVM giving an absurdly high number of hits in a method that does nothing, and that isn't called any more often than many other methods?

For future investigations I will rely on the profiler rather than the sampler, now I have gained a bit of insight into how they differ.

From the profiler I haven't really gained a lot of new information about this specific task, in so far as it has largely confirmed what I would have guessed; but that itself is useful. It has told me that there's no magic bullet to speeding up this particular process, but has put bounds on what might be achieved by some serious redesign, e.g a possible future enhancement that appears to have some promise is generating a bytecode validator for each user-defined simple type in the schema.

answered May 31 '14 at 07:40

Michael Kay

156,231
11
92
164

I believe the problem here is that we cannot see how the _real_ method bodies look after the byte code transformations being done by the profiler - but it would be a good guess that they are not empty anymore. My gut feeling is that you are seeing one or more of a Hotspot JIT artifact (method inlining), CPU cache misses, memory page misalignment, which may or may not show in any subsequent run on this or other hosts. As it is a stand alone version of the Netbeans profiler, you might get better results if running inside Netbeans? – Thorbjørn Ravn Andersen Jul 15 '14 at 14:44
Call count is not time cost, and regarding the profiler, when you say "It told me there's no magic bullet to speeding up this particular process" you're saying roughly what I hear on this site from many profiler users - and that's my beef. Profilers *hide* speed ups, and there's a subtext: if the profiler didn't find much, the code must be close to optimal. Naturally people fall for it. Well, I've never seen any good-sized code that couldn't be squeezed, by a lot. [*Example.*](http://stackoverflow.com/a/927773/23771) [*730x*](http://scicomp.stackexchange.com/a/1870/1262) – Mike Dunlavey Jul 24 '14 at 20:47

Why does Java CPU profile (using visualvm) show so many hits on a method that does nothing?

4 Answers4