
I have been using YourKit 8.0 to profile a mathematically intensive application running under Mac OS X (10.5.7, Apple JDK 1.6.0_06-b06-57), and have noticed some strange behavior in the CPU profiling results.

For instance, I did a profiling run using sampling, which reported that 40% of the application's 10-minute runtime was spent in the StrictMath.atan method. I found this puzzling, but I took it at its word and spent a bit of time replacing atan with an extremely simple polynomial fit.

When I ran the application again, it took almost exactly the same time as before (10 minutes), but my atan replacement showed up nowhere in the profiling results. Instead, the runtime percentages of the other major hotspots simply increased to make up for it.
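The replacement was along these lines (a minimal sketch using a standard quadratic approximation, not my exact fit; the precise coefficients don't matter for this question):

```java
// Polynomial atan replacement, accurate to roughly 0.0015 rad.
public final class FastAtan {
    private FastAtan() {}

    public static double atan(double x) {
        if (Math.abs(x) > 1.0) {
            // Range reduction: atan(x) = sign(x) * pi/2 - atan(1/x)
            return Math.copySign(Math.PI / 2, x) - atanCore(1.0 / x);
        }
        return atanCore(x);
    }

    // Quadratic fit, valid for |x| <= 1
    private static double atanCore(double x) {
        double a = Math.abs(x);
        return x * (Math.PI / 4) - x * (a - 1) * (0.2447 + 0.0663 * a);
    }
}
```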

To summarize:

RESULTS WITH StrictMath.atan (native method)
Total runtime: 10 minutes
Method 1: 20%
Method 2: 20%
Method 3: 20%
StrictMath.atan: 40%

RESULTS WITH simplified, pure Java atan
Total runtime: 10 minutes
Method 1: 33%
Method 2: 33%
Method 3: 33%

(Methods 1, 2, and 3 do not perform any atan calls.)

Any idea what is up with this behavior? I got the same results using EJ-Technologies' JProfiler. It seems like the JDK profiling API reports inaccurate results for native methods, at least under OS X.

avalys

6 Answers


This can happen because of inconsistencies in when samples are taken. For example, if a method accounts for a fair amount of total time but each individual call executes very quickly, it is possible for the sampling to miss it. Also, I think garbage collection never happens during a sample, but if some code causes a lot of garbage collection it can contribute greatly to a slowdown without appearing in the samples.

In similar situations I've found it very helpful to profile twice, once with tracing and once with sampling. If a method appears in both, it is probably using a lot of CPU; otherwise it could well just be an artifact of the sampling process.
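To see why a sampler can miss a method entirely, here is a minimal hand-rolled sketch of what interval sampling does (an illustration only, not how YourKit actually works): if the worker thread's time inside a method rarely coincides with the 10 ms ticks, that method is undercounted.

```java
import java.util.HashMap;
import java.util.Map;

// Naive interval sampler: snapshot one thread's stack every 10 ms and
// attribute the sample to the topmost frame, like a flat profile.
public class NaiveSampler {
    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(NaiveSampler::work, "worker");
        worker.start();

        Map<String, Integer> hits = new HashMap<>();
        for (int i = 0; i < 500; i++) {            // ~5 seconds of sampling
            StackTraceElement[] stack = worker.getStackTrace();
            if (stack.length > 0) {
                String top = stack[0].getClassName() + "." + stack[0].getMethodName();
                hits.merge(top, 1, Integer::sum);
            }
            Thread.sleep(10);                      // the 10 ms sample interval
        }
        worker.interrupt();
        hits.forEach((method, count) -> System.out.printf("%6d  %s%n", count, method));
    }

    static void work() {
        double acc = 0;
        while (!Thread.currentThread().isInterrupted()) {
            acc += StrictMath.atan(acc % 1.0);     // the native call under test
        }
    }
}
```

Short methods whose executions happen to fall between ticks are systematically under-represented, which is exactly the aliasing problem described above.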

Nick Fortescue

I find YourKit greatly exaggerates the cost of calling sub-methods (due to its logging method, I assume). If you only follow the advice the profiler gives you, you'll end up just merging functions with no real gain, since HotSpot usually does this excellently on its own.

Therefore, I'd highly advise testing batches of changes completely outside the profiler too, to get a better idea of whether they are really beneficial (it may seem obvious, but this cost me some development time).

Pool

Since you're using a Mac, you might try Apple's Shark profiler (a free download from ADC); it has Java support, and Apple's performance group has put a fair amount of time into the tool.

As Nick pointed out, sampling can be misleading if the sample interval is close to the function's execution time, so that the profiler rarely checks while the function is actually executing. I don't know whether YourKit supports this, but in Shark you can change the sampling interval to something other than the default 10 ms and see whether the results are substantially different. There's also a separate call-tracing mode which records every function entry and return; this completely avoids the possibility of aliasing errors, but it collects a ton of data and has higher overhead, which could matter if your app is doing any sort of real-time processing.

Chris Adams

You may want to look at the parameters that are being passed into the three methods. It may be that the time is being spent generating return values or in methods that are creating a lot of temporary objects.
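For example (a hypothetical illustration, not code from the question), the cost can hide in how values cross the call boundary rather than in the arithmetic itself:

```java
// Hypothetical: boxed parameters and return value mean every call
// allocates short-lived Double objects, which shows up as GC pressure
// rather than as time inside the method itself.
static Double angle(Double y, Double x) {
    return StrictMath.atan(y / x);   // auto-unbox the args, box the result
}
```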

dhable

It's worth noting that Java methods can be inlined if they are small enough, while native methods are inlined under different rules. If a method is inlined, it doesn't appear in the profiler (certainly not in YourKit, anyway).
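If you want to check what HotSpot is actually inlining, its diagnostic flags will print the inlining decisions (YourApp stands in for your main class; the output format varies between JVM versions):

```
java -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining YourApp
```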

Peter Lawrey

Profilers can be like that.

This is the method I use. Works every time. And this is why.
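On the JVM, one way to take the kind of manual stack dumps mentioned in the comments below is the JDK's jstack tool (a sketch of the manual process, assuming a JDK that ships jstack):

```
# Take a full thread dump at a random moment while the program is slow;
# repeat 5-10 times, then look for calls that appear on several stacks.
jstack <pid>
```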

Mike Dunlavey
  • Like to explain why this leads to different results than those from a sampling profiler which automates this process? – ziggystar Jun 12 '13 at 15:26
  • @ziggystar: Some sampling profilers automate half of the process, namely the ones that sample the whole stack, on wall-clock (not CPU) time. The half of the process they don't automate is the discovery of performance problems you can fix. If a statement or function has small inclusive %, then the problem is elsewhere, but that doesn't narrow down where the problem is very much. Simply knowing that a line of code, function, or "path" is "hot" does not tell you much. It's more revealing if you can examine specific representative samples, in their entirety. ... – Mike Dunlavey Jun 13 '13 at 17:09
  • @ziggystar: ... The different result this leads to is that you find speedups you would not find with even a very good profiler. When you fix those and repeat the process, problems that were small before are now a larger % because the program takes less time. So, one by one, you fix these, and the compounded speedup can be surprising. How much speedup did you ever get with a profiler? If the answer is more than 40% it would surprise me. – Mike Dunlavey Jun 13 '13 at 17:13
  • More than 10x, using jvisualvm. Not once, but consistently (I'm usually optimizing a high level language: Scala). Just today I've optimized one of my programs from 10x slower than an optimized C++ library to less than 4x slower. My code was already rather optimized, but the C++ performance is in reach. And about the rest: I've run into the same problem mentioned here with a sampling profiler (jvisualvm). And I've also done manual stack dumps. And they also favor the native method. It's a problem of the JVM stack dumping, I suppose. – ziggystar Jun 13 '13 at 19:59
  • @ziggystar: 10x is pretty good. [*Here's 43x*](http://stackoverflow.com/a/927773/23771), and [*here's 730x*](http://sourceforge.net/projects/randompausedemo/files/). The difference is not in the getting of the samples but in what's done with them once they are gotten. Then it turns out the need for a large number of samples is not based on anything. The statistical justification is discussed [*here*](http://scicomp.stackexchange.com/a/2719/1262). – Mike Dunlavey Jun 13 '13 at 22:13