How to properly write a SIGPROF handler that invokes AsyncGetCallTrace?

Question

I am writing a short and simple profiler (in C), which is intended to print out stack traces for threads in various Java clients at regular intervals. I have to use the undocumented function AsyncGetCallTrace instead of GetStackTrace to minimize intrusion and allow for stack traces regardless of thread state. The source code for the function can be found here: http://download.java.net/openjdk/jdk6/promoted/b20/openjdk-6-src-b20-21_jun_2010.tar.gz in hotspot/src/share/vm/prims/forte.cpp. I found some man pages documenting JVMTI, signal handling, and timing, as well as a blog with details on how to set up the AsyncGetCallTrace call: http://jeremymanson.blogspot.com/2007/05/profiling-with-jvmtijvmpi-sigprof-and.html

What this blog is missing is the code to actually invoke the function within the signal handler (the author assumes the reader can do this on his/her own). I am asking for help in doing exactly this. I am not sure how and where to create the struct ASGCT_CallTrace (and the internal struct ASGCT_CallFrame), as defined in the aforementioned file forte.cpp. The struct ASGCT_CallTrace is one of the parameters passed to AsyncGetCallTrace, so I do need to create it, but I don't know how to obtain the correct values for its fields: JNIEnv *env_id, jint num_frames, and JVMPI_CallFrame *frames. Furthermore, I do not know what the third parameter passed to AsyncGetCallTrace (void* ucontext) is supposed to be?

The above problem is the main one I am having. However, other issues I am faced with include:

SIGPROF doesn't seem to be raised by the timer exactly at the specified intervals, but rather a bit less frequently. That is, if I set the timer to send a SIGPROF every second (1 sec, 0 usec), then in a 5 second run, I am getting fewer than 5 SIGPROF handler outputs (usually 1-3)
SIGPROF handler outputs do not appear at all during a Thread.sleep in the Java code. So, if a SIGPROF is to be sent every second, and I have Thread.sleep(5000);, I will not get any handler outputs during the execution of that code.

Any help would be appreciated. Additional details (as well as parts of code and sample outputs) will be posted upon request.

I can't help, but bravo for what you're trying to do! Also, you want it to be minimally intrusive, but I think it doesn't really need to be because not a lot of samples are needed. You should make sure it can take samples while waiting on I/O. Here's a general flame on that subject: http://stackoverflow.com/questions/1777556/alternatives-to-gprof/1779343#1779343 — Mike Dunlavey, Aug 06 '10 at 19:58
In fact the way I did this years ago in C was to use SIGALRM. You need something that triggers whether or not the thread(s) are blocked. And, if it's not particularly accurate, that doesn't matter at all. What matters is that the program have no influence on when the sample is taken. — Mike Dunlavey, Aug 06 '10 at 20:16
@Mike Dunlavey: Thanks for the info, but there may be a few problems with the suggested approach: 1. I would like to automate and generalize this profiling process. That is, my JVMTI code could be run with ANY Java client and would simply require the mention of one jvmarg (the .so file). Doing manual breaks every time would be time consuming and less systematic. — user400348, Aug 06 '10 at 21:06
2. My goal's to fine-tune programs designed to supply 100s of 1000s of transactions/sec to servers. Despite what the post in your link says, I am thinking that for these programs a large sample size and a more granular and consistent stack trace polling is preferable. Please correct me if you believe this argument is incorrect. 3. Regarding the SIGALRM approach: I am not sure accuracy is something I would want to neglect in my case, but I can look more into that. Nevertheless, I am still looking for a solution using AsyncGetCallTrace. So if you hear anything else, please do tell. Thanks again. — user400348, Aug 06 '10 at 21:07
There are profilers, RotateRight/Zoom and LTProf, that if you do what they do I think you're on the right track. They sample the call stack on wall-clock (not cpu) time. They report, per *line* (not function) the percent of stack samples containing that line. The fraction of time a line is on the stack is precisely how much wall-time would be saved if the line could be avoided or removed. Since the goal is to *find* the expensive lines you don't need, they are pinpointed by stack samples. The precision with which they are *measured* is a matter of taste, but *finding* is the goal. — Mike Dunlavey, Aug 06 '10 at 21:40
Here's how tuning goes, in my experience. You simulate a flat-out workload, so the program is running 100% of the time. Your sampling says that, say, 50% (give or take) of the time goes into problem A that when you see it, you think you can fix it. You fix that, and now you can do twice as many tps. You do it again, and see another issue B that also takes 50% (give or take) and get another factor of speedup. After a while, you get down to small stuff (fine tuning), and then you're close to optimal. That's the kind of tuning I do. The advantage of manual sampling is better insight. — Mike Dunlavey, Aug 06 '10 at 21:48
Not to be too technical about it, but if 10 random stack samples are taken, and a given line of code appears on 5 of them, the fraction of time F that that line is on the stack (and would therefore save if removed) follows a bell-curve (actually a beta distribution) with a maximum at 50%, but most likely in a range from 30% to 70%. So what it would save you is imprecisely known but is not small, while its location is very precisely known. Of course, multiply the number of samples by 100, and the measurement will be 10 times better. — Mike Dunlavey, Aug 06 '10 at 22:08

score 1 · Accepted Answer · answered Aug 10 '10 at 19:53

I finally got a positive result, but since little discussion was spawned here, my own answer will be brief.

The ASGCT_CallTrace structure (and the underlying ASGCT_CallFrame array) can simply be declared in the signal handler, thus existing only the stack: ASGCT_CallTrace trace; JNIEnv *env; global_VM_pointer->AttachCurrentThread((void **) &env, NULL); trace.env_id = env; trace.num_frames = 0; ASGCT_CallFrame storage[25]; trace.frames = storage;

The following gets the uContext: ucontext_t uContext; getcontext(&uContext);

And then the call is just: AsyncGetCallTrace(&trace, 25, &uContext);

I am sure there are some other nuances that I had to take care of in the process, but I did not really document them. I am not sure I can disclose the full current code I have, which successfully asynchronously requests for and obtains stack traces of any java program at fixed intervals. But if someone is interested in or stuck on the same problem, I am now able to help (I think).

On the other two issues: [1] If a thread is sleeping and a SIGPROF is generated, the thread handles that signal only after waking up. This is normal, since it is the thread's job to handle the signal. [2] The timer imperfections do not seem to appear anymore. Perhaps I mis-measured.

How to properly write a SIGPROF handler that invokes AsyncGetCallTrace?

1 Answers1