
I'm developing a plugin for a third-party host application on OS X, using C++. It is compiled as a .dylib. I wish to profile my plugin as it runs inside the host application.

Unfortunately the host calls the plugin code at a rate that varies depending on the plugin's (last) execution time. This means that the process's overall CPU time can vary considerably relative to real time. So with a sampling profiler the 'time spent' within the plugin isn't really grounded in anything useful, as it's measured only relative to other stack samples that fall within the process. If I improve the performance of the plugin, the pattern of the host's calls into the plugin will change accordingly, and it will be very difficult to measure the improvement.

I am able to use Instruments, but as far as I can tell it only gives me time relative to the process's CPU time.

I've used dtrace to obtain a user stack histogram of the host process:

#!/usr/sbin/dtrace -s

#pragma ustackframes 100
#pragma D option quiet

/* $1 is pid  */
/* $2 is sample rate in Hz (e.g. 100)  */
/* $3 is duration (e.g. '20s')  */

profile-$2
/pid == $1 && arg1/
{
  @[ustack()] = count();
}

tick-$3
{
  exit(0);
}

This works, but it still only provides samples relative to the process time, as the predicate is only matched when the process is executing in user-space. Even removing the && arg1 condition so that it also fires during the process's kernel calls doesn't really help.

What I really want to know is how many profile-n samples resulted in the process not running at all. Then I can compare the number within my plugin against the total number of samples, and get absolute sample values for my plugin's functions. This makes me wonder - is it safe to assume that the requested profile-n sample rate is honoured? Can I simply take time * sample rate and use that to calculate the 'off-process' time? I had assumed that at, say, 1500Hz, it was dropping samples and running at some other, unknown, rate, but if I can be sure it's sampling at 1500Hz then I can work out the 'off-process' time from that.
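If it is honoured, then I imagine a sketch like this (997 Hz and 20 s are just example values; $1 is the host's pid, and I'm assuming the work I care about runs on a single thread) would give me the absolute reference I need:

#!/usr/sbin/dtrace -s

#pragma D option quiet

/* $1 is pid */

/* every firing on one CPU: a wall-clock reference of (rate x duration) samples */
profile-997
/cpu == 0/
{
    @total = count();
}

/* firings that caught the process on-CPU in user code, on any CPU */
profile-997
/pid == $1 && arg1/
{
    @oncpu = count();
}

tick-20s
{
    printa("on-CPU samples: %@d\n", @oncpu);
    printa("total samples:  %@d\n", @total);
    exit(0);
}

Comparing a function's sample count against @total would then give a fraction of wall-clock time rather than of the process's CPU time, with everything outside @oncpu being the 'off-process' time.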

Alternatively, is there a known way to do wall-clock profiling with dtrace?

davidA
  • I would ask [*Brendan Gregg*](http://dtrace.org/blogs/brendan/2012/12/13/usenix-lisa-2012-performance-analysis-methodology/). He's an expert on DTrace, and he shows how to do what he calls Stack Profiling, similar to [*Random Pausing*](http://stackoverflow.com/a/378024/23771), except he shows how to do it "on-CPU" and "off-CPU", but not wall-clock time. If anybody knows how, he should. I would differ from him in how to analyze the stack samples. He gets into flame graphs, which I regard as eye-candy. IMHO one should carefully examine each of several samples, as if it were pointing to a bug. – Mike Dunlavey Feb 01 '14 at 14:28

1 Answer


> This makes me wonder - is it safe to assume that the requested profile-n sample rate is honoured?

On Solaris, it's not guaranteed to be honoured: some old hardware lacks the necessary support for arbitrary-resolution timer-based interrupts. I would assume that the same theoretical limitation applies to OS X's DTrace.

In any case, you can test the timer resolution for yourself. The documentation for the profile provider includes an appropriate script and has a bit more on the topic. Here's another script to address your specific question:

bash-3.2# cat test.d
uint64_t last;

profile-1500
/cpu == 0/                  /* watch a single CPU so the intervals are meaningful */
{
    now = timestamp;
    /* histogram of the interval between firings, in ns (1/1500 s is ~666,667 ns) */
    @ = lquantize(now - last, 500000, 800000, 30000);
    last = now;
}

tick-1
/i++ == 10/                 /* stop after roughly 10 seconds of sampling */
{
    exit(0);
}
bash-3.2# dtrace -qs test.d


           value  ------------- Distribution ------------- count    
          560000 |                                         0        
          590000 |@@@                                      1041     
          620000 |@@@@@@@@@@                               4288     
          650000 |@@@@@@@@@@@@@@                           5680     
          680000 |@@@@@@@@@@                               3999     
          710000 |@@@@                                     1451     
          740000 |                                         0        
          770000 |                                         0        
       >= 800000 |                                         1        

bash-3.2# 

Note that, in practice, you should sample at a frequency that's a prime number: this prevents you from synchronising with other, regularly scheduled, system activity.

Following the discussion in the comments, here's how you can measure the elapsed time spent inside a given function:

pid$target:mylib:myfunc:entry
/!self->depth/
{
    self->depth = ustackdepth;      /* beware recursion */
    self->start_time = timestamp;   /* for relative wall time calculations */
    self->start_vtime = vtimestamp; /* CPU time */      
}

pid$target:mylib:myfunc:return
/ustackdepth == self->depth/
{
    printf("%d ms (real) %d ms (CPU)\n",
        (timestamp - self->start_time) / 1000000,
        (vtimestamp - self->start_vtime) / 1000000);
    self->depth = 0;
}   

If the function is called at a high frequency then clearly you could maintain aggregations of the elapsed times, e.g. to calculate the average costs of the function.
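For example, a sketch reusing the entry clause above (mylib and myfunc are still placeholders for your library and function names) might replace the printf() with aggregations:

pid$target:mylib:myfunc:entry
/!self->depth/
{
    self->depth = ustackdepth;      /* beware recursion */
    self->start_time = timestamp;
    self->start_vtime = vtimestamp;
}

pid$target:mylib:myfunc:return
/ustackdepth == self->depth/
{
    @wall  = avg(timestamp - self->start_time);   /* mean wall-clock ns per call */
    @cpu   = avg(vtimestamp - self->start_vtime); /* mean CPU ns per call */
    @calls = count();
    self->depth = 0;
}

Any aggregations left unprinted are displayed automatically when the script exits, so pressing Ctrl-C after a representative run is enough to see the averages and the call count.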

It's entirely possible to perform a similar exercise for all of the functions in your library, although it can be quite an onerous task to eliminate spurious results arising from recursion and tail-call optimisation. To be more useful you would probably also want to exclude from a function's cost the time spent in the functions it calls; this makes it even harder work (but not impossible). Thus, armed with the means above to create an objective benchmark, I would be more inclined to persist with the profiling approach, probably with something like this:

# cat sample.d 
/* $1 and $2 bound the region of interest, e.g. a library's text segment */
profile-997
/pid == $target && arg1 >= $1 && arg1 < $2/
{
    @[ufunc(arg1)] = count();   /* attribute the sample to the enclosing function */
}

END
{
    trunc(@, 5);                /* keep only the five most frequently seen functions */
    exit(0);
}
#

This captures the five most-frequently seen functions within a given region of memory. For example (and using pmap on Solaris to locate libc),

# dtrace -qs sample.d -p `pgrep -n firefox` 0xfc090000 0xfc200000
^C

  libc.so.1`mutex_lock_impl                                        35
  libc.so.1`clear_lockbyte                                         46
  libc.so.1`gettimeofday                                           71
  libc.so.1`memset                                                 73
  libc.so.1`memcpy                                                170
# 

This turns out to be quite a good illustration of the benefit of sampling: memcpy() and memset() are hand-coded in assembly; that is, we find that the most time-consuming functions have already been optimised.

Robert Harris
  • Thanks, on OSX Mavericks that gives me 13950/16451 counts at 650000. Another run gives 16114/16451 - a very narrow spread. The total always seems to be around 16000, usually 16451 incidentally. Thanks for the tip on a prime number frequency. – davidA Feb 02 '14 at 22:41
  • That should give you some confidence in your sampling: you're profiling at about 1496 Hz instead of 1500 Hz with a narrow distribution of intervals. I wasn't really clear what you meant by the rest of your question. What exactly do you mean by wall-clock profiling? Do you want to measure the fraction of wall-clock time (as opposed to CPU-time) that a particular process spends within a given library? If so, are you interested in when functions from that library appear on the process's call stack, or when the process's program counter is in the library's text section? – Robert Harris Feb 03 '14 at 11:10
  • I suppose I'm looking for a real-time measurement of the library's performance. The (my) library code will be invoked at a fixed rate, but the host's calling environment may have variable CPU-time before and after each call, so comparing the sample count of the library vs the host isn't very useful. I'm also plotting with a FlameGraph, which is (currently) all relative. My goal really is to work out whether improvements in my library are actually real improvements. I find the relative measurements of Instruments (and dtrace as I'm currently using it) difficult to use in this particular case. – davidA Feb 03 '14 at 20:39
  • However, as you say I now have some confidence in my sampling, so I can use this to work out the absolute time (insofar as this sampling technique permits such a calculation) spent in each library function. – davidA Feb 03 '14 at 20:40
  • OK, so what is your definition of "real-time measurement of the library's performance"? The elapsed time (wall- or CPU-) for each invocation of the library? The time for each function? Something else? – Robert Harris Feb 03 '14 at 21:15
  • Fair question - the plugin (i.e. library) has a single entry-point that is called periodically, so I'm interested in two sets of things: 1. how much wall-time the CPU spends within the plugin - this gives me some idea of the "cost" of the plugin relative to other plugins; 2. within the plugin, which functions are the most expensive in terms of time spent. I'm trying to determine which functions of the plugin could benefit most from optimisation in an effort to reduce the total wall time the CPU spends within the plugin. – davidA Feb 03 '14 at 22:06
  • The relative measurements do help, but they are measured relative to something that I have no control over - the host's behaviour. So as that changes, I have no way of comparing trials unless I have a fixed reference to compare against (e.g. wall-time, or a known total sample count). – davidA Feb 03 '14 at 22:09