
I would like to measure the time spent in (or the number of executions of) very specific parts of a C program; they may be limited to a few instructions in some functions. One purpose is to track local performance improvements or regressions across several code revisions.

I know I can define macros for that purpose. But is there any tool that already does this in an even less intrusive way? Using annotations (#pragma) would be perfect:

void func_to_profile()
{
    /* Some instructions */
    ...


#pragma profile foo start
    /* A part of the code to track */
    ...
#pragma profile foo stop


    /* More instructions */
    ...


#pragma profile bar start
    /* Another part to measure */
    ...
#pragma profile bar stop
}

Ideally, at the end of the run the tool would display the accumulated elapsed time per subsection. For instance:

-- [foo] cumulated time: 42s
-- [bar] cumulated time: 7s

Is there an existing tool that already does this, or do I have no choice but to develop my own GCC plugin?

jyvet
  • What's the difference between writing `#pragma` and using a macro? Other than that the pragma has no chance of being portable, while macros can be? – Art Apr 14 '16 at 06:24
    There isn't a direct mapping between lines of C and optimized asm output. Forcing the compiler to do a certain part of the work between two barriers could lead to significantly worse code. Your best bet is going to be looking at CPU performance counters (e.g. linux `perf`) to find execution hotspots. On x86, even the lightest-weight timing instrumentation (CPUID) is ~20 cycles of overhead, so it's far too heavy to measure just a couple instructions. Use it to measure all the iterations of a loop together, not each one separately. – Peter Cordes Apr 14 '16 at 06:25
  • If you're doing this to find speedups, you need to think about it a little differently. You surely have multiple speedup opportunities in the code, so finding just one of them will not be good enough. The chance that they are all hotspots is pretty small. Try a [*method that speedups cannot hide from*](http://stackoverflow.com/a/25870103/23771). – Mike Dunlavey Apr 14 '16 at 10:44
  • @Art With macros, I would have to manually include the header containing the macro definitions, plus a macro specifying that I want to enable profiling. I would also have to explicitly call a function to print the results at the end of my program, and in some cases an initialization function at the beginning. With a plugin, all those calls might be inserted automatically at compile time, leaving just the start/stop pragmas. – jyvet Apr 14 '16 at 21:17
  • @PeterCordes you are absolutely right, CPU performance counters could be well suited to finding hotspots in some cases. `perf` does a good job. Of course, I don't want to measure time at a granularity of a few cycles, and even less in an innermost loop. I updated my question: what I'd like to achieve here is to track the performance of some code sections for comparison (across code versions, for instance to detect local performance improvements or regressions). The sections are already identified as time-consuming and are on the critical path (this identification might have been done with perf). – jyvet Apr 14 '16 at 21:20
  • @MikeDunlavey thanks for sharing this URL. I like your post (well explained and quite useful). As explained in my previous reply, I'd also like to compare the times of those code sections across several code revisions. – jyvet Apr 14 '16 at 21:23

1 Answer


`perf record` for an event like core clock cycles will attribute event samples to individual instructions. It's not precise, though: the instructions that get the event counts aren't always the ones that are slow themselves; they're often just nearby, e.g. stuck waiting for the slow thing to happen. But it's close enough to be potentially useful.

It seems like all you'd need is to look at the counts per insn or per line of C (mapping via debug info) and see when the relative counts change.

That should work to identify when a change makes a certain part of the asm run slower: the relative share of the counts from perf events will be higher for insns associated with that source line. (+/- a lot of hand-waving because lines of C don't always map directly to asm instructions, esp. when optimization restructures some branch logic, auto-vectorizes, etc.)


It might be possible to cook up an automated test procedure that runs your code under perf record, then massages the perf report data into some kind of format that can be compared / tracked when comparing source versions.

Peter Cordes