I am currently profiling my code, which is C99 compliant, with gprof. My program is defined recursively. I am compiling using gcc with the -pg flag with some warnings enabled, and no optimisation flags. Then I am running the code followed by the command:

gprof ./myProgram > outputFile.txt

I am very confused as to what the output is telling me, though. It is saying that I am spending a lot of time in a function that I know for a fact I do not spend a lot of time in. The function in question has a loop inside and performs a task a set number of times. This is a tiny task, though, and by increasing the number of times the task is completed from, say, 5 to 100 I get no visible change in the total amount of time taken for my program to execute. However, gprof is telling me that I have spent almost 50% of the execution time in this function.

I have no idea what the issue is here. Has anyone encountered something like this before? Can I resolve this myself? I am using a CentOS distribution of Linux. I have tried to install OProfile as an alternative profiler, but I don't have write access to /usr/local, so the make install fails; I am therefore not sure whether this is an issue specific to gprof.

Here are the relevant parts of the output file. The weird output is for the function nl_jacobi_to_converge:

From the flat profile I have

  %   cumulative   self              self     total           
 time   seconds   seconds    calls   s/call   s/call  name    
 49.02      7.50     7.50     1562     0.00     0.01  calculate_resid_numeric_jac_diags_tc41
 16.41     10.01     2.51 323878716     0.00     0.00  c
 14.97     12.30     2.29 323878716     0.00     0.00  b
  6.01     13.22     0.92       89     0.01     0.02  calculate_resid_tc41
  5.69     14.09     0.87 10967082     0.00     0.00  rhs_function
  1.11     14.26     0.17     1562     0.00     0.00  single_nl_jacobi
  ...
  0.00     15.30     0.00       11     0.00     0.76  nl_jacobi_to_converge

And from the granularity I have

index % time    self  children    called     name
                0.00    8.32      11/11          nl_multigrid [3]
[5]     54.4    0.00    8.32      11         nl_jacobi_to_converge [5]
                5.28    2.92    1100/1562        calculate_resid_numeric_jac_diags_tc41 [4]
                0.12    0.00    1100/1562        single_nl_jacobi [17]
  • It would help if you pasted the essential part of your output file. – Chiel Feb 04 '14 at 21:08
  • @Chiel I have put in some extra information from the file. The function nl_jacobi_to_converge, although it sounds like it would take a long time operates on a very small dataset, and should have no effect on the overall run time – Keeran Brabazon Feb 04 '14 at 21:17
  • 1
    As said in the [wikipedia](http://en.wikipedia.org/wiki/Gprof) "Mutual recursion and non-trivial cycles are not resolvable by the gprof approach (context-insensitive call graph)". So, you can try [gperftools](https://code.google.com/p/gperftools/)'s [CPU profiler](http://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html) which has full callchain recording (context-sensitive). – osgx Feb 05 '14 at 12:54
  • 1
    @osgx: Re: [*gperftools CPU profiler*](http://gperftools.googlecode.com/svn/trunk/doc/cpuprofile.html): sigh - so near yet so far. 1) If only it sampled on wall-clock, not CPU time, it could detect needless blocked time. 2) If only it counted only one "hit" per function per stack sample, it would be handling recursion properly. 3) If only it allowed the users to actually see random samples, they could see line-of-code and other issues not exposed by function summaries. 4) If users could more easily throttle the sampling, the sample set would be relevant to the time experienced by the users. – Mike Dunlavey Feb 05 '14 at 15:39

2 Answers

You are one of numerous people trying to figure out what gprof is telling them, to four decimal places. I use random pausing and examining the stack.

In the first place, gprof is a "CPU profiler". That means during IO, mutex, paging, or any other blocking system calls, it is shut off, and doesn't count that time. You say you're doing nothing of the sort, but it could be happening deep inside some library function. If it were, gprof masks it. On the other hand, a single stack sample will show what it is waiting for, with probability equal to the fraction of time it is waiting.

Second, as @keshlam said, it's important to understand "self time". If it's high, that means the program counter was found in that routine a good percent of the CPU time. However, if it's low, that does not mean the function isn't guilty. The function could be spending lots of time, but doing it by calling subfunctions. To see that, you need the gprof "total" column, but as a percent of total time, not as an absolute time per call, which is what it is giving you. If you take a stack sample, then any routine will appear on it with probability equal to the fraction of time it is spending. What's more, you will know exactly why that time is being spent, because the sample will show you the precise line numbers where the calls occur.

ADDED: gprof attempts to handle recursive functions, but as its authors point out, it does not succeed. However, stack samples have no problem with recursion. If a stack sample is taken during a recursive function call, the function appears more than once on the stack, possibly many times. However, it is still the case that the inclusive time cost of a function, or of any line of code that calls a function, is simply the fraction of time it is on the stack. To see this, suppose samples are taken at a constant frequency, for a total of M samples, and a particular function or line of code appears on fraction F of them. If that function or line of code could be made to take no time, such as by deleting it, branching around it, or passing it off to an infinitely fast processor, then it would have no exposure to being sampled. Then the M*F samples on which it had appeared would disappear, shortening execution time by fraction F.

Mike Dunlavey
  • I am still new to profiling, and have never inspected the stack or the heap myself. Could you recommend a tool for me to inspect the stack? I realise that this could go into another question, but I would like to have a quick personal opinion on a good tool to use – Keeran Brabazon Feb 04 '14 at 23:36
  • 2
    Keeran Brabazon, `gdb` or `pstack` on linux. In gdb do: `gdb --args ./yourprogram your args`, then `run`, and then, after some time press `Ctrl-C` to stop. You can see the stack with `backtrace` or `where` commands, and then continue the execution with `continue`. Then repeat Ctrl-C several times. But personally I think that this Poor Man's Profiler (http://poormansprofiler.org/) is something that should be left in 1980s. – osgx Feb 04 '14 at 23:45
  • 1
    @osgx: No question, the manual method is crude. I wouldn't be such a pest about it if it weren't for one little thing - it finds whatever the tools find, plus stuff they don't, for the same reason that debuggers can't find bugs without a human working them. – Mike Dunlavey Feb 05 '14 at 02:31
  • @MikeDunlavey The issue I am having is because my program is recursively defined. I see that you address this issue in the answer that you link to here. I will update the description in my question to include this information. If you could update your answer here to make a quick note that gprof will give weird output for recursively defined code then I will put it as the accepted answer – Keeran Brabazon Feb 05 '14 at 10:29

"Self" time is time actually spent in that function's code. "Cumulative" time is time spent in that function and anything that function calls (self plus children).

Note that percentages are percentages of measured time. A function which runs a great many times (such as your b) will obviously affect runtime more than a function which only runs a few times (your nl_jacobi_to_converge). The time spent in it per call (/call) may be too small to report, but the total can still be significant and may still suggest that it's worth looking for ways to reduce the number of calls and/or improve the efficiency of each invocation.

WARNING: Measurements before the JIT stabilizes are meaningless. And since JIT is nondeterministic, especially but not solely in large multithreaded applications, performance may vary quite substantially from run to run even when the Java code is unchanged. Take multiple measurement passes, with sufficient warmup before measuring, and even then be prepared to sanity-check the profiler's suggestions because they may be misleading.

ANOTHER WARNING: Check whether semaphore-wait time is being measured, and make sure you know what's happening in all the threads. If you aren't counting waits, you may miss performance bottlenecks that aren't in your direct call stack.

(I spent a lot of time optimizing Apache Xalan and its IBM-internal descendents. There, we also had code generation involved in the mix, which had its own nondeterminacies. The initial performance improvements were easy and obvious; after that it became much more challenging!)

keshlam
  • Thanks for the advice, but the problem is that the code is sequential, single threaded and in C. The profiler should not have too many problems, as I see it. I know for a fact that very little time is spent in nl_jacobi_to_converge as when I change the number of loops performed within this method (say from 5 to 100) the actual run time does not change at all (the timing in time.h cannot pick up the difference in running time). I am sure that the correct code is being called, and don't know how to explain that gprof thinks so much time is spent in this method – Keeran Brabazon Feb 04 '14 at 21:32
  • The table agrees with you that very little time is spent in that function itself -- self time is zero, percentage of total time is zero. The _cumulative_ time in that function is high, suggesting that the code it calls (including indirectly) is accounting for that cost. Dive deeper. – keshlam Feb 04 '14 at 21:41
  • Note that it's perfectly possible that the bottleneck is something like I/O, in which case changing the number of iterations -- unless it also changes I/O -- won't affect total time much. Another thought is to watch out for "first time" initializations and hazards of that sort. – keshlam Feb 04 '14 at 21:44
  • There is no I/O. The only thing in the function is a for loop with a call to the child functions. I don't initialise anything in this function. The cumulative time spent in this method should be no more than 0.1 seconds, absolutely maximum, and experiments I have done show that changing the number of loops in this method has no effect on the overall run time. I am stumped as to why the profiler is telling me that I am spending time in here and child functions – Keeran Brabazon Feb 04 '14 at 21:52
  • You can prove it to yourself by commenting out the body of this function and seeing whether the execution time drops. "Should be" may just mean you're overlooking something, and "overall run time" may include time that isn't being accounted for here at all. – keshlam Feb 04 '14 at 21:56
  • (When you think a tool that others are using successfully is lying to you, it's possible you've found a bug, but it's at least as likely you aren't asking it the right question.) – keshlam Feb 04 '14 at 21:57
  • I have put a timing using the clock() method in the function, and this cannot pick up the time spent in here, although this is quite coarse grained. I have also emptied out the body, and then the profiler does not say I spend a large amount of time in there. Could the problem be that child method is a method that overall takes up a lot of time cumulatively, but not when it is called from the function I am interested in? I hope that makes sense – Keeran Brabazon Feb 04 '14 at 22:09
  • Cumulative time includes only that portion of the time spent in children when this function is on the call stack, so... – keshlam Feb 04 '14 at 22:17
  • @keshlam: I think you mean "total". "cumulative" is just a running sum of the next column over, the "self" column. Also, I've never heard of `c` code using a jitter. – Mike Dunlavey Feb 04 '14 at 22:41