
As far as I understand, a sampling profiler works as follows: it interrupts the program execution at regular intervals and reads out the call stack. It notes which part of the program is currently executing and increments a counter that represents this part of the program. In a post-processing step, each function's share of the whole execution time is computed from the counter C for that specific function and the total number of samples N:

ratio of the function = C / N

Finding the hotspots is then easy, as these are the parts of the program with a high ratio.
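To make that concrete, here is a minimal sketch of such a profiler in Python. The background sampler thread, the 10 ms interval, and counting only the innermost frame are illustrative assumptions, not how any real profiler is implemented:

```python
import collections
import sys
import threading
import time

def sampler(counts, total, interval, stop, target_id):
    """Interrupt-style sampler: every `interval` seconds, look up the
    target thread's current frame and count the executing function."""
    while not stop.is_set():
        frame = sys._current_frames().get(target_id)
        if frame is not None:
            counts[frame.f_code.co_name] += 1   # C for this function
            total[0] += 1                       # N, total samples
        time.sleep(interval)

def busy(n):
    """Deliberately slow function so the sampler has something to see."""
    s = 0
    for i in range(n):
        s += i * i
    return s

counts = collections.Counter()
total = [0]
stop = threading.Event()
threading.Thread(
    target=sampler,
    args=(counts, total, 0.01, stop, threading.main_thread().ident),
    daemon=True).start()

busy(20_000_000)        # the "program" being profiled
stop.set()

# Post-processing: ratio of the function = C / N
for name, c in counts.most_common():
    print(f"{name}: {c / total[0]:.0%}")
```

A production profiler would typically use timer interrupts or signals and record whole call stacks, but the counting and the C / N report are the same idea.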

But how can this be done for a parallel program running on parallel hardware? As far as I know, when the program execution is interrupted, the currently executing parts of the program on ALL processors are determined. Because of that, a function which executes in parallel is counted multiple times per sample. The sample count C of this function can therefore no longer be used to compute its share of the whole execution time.
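For example, if a function f executes on all 4 cores for the entire run, every sampling interrupt counts f four times, so C = 4N and C / N = 4 - clearly no longer a share of the execution time.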

Is my thinking correct? Are there other ways to identify the hotspots of a parallel program, or is this just not possible using sampling?

Constantin

1 Answer


You're on the right track. Whether you need to sample all the threads depends on whether they are doing the same thing or different things. It is not essential to sample them all at the same time. You need to look at the threads that are actually working, not just idling. Some points:

  • Sampling should be on wall-clock time, not CPU time, unless you want to be blind to needless I/O and other blocking calls. (A sketch of such a sampler follows this list.)

  • You're not just interested in which functions are on the stack, but which lines of code, because they convey the purpose of the time being spent. It is more useful to look for a "hot purpose" than a "hot spot".

  • The cost of a function or line of code is just the fraction of samples it appears on. To appreciate that, suppose samples are taken every 10ms for a total of N samples. If the function or line of code could be made to disappear, then all the samples in which it is on the stack would also disappear, reducing N by that fraction. That's what speedup is.

  • In spite of the last point, in sampling, quality beats quantity. When the goal is to understand what opportunities you have for speedup, you get farther faster by manually scrutinizing 10-20 samples to understand the full reason why each moment in time is being spent. That's why I take samples manually. Knowing the amount of time with statistical precision is really far less important.

  • I can't emphasize enough the importance of finding and fixing more than one problem. Speed problems seldom come alone; there are usually several, and each one you fix has a multiplier effect on the ones already fixed. The ones you don't find end up being the limiting factor.

  • Programs that involve a lot of asynchronous inter-thread message-passing are more difficult, because it becomes harder to discern the full reason why a moment in time is being spent.
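Here is the sketch referred to in the first point above, a hedged illustration rather than any particular profiler's implementation: a wall-clock sampler in Python that walks every thread's stack and credits each source line with the fraction of samples it appears on. The 10 ms interval and the two-thread demo workload are assumptions for illustration:

```python
import collections
import sys
import threading
import time

def wall_clock_sampler(line_counts, total, interval, stop):
    """Sample on wall-clock time: capture every thread's stack whether
    it is computing or blocked, so blocking calls stay visible."""
    me = threading.get_ident()
    while not stop.is_set():
        seen = set()
        for tid, top in sys._current_frames().items():
            if tid == me:
                continue                     # don't sample the sampler
            frame = top
            while frame is not None:         # walk the whole stack
                seen.add((frame.f_code.co_filename, frame.f_lineno))
                frame = frame.f_back
        # A line is counted at most once per sample, even if several
        # threads are on it, so count / N stays a fraction of wall time.
        for loc in seen:
            line_counts[loc] += 1
        total[0] += 1
        time.sleep(interval)

def compute():                               # CPU-bound worker
    t0 = time.time()
    while time.time() - t0 < 1.0:
        sum(i * i for i in range(1000))

def blocked():                               # stands in for blocking I/O
    time.sleep(1.0)

line_counts = collections.Counter()
total = [0]
stop = threading.Event()
threading.Thread(target=wall_clock_sampler,
                 args=(line_counts, total, 0.01, stop),
                 daemon=True).start()

workers = [threading.Thread(target=compute), threading.Thread(target=blocked)]
for w in workers:
    w.start()
for w in workers:
    w.join()
stop.set()

for (path, lineno), c in line_counts.most_common(10):
    print(f"{path}:{lineno}  appears on {c / total[0]:.0%} of samples")
```

Because a line is counted at most once per sample even when several threads are on it, count / N stays a fraction of wall-clock time, and the blocked thread's sleep shows up just as the compute loop does.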

More on that.

Mike Dunlavey
  • Thanks for your tips; also, your link is very interesting. – Constantin Apr 16 '13 at 18:44
  • However, I just do not understand why I cannot find anything like "A sampling profiler is useless when applied to a parallel program..." in the literature or anywhere else. For example, Parallel Studio from Intel has a heavy focus on multicore programming, but they do not say anywhere that the Parallel Amplifier can only be used for finding hotspots in a sequential program. Do you know of literature or a profiling tutorial which discusses this topic? – Constantin Apr 16 '13 at 18:50
  • @Constantin: I'm not sure what you're looking for. It really depends on the nature of the program, like if you have a weather model, anything you do to sample and speed up one thread will speed them all up, because they're all sharing the same code. If you have a multi-thread Java app that serves simultaneous users, sample and speedup one thread when it is doing something serious. Then they'll all be faster when they are doing that thing, again because they share code. Seems to me it only gets difficult when you get to my last point, and you don't need multiple threads to have that problem. – Mike Dunlavey Apr 16 '13 at 20:00
  • As far as I understand, you are saying that sampling still works in a parallel program when you are trying to diagnose a specific part of the program that you already know is using a lot of CPU time. This is a valid point - however, what I am trying to achieve is to get statistical data in order to identify these CPU-intensive parts in the first place. I mean, this statistical hotspot analysis is a common technique for sequential programs - and I am surprised that it seems to no longer work when applied to parallel programs. – Constantin Apr 16 '13 at 20:37
  • @Constantin: and what you call hotspot analysis doesn't find problems resulting in performance improvement in sequential programs either. It's simple enough to comb the internet for cases. I haven't seen any that claim more than maybe 40% speedup. What appears to happen is people run profilers, display the numbers, can't find any problems to fix, and then happily conclude *there aren't any*! Go ahead - do a search. If you find any, let me know. – Mike Dunlavey Apr 16 '13 at 21:28