
I need a better way of profiling numerical code. Assume that I'm using GCC in Cygwin on 64-bit x86 and that I'm not going to purchase a commercial tool.

The situation is this. I have a single function running in one thread. There are no code dependencies or I/O beyond memory accesses, with the possible exception of some math libraries linked in. For the most part, it's all table look-ups, index calculations, and numerical processing. I've cache-aligned all arrays on the heap and stack. Due to the complexity of the algorithm(s), loop unrolling, and long macros, the assembly listing can become quite lengthy -- thousands of instructions.

I have been resorting to either the tic/toc timer in Matlab, the time utility in the bash shell, or reading the time stamp counter (rdtsc) directly around the function. The problem is this: the variance of the timing (which might be as much as 20% of the runtime) is larger than the size of the improvements I'm making, so I have no way of knowing whether the code is better or worse after a change. You might think it's time to give up, but I would disagree. If you are persistent, many incremental improvements can lead to a two- or three-fold performance increase.
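
For the rdtsc case, this is roughly what the measurement looks like (a minimal sketch; `do_work()` stands in for my actual routine, and the `lfence`/`rdtscp` pairing is the usual recipe for fencing off out-of-order execution around the reads):

```c
#include <stdint.h>
#include <x86intrin.h>   /* __rdtsc, __rdtscp, _mm_lfence with GCC */

extern void do_work(void);  /* placeholder for the routine under test */

static inline uint64_t cycles_start(void)
{
    _mm_lfence();               /* keep earlier work from drifting past the read */
    return __rdtsc();
}

static inline uint64_t cycles_stop(void)
{
    unsigned aux;
    uint64_t t = __rdtscp(&aux);  /* waits for preceding instructions to finish */
    _mm_lfence();                 /* keep later work from starting before the read */
    return t;
}

uint64_t time_one_call(void)
{
    uint64_t t0 = cycles_start();
    do_work();
    uint64_t t1 = cycles_stop();
    return t1 - t0;              /* elapsed reference cycles, not core cycles */
}
```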

One problem I have had multiple times, and which is particularly maddening, is that I make a change and the performance seems to improve consistently by, say, 20%. The next day, the gain is lost. Now, it's possible I made what I thought was an innocuous change to the code and then completely forgot about it. But I'm wondering whether something else is going on. Maybe GCC doesn't yield 100% deterministic output, as I believe it does. Or maybe it's something simpler, like the OS moved my process to a busier core.

I have considered the following, but I don't know if any of these ideas are feasible or make any sense. If yes, I would like explicit instructions on how to implement a solution. The goal is to minimize the variance of the runtime so I can meaningfully compare different versions of optimized code.

  • Dedicate a core of my processor to run only my routine (a rough Windows sketch of this and the next item is below the list).
  • Direct control over the cache(s) (load it up or clear it out).
  • Ensuring my DLL or executable always loads at the same place in memory. My thinking here is that maybe the set-associativity of the cache interacts with the code/data location in RAM to alter performance on each run.
  • Some kind of cycle accurate emulator tool (not commercial).
  • Is it possible to have a degree of control over context switches? Or does it even matter? My thinking is that the timing of the context switches is causing variability, maybe by flushing the pipeline at an inopportune time.
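
For the first two items, this is the kind of thing I have in mind on Windows (a sketch only; the choice of logical CPU 2, the 64-byte line size, and the priority levels are assumptions, and pinning/boosting only reduces migrations and preemption rather than eliminating them):

```c
#include <windows.h>
#include <stddef.h>
#include <stdio.h>
#include <emmintrin.h>   /* _mm_clflush, _mm_mfence */

/* Pin the benchmarking thread to one logical CPU and raise its priority. */
static void pin_and_boost(void)
{
    if (SetThreadAffinityMask(GetCurrentThread(), (DWORD_PTR)1 << 2) == 0)
        fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());

    SetPriorityClass(GetCurrentProcess(), HIGH_PRIORITY_CLASS);
    SetThreadPriority(GetCurrentThread(), THREAD_PRIORITY_HIGHEST);
}

/* Evict a buffer from the cache hierarchy so each trial starts cold.
 * (To start warm instead, read through the buffer once before timing.) */
static void flush_buffer(const void *p, size_t bytes)
{
    const char *c = (const char *)p;
    for (size_t i = 0; i < bytes; i += 64)   /* assume 64-byte cache lines */
        _mm_clflush(c + i);
    _mm_mfence();                            /* make sure the flushes complete */
}
```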

In the past I have had success on RISC architectures by counting instructions in the assembly listing. This only works, of course, if the number of instructions is small. Some compilers (like TI's Code Composer for the C67x) will give you a detailed analysis of how it's keeping the ALU busy.

I haven't found the assembly listings produced by GCC/GAS to be particularly informative. With full optimization on, code is moved all over the place. There can be multiple location directives for a single block of code dispersed about the assembly listing. Further, even if I could understand how the assembly maps back into my original code, I'm not sure there's much correlation between instruction count and performance on a modern x86 machine anyway.

I made a weak attempt at using gcov for line-by-line profiling, but due to an incompatibility between the version of GCC I built and the MinGW compiler, it wouldn't work.

One last thing you can do is average over many, many trial runs, but that takes forever.
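
A cheaper variant I've been considering is to repeat a modest number of runs and keep the minimum instead of the mean, since interference from the OS only ever adds time. Roughly (again, `do_work()` is a stand-in, and the repetition count is whatever the noise demands):

```c
#include <windows.h>

extern void do_work(void);   /* placeholder for the routine under test */

/* Time n calls and report the fastest one in seconds. The minimum is usually
 * more repeatable than the mean because interference is strictly additive. */
double best_of_n_seconds(int n)
{
    LARGE_INTEGER freq, t0, t1;
    QueryPerformanceFrequency(&freq);

    double best = 1e300;
    for (int i = 0; i < n; ++i) {
        QueryPerformanceCounter(&t0);
        do_work();
        QueryPerformanceCounter(&t1);
        double s = (double)(t1.QuadPart - t0.QuadPart) / (double)freq.QuadPart;
        if (s < best)
            best = s;
    }
    return best;
}
```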


EDIT (RE: Call Stack Sampling)

The first question I have is, practically, how do I do this? In one of your PowerPoint slides, you showed using Visual Studio to pause the program. What I have is a DLL compiled by GCC with full optimizations in Cygwin. This is then called by a mex DLL compiled by Matlab using the VS2013 compiler.

The reason I use Matlab is because I can easily experiment with different parameters and visualize the results without having to write or compile any low level code. Further, I can compare my optimized DLL to the high level Matlab code to ensure my optimizations have not broken anything.

The reason I use GCC is that I have a lot more experience with it than with Microsoft's compiler. I'm familiar with many flags and extensions. Further, Microsoft has been reluctant, at least in the past, to maintain and update the native C compiler (C99). Finally, I've seen GCC kick the pants off commercial compilers, and I've looked at the assembly listing to see how it's actually done. So I have some intuition of how the compiler actually thinks.

Now, with regard to making guesses about what to fix: that isn't really the issue; it's more like making guesses about how to fix it. In this example, as is often the case in numerical algorithms, there is really no I/O (excluding memory). There are no function calls. There's virtually no abstraction at all. It's like I'm sitting on top of a piece of saran wrap: I can see the computer architecture below, and there's really nothing in between. If I rolled all the loops back up, I could probably fit the code on about one page, and I could almost count the resulting assembly instructions. Then I could do a rough comparison against the theoretical number of operations a single core is capable of to see how close to optimal I am. The trouble is that I would lose the auto-vectorization and instruction-level parallelism I got from unrolling. Unrolled, the assembly listing is too long to analyze in this way.

The point is that there really isn't much to this code. However, due to the incredible complexity of the compiler and modern computer architecture, there is quite a bit of optimization to be had even at this level. But I don't know how small changes are going to affect the output of the compiled code. Let me give a couple of examples.

This first one is somewhat vague, but I'm sure I've seen it happen a few times. You make a small change and get a 10% improvement. You make another small change and get another 10% improvement. You undo the first change and get another 10% improvement. Huh? Compiler optimizations are neither linear nor monotonic. It's possible the second change required an additional register, which broke the first change by forcing the compiler to alter its register allocation. Maybe the second change somehow occluded the compiler's ability to perform other optimizations, and undoing the first change fixed that. Who knows? Unless the compiler is introspective enough to dump its full analysis at every level of abstraction, you'll never really know how you ended up with the final assembly.

Here is a more specific example, which happened to me recently. I was hand-coding AVX intrinsics to speed up a filter operation. I thought I could unroll the outer loop to increase instruction-level parallelism. So I did, and the result was that the code was twice as slow. What happened was that there were not enough 256-bit registers to go around, so the compiler was temporarily spilling results to the stack, which killed performance.
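
To make the shape of the problem concrete, here's a hypothetical sketch (not my actual filter; the tap count and the four-wide unroll are made up, and it assumes `in[]` holds `n + NTAPS - 1` samples, the usual FIR padding). The second version keeps four accumulators plus loads and broadcasts live at once; with only 16 ymm registers, the compiler can start spilling to the stack, though whether it actually does depends on the tap count and what else it keeps live:

```c
#include <immintrin.h>   /* AVX intrinsics; compile with -mavx */

#define NTAPS 8          /* hypothetical filter length */

/* One 8-wide output block per iteration: few live __m256 values. */
void filter_1x(const float *in, const float *coef, float *out, int n)
{
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 acc = _mm256_setzero_ps();
        for (int t = 0; t < NTAPS; ++t) {
            __m256 x = _mm256_loadu_ps(in + i + t);
            __m256 c = _mm256_broadcast_ss(coef + t);
            acc = _mm256_add_ps(acc, _mm256_mul_ps(x, c));
        }
        _mm256_storeu_ps(out + i, acc);
    }
}

/* Outer loop unrolled 4x: more instruction-level parallelism, but four
 * accumulators plus the loads and broadcasts all compete for ymm registers. */
void filter_4x(const float *in, const float *coef, float *out, int n)
{
    for (int i = 0; i + 32 <= n; i += 32) {
        __m256 a0 = _mm256_setzero_ps(), a1 = _mm256_setzero_ps();
        __m256 a2 = _mm256_setzero_ps(), a3 = _mm256_setzero_ps();
        for (int t = 0; t < NTAPS; ++t) {
            __m256 c = _mm256_broadcast_ss(coef + t);
            a0 = _mm256_add_ps(a0, _mm256_mul_ps(_mm256_loadu_ps(in + i +  0 + t), c));
            a1 = _mm256_add_ps(a1, _mm256_mul_ps(_mm256_loadu_ps(in + i +  8 + t), c));
            a2 = _mm256_add_ps(a2, _mm256_mul_ps(_mm256_loadu_ps(in + i + 16 + t), c));
            a3 = _mm256_add_ps(a3, _mm256_mul_ps(_mm256_loadu_ps(in + i + 24 + t), c));
        }
        _mm256_storeu_ps(out + i +  0, a0);
        _mm256_storeu_ps(out + i +  8, a1);
        _mm256_storeu_ps(out + i + 16, a2);
        _mm256_storeu_ps(out + i + 24, a3);
    }
}
```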

As I was alluding to in this post, which you commented on, it's best to tell the compiler what you want, but unfortunately you often have no choice and are forced to hand-tweak optimizations, usually by guess and check.

So I guess my question would be: in these scenarios (the code is effectively small until unrolled, each incremental performance change is small, and you're working at a very low level of abstraction), is "precision of timing" more useful, or is call stack sampling better at telling me which version of the code is superior?

– Todd
  • You can have a look at kcachegrind to profile, especially cache accesses (cache misses are shown). – Eliot B. Nov 24 '16 at 16:20
  • To attach a process to a CPU you have `CPU_SET` and `sched_setaffinity` – Eliot B. Nov 24 '16 at 16:24
  • I work for a commercial company that mostly operates in the Windows world. Are either of these options available on Windows? – Todd Nov 25 '16 at 15:10
  • There seems to be **Qcachegrind** for Windows; however, `sched_setaffinity` is a Linux-specific call. – Eliot B. Nov 25 '16 at 15:36

2 Answers


I faced a similar problem some time ago, but that was on Linux, which made it easier to tweak. Basically, the noise introduced by the OS ("OS jitter") was as large as 5-10% in SPEC2000 tests (I can imagine it's much higher on Windows due to the much larger amount of bloatware).

I was able to bring the deviation below 1% with a combination of the following:

  • disable dynamic frequency scaling (better to do this both in the BIOS and in the Linux kernel, as not all kernel versions do it reliably)
  • disable memory prefetching and other fancy settings like Turbo Boost, etc. (BIOS, again)
  • disable hyperthreading
  • enable the high-performance process scheduler in the kernel
  • bind the process to a core to prevent thread migration (use core 0; for some reason it was more reliable on my kernel, go figure) - a minimal sketch of this step is below the list
  • boot to single-user mode (in which no services are running); this isn't as easy in modern systemd-based distros
  • disable ASLR
  • disable the network
  • drop the OS page cache

There may be more to it, but 1% noise was good enough for me.
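
Most of the items above are BIOS, bootloader, or sysfs settings rather than code, but the core binding itself is just a few lines with the glibc wrapper (a minimal sketch):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

/* Pin the current process (and hence the benchmark thread) to core 0. */
int bind_to_core0(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(0, &set);
    if (sched_setaffinity(0, sizeof(set), &set) != 0) {  /* pid 0 = this process */
        perror("sched_setaffinity");
        return -1;
    }
    return 0;
}
```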

I might put detailed instructions on GitHub later today if you need them.

-- EDIT --

I've published my benchmarking script and instructions here.

– yugr
  • This is a helpful answer, although I was hoping for a Windows-oriented solution. If no one else makes a major contribution, I'll accept this as my answer. – Todd Nov 25 '16 at 15:37
  • Yeah, this doesn't exactly answer your question, but it at least lists the most common reasons for performance jitter. Sorry to say, but Windows would probably be much harder to tame (if it can be tamed at all). – yugr Nov 25 '16 at 15:43

Am I right that what you're doing is making an educated guess of what to fix, fixing it, and then trying to measure to see if it made any difference?

I do it a different way, which works especially well as the code gets large. Rather than guess (which I certainly can), I let the program tell me how the time is spent, by using this method. If the method tells me that roughly 30% is spent doing such-and-so, I can concentrate on finding a better way to do that. Then I can run it and just time it. I don't need a lot of precision. If it's better, that's great. If it's worse, I can undo the change. If it's about the same, I can say, "Oh well, maybe it didn't save much, but let's do it all again to find another problem."

I need not worry. If there's a way to speed up the program, this will pinpoint it. And often the problem is not just a simple statement like "line or routine X spends Y% of the time", but "the reason it's doing that is Z in certain cases" and the actual fix may be elsewhere. After fixing it, the process can be done again, because a different problem, which was small before, is now larger (as a percent, because the total has been reduced by fixing the first problem). Repetition is the key, because each speedup factor multiplies all the previous, like compound interest.

When the program no longer points out things I can fix, I can be sure it is nearly optimal, or at least nobody else is likely to beat it.

And at no point in this process did I need to measure the time with much precision. Afterwards, if I want to brag about it in a PowerPoint, maybe I'll do multiple timings to get a smaller standard error, but even then, what people really care about is the overall speedup factor, not the precision.

– Mike Dunlavey
  • I'm going to read the full post as soon as I have the energy to try to understand it. I'm still ignorant here, but it seems it would be impossible to determine whether a fix is better or worse without either controlling most sources of non-determinism so that the "OS jitter" is smaller than the size of the change (as I mentioned) or a lot of averaging (which is time intensive). – Todd Nov 25 '16 at 20:04
  • @Todd: If you're new to it it can seem strange, but all the upvotes [*here*](http://stackoverflow.com/a/378024/23771) are not by accident. And, while every program is different, I like to show [*this example*](http://softwareengineering.stackexchange.com/a/302345/2429) where the speedup was 730x. There are two kinds of precision: precision of timing (without being too sure what takes the time), and precision of *diagnosis* (where the time it takes is what it is, but if it's more or less doesn't really affect what you do about it). – Mike Dunlavey Nov 26 '16 at 02:18
  • I'm very interested in this, and if you write a book, I may read it some day. Some of the probability theory would be very helpful to understand if it was defined from first principles with lots of good examples. However, I'm not convinced that this technique is applicable to my problem. The issue at hand is really how do I speed things up, not where is the problem. I'm going to edit my post; then you can decide if this makes sense. – Todd Nov 26 '16 at 18:36
  • I just used your sampling method on an unrelated program written by a coworker in C#. Manual sampling stops about 80% of the time on a call to an external utility. The VS2013 sampling profiler is telling me that about 90% of the time is being taken up in some internal routines related to a lot of string manipulation. I have no idea what happens when you hit pause in VS. Is it possible the call to an external program is strongly biasing where/when the program halts when using manual sampling? Which should I trust? – Todd Nov 28 '16 at 22:42
  • @Todd: Unfortunately, a property of sampling-based profilers (other than random pausing and maybe Zoom) is that they are blind to I/O or other system waits. There is *no good reason* for this. It is just a historical accident, because the first sampling profilers sampled the instruction pointer, which is meaningless during I/O. That, plus the weird idea that precision of timing helps you find problems, led others to follow suit. [*Here are the issues.*](http://stackoverflow.com/a/1779343/23771) The long and short of it is that sampling in VS will not show you the problem. – Mike Dunlavey Nov 29 '16 at 00:17
  • You were correct about the sampling profiler. I'm still not so sure about the how vs where optimization issue. However, it hasn't even been a week, and this information was invaluable in quickly identifying this other optimization issue I was asked to look at, so I accepted your response as being the most useful to me. – Todd Nov 30 '16 at 22:42
  • @Todd: Glad you got some value out of it. On the how-vs-where issue, let me dig for a poor analogy. Somebody could say "Twice I saw your spouse shuffling home plastered at 2am from Abner's bar on 27th st.". On the other hand, you could just time when he/she comes in, measured to the second on multiple occasions, and then suspect the drinking. The first one tells you everything you need to know, and the fact that it is seen twice tells you the problem is major. The second one gives you precise numbers but leaves you guessing about the problem. – Mike Dunlavey Dec 01 '16 at 01:03