
Let's say I want to benchmark two competing implementations of some function double a(double b, double c). I already have a large array <double, 1000000> vals from which I can take input values, so my benchmarking would look roughly like this:

//start timer here
double r;
for (int i = 0; i < 1000000; i+=2) {
    r = a(vals[i], vals[i+1]);
}
//stop timer here

Now, a clever compiler could realize that I can only ever use the result of the last iteration and simply kill the rest, leaving me with double r = a(vals[999998], vals[999999]). This of course defeats the purpose of benchmarking.

Is there a good way (bonus points if it works on multiple compilers) to prevent this kind of optimization while keeping all other optimizations in place?

(I have seen other threads about inserting empty asm blocks, but I'm worried that might prevent inlining or reordering. I'm also not particularly fond of accumulating the results with sum += r; during each iteration, because that's extra work that should not be included in the resulting timings. For the purposes of this question, it would be great if we could focus on alternative solutions, although for anyone interested there is a lively discussion in the comments where the consensus is that += is the most appropriate method in many cases.)

us2012
  • Why should you want to do that, exactly? – Bartek Banachewicz Mar 09 '13 at 11:12
  • Because I have two different implementations of `a` and I want to compare them. – us2012 Mar 09 '13 at 11:13
  • reordering is normally done as an optimization, afaik. – Tony The Lion Mar 09 '13 at 11:13
  • Then use `+=` for both versions. – Mysticial Mar 09 '13 at 11:13
  • @Mysticial I know I might be obsessing a bit too much about the overhead of one simple addition, but I'd still be interested to see whether there's a general way to tackle this. – us2012 Mar 09 '13 at 11:14
  • @us2012 I highly doubt that a `+=` is going to be "too much overhead" based on what I'm seeing so far. – Mysticial Mar 09 '13 at 11:15
  • put the testable functions in other translation units. that way the compiler doesn't know anything about the functions, doesn't inline them etc. thus, it won't be able to 'optimize the calls to these functions' out since they might have side effects (eg code which does things on global variables etc). – akira Mar 09 '13 at 11:17
  • @us2012: that *is* the general way to tackle it. There are plenty of specialized ways to do it (place the function in a separate translation unit and disable link-time code gen, or place it in a dll, so the compiler doesn't know the function call could be omitted). But the *general* approach is to make sure the result from the call is used. And if you care about performance, then you owe it to yourself to have at least a basic understanding of modern CPU performance. And then you'll know that the overhead of that `+=` operation is basically nil. – jalf Mar 09 '13 at 11:17
  • *Now, a clever compiler could realize that I can only ever use the result of the last iteration and simply kill the rest* That would be a stupid compiler. `a` might have side-effects. – ta.speot.is Mar 09 '13 at 11:18
  • Tricks like moving the function to another translation unit, or putting empty `asm` blocks or similar might preclude certain optimizations. Simply using the result of the function as @Mysticial suggests allows the compiler to optimize fully, so it will really give you the most realistic results – jalf Mar 09 '13 at 11:19
  • @jalf: `-O2` yields the same code, even in a different translation unit. it's about as realistic as having it in the same TU. except for code that is supposed to be inlined. the relative speed-difference between 2 inlined-functions or 2 usual-functions is about to be the same. – akira Mar 09 '13 at 11:22
  • @jalf Thanks for your suggestions, but I don't quite see how this turned into "you don't have a basic understanding of". I have edited my question to make amends for the fact that `+=` may be a valid solution for the practical problem at hand - but as stated, I knew about `+=` before I asked this question. – us2012 Mar 09 '13 at 11:24
  • @us2012: I don't know whether you have a basic understanding of a modern CPU's performance characteristics (although if you are worried that the `+=` may skew your results, it would appear not). But none of this changes the basic fact that this is the best, simplest, most accurate and most reliable approach. It also has *far* less overhead than the other solutions – jalf Mar 09 '13 at 12:21
  • By the way, I love that your question states that you're worried about solutions that may prevent inlining, and then you accept the answer which *guarantees* that inlining will be prevented, and which involves actually disabling certain compiler optimizations. While, apparently, ignoring the solution that will allow the compiler to inline as much as it would otherwise. – jalf Mar 09 '13 at 12:22
  • @jalf I am not ignoring your solution at all. I have upvoted Mysticial's comment stating that the overhead is negligible and I have edited my question to point readers towards those comments. Daniel's answer provides two approaches, none of which may be perfect but both of which are interesting, so I have accepted it (maybe a little early, but hey). I realize that prior to my last edit, the question was a little passive-aggressive towards the suggestion of `+=`, but I do hope that the tone of it is now friendlier and more professional. – us2012 Mar 09 '13 at 12:34
  • A `PADDxx` instruction that is used for adding a double in SSE code will take 6 clock cycles on the latest AMD processors. Probably similar on Intel. Anything else you do is likely to take AT LEAST that long. – Mats Petersson Mar 09 '13 at 12:35
  • use a function pointer to call different `a` implementations. – auselen Mar 11 '13 at 12:01
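The `+=` approach discussed in the comments above can be sketched roughly as follows. This is only an illustration: the body of `a`, the `BenchResult` struct and the name `bench_with_sum` are hypothetical and not from the thread. The point is that every call feeds into `sum`, so none of the calls is dead code as long as the caller actually uses `sum` afterwards (for example by printing it).

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for one of the implementations under test.
inline double a(double b, double c) { return b * c + 1.0; }

struct BenchResult {
    double sum;            // accumulated results; use it (e.g. print it) after timing
    long long elapsed_ns;  // wall-clock duration of the loop in nanoseconds
};

// Calls a() on consecutive pairs of vals and accumulates every result,
// so each iteration contributes to an observable value.
inline BenchResult bench_with_sum(const std::vector<double>& vals) {
    assert(vals.size() % 2 == 0);  // the loop consumes values in pairs
    auto start = std::chrono::steady_clock::now();
    double sum = 0.0;
    for (std::size_t i = 0; i < vals.size(); i += 2) {
        sum += a(vals[i], vals[i + 1]);
    }
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return {sum, static_cast<long long>(ns)};
}
```

Both implementations would be timed with the same harness, so whatever the addition costs is paid equally by each and largely cancels out in a comparison.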

1 Answer


Put a in a separate compilation unit and do not use LTO (link-time optimizations). That way:

  • The loop is always identical (no difference due to optimizations based on a)
  • The overhead of the function call is always the same
  • To measure the pure overhead and to have a baseline to compare implementations, just benchmark an empty version of a

Note that because the definition of a is not visible in the calling translation unit, the compiler cannot assume that the call to a has no side effects, so it cannot optimize the loop away and replace it with just the last call.
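A minimal sketch of such a harness might look like the following. The names `a_v1`, `a_empty`, `Timing` and `time_loop` are assumptions made for illustration. The two definitions at the bottom are included only so that the sketch compiles as a single file. In the setup described above they would each live in their own .cpp file, built without LTO, so that the harness sees nothing but the declarations.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Declarations only: this is all the benchmark translation unit should see.
double a_v1(double b, double c);     // hypothetical implementation under test
double a_empty(double b, double c);  // hypothetical empty version for the baseline

struct Timing {
    double last;           // result of the final call
    long long elapsed_ns;  // wall-clock duration of the loop in nanoseconds
};

// The loop is identical for every implementation; only the callee differs.
template <double (*F)(double, double)>
Timing time_loop(const std::vector<double>& vals) {
    assert(vals.size() % 2 == 0);  // the loop consumes values in pairs
    double r = 0.0;
    auto start = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < vals.size(); i += 2) {
        r = F(vals[i], vals[i + 1]);
    }
    auto stop = std::chrono::steady_clock::now();
    auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
    return {r, static_cast<long long>(ns)};
}

// ASSUMPTION: placeholder bodies, defined here only to keep the sketch
// self-contained. Move them into separate .cpp files for a real measurement,
// otherwise the compiler can see them and may inline or drop the calls.
double a_v1(double b, double c) { return b * c; }
double a_empty(double, double) { return 0.0; }
```

Subtracting the time of `time_loop<a_empty>` from that of `time_loop<a_v1>` gives a rough estimate of the work done inside `a_v1` itself, with loop and call overhead removed.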


A totally different approach could use RDTSC, an x86 instruction that reads the CPU's time-stamp counter, a hardware register that counts clock cycles. It's sometimes useful for micro-benchmarks, but it's not exactly trivial to interpret the results correctly. For example, check out this and google/search SO for more information on RDTSC.
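As a rough sketch of reading the counter, GCC and Clang expose the `__rdtsc()` intrinsic through `<x86intrin.h>` and MSVC through `<intrin.h>`. The helper names `read_ticks` and `ticks_for` are hypothetical, and the fallback to `std::chrono` on non-x86 targets is an assumption added so that the sketch compiles everywhere. Note that RDTSC is not a serializing instruction, so the CPU may reorder it relative to the measured code, and on many modern processors the counter ticks at a constant rate rather than following the actual core clock.

```cpp
#include <cstdint>

#if defined(__x86_64__) || defined(__i386__) || defined(_M_X64) || defined(_M_IX86)
  #if defined(_MSC_VER)
    #include <intrin.h>
  #else
    #include <x86intrin.h>
  #endif
  // Reads the time-stamp counter via the compiler intrinsic.
  inline std::uint64_t read_ticks() { return __rdtsc(); }
#else
  #include <chrono>
  // ASSUMPTION: fallback for non-x86 targets; returns nanoseconds, not cycles.
  inline std::uint64_t read_ticks() {
      auto now = std::chrono::steady_clock::now().time_since_epoch();
      return static_cast<std::uint64_t>(
          std::chrono::duration_cast<std::chrono::nanoseconds>(now).count());
  }
#endif

// Returns the number of counter ticks that elapsed while work() ran.
// No serialization or warm-up is done, so treat single readings with care.
template <typename F>
std::uint64_t ticks_for(F&& work) {
    std::uint64_t start = read_ticks();
    work();
    std::uint64_t stop = read_ticks();
    return stop - start;
}
```

In practice one would repeat the measurement many times and look at the minimum or median, since interrupts, frequency scaling and migration between cores can all distort individual readings.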

Daniel Frey