5

When compiling the benchmark code below with -O3 I was impressed by the difference it made in latency, so I began to wonder whether the compiler is "cheating" by removing code somehow. Is there a way to check for that? Am I safe to benchmark with -O3? Is it realistic to expect 15x gains in speed?

Results without -O3: Average: 239 nanos, Min: 230 nanos (9 million iterations)
Results with -O3: Average: 14 nanos, Min: 12 nanos (9 million iterations)

int iterations = stoi(argv[1]);
int load = stoi(argv[2]);

long long x = 0;

for(int i = 0; i < iterations; i++) {

    long start = get_nano_ts(); // START clock

    for(int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    long end = get_nano_ts(); // STOP clock

    // (omitted for clarity)
}

cout << "My result: " << x << endl;

Note: I am using clock_gettime to measure:

long get_nano_ts() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000 + ts.tv_nsec;
}
LatencyGuy
  • 1
    Think about it: if the compiler removed code of any importance, your program wouldn't work as expected. – Captain Obvlious Jul 02 '15 at 22:43
  • @CaptainObvlious Why does it make such a huge difference? Are 15x gains common / expected? – LatencyGuy Jul 02 '15 at 22:44
  • 1
    You need to do something unremovable with `x` such as outputting it. – David Schwartz Jul 02 '15 at 22:44
  • Optimizations. Elven Magic. Removal of debug checks. – Captain Obvlious Jul 02 '15 at 22:45
  • @DavidSchwartz I am doing it. And it prints the correct value. So I guess I have to assume that `-O3` just makes my code 15 times faster. Is that a fair assumption? – LatencyGuy Jul 02 '15 at 22:45
  • @LatencyGuy It's not an assumption. It's a fact. It's computing the result 15 times faster. – David Schwartz Jul 02 '15 at 22:46
  • 1
    15x is not unheard of (it depends, of course, on what you're doing!) Since you get the same result with your optimized code (and you're forcing the result to actually be computed by printing it out, so the compiler is not allowed to elide the computation entirely) it seems this is the real gain. I've seen much higher gains than that, by the way (and lower too). – Cameron Jul 02 '15 at 22:54
  • 1
    In this case, I suspect most of the gains are due to variables that were being loaded from/to the stack in O0 now have proper register allocation, drastically reducing the overhead of the inner loop. Also, both expressions in the inner if bodies are constant within the inner loop, and are probably hoisted outside of it. In fact, the entire inner loop might have been rewritten with the ifs on the outside of it instead of inside. Would be interesting to check the assembly generated and see. – Cameron Jul 02 '15 at 22:58
  • While we are at it: I did [this](http://stackoverflow.com/questions/24468397/a-timer-for-arbitrary-functions) a while ago for fun, might be useful for you. Technically could suffer from the reordering fiasco too of course. – Baum mit Augen Jul 02 '15 at 23:08
  • Oh, and a possibly useful benchmarking tip: Make a separate C or C++ file that calls your optimized function in a loop. Compile that one file with -O0, but compile your other code with -O3. That way you can be sure all the calls are made. – Zan Lynx Jul 02 '15 at 23:26 [a sketch of this setup follows these comments]
  • Btw, the problem about the reordering of the timer calls was [asked before](http://stackoverflow.com/questions/9053849/prevent-code-being-moved-by-gcc-in-benchmark-code). Apparently, it is (usually) not a problem. (Although the top answer does not fully convince me yet tbh.) – Baum mit Augen Jul 02 '15 at 23:36
  • @ZanLynx That's a great idea, Zan. It would be nice if there were a directive that you could put around your code saying DO NOT OPTIMIZE HERE. – LatencyGuy Jul 03 '15 at 00:09
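
To make Zan Lynx's tip concrete, here is a minimal sketch of that setup. The file names work.cpp and bench_driver.cpp and the function do_work are made up for illustration; the point is that the call crosses a translation-unit boundary, and since the driver is built at -O0 (and assuming link-time optimization is off), the compiler cannot inline or remove the calls.

// ---- work.cpp : the code being measured, compile with -O3 ----
//   g++ -O3 -c work.cpp
long long do_work(int load) {
    long long x = 0;
    for (int j = 0; j < load; ++j)
        x += (j % 4) * (j % 8);
    return x;
}

// ---- bench_driver.cpp : the driver, compile with -O0 so every call really happens ----
//   g++ -O0 -c bench_driver.cpp
//   g++ bench_driver.o work.o -o bench
#include <iostream>

long long do_work(int load);          // defined in work.cpp, built with -O3

int main() {
    long long total = 0;
    for (int i = 0; i < 1000000; ++i)
        total += do_work(64);         // opaque call from the -O0 driver's point of view
    std::cout << total << '\n';       // result is printed, so it must be computed
}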

3 Answers

2

The compiler will certainly be "cheating" and removing unnecessary code when compiling with optimization enabled. It actually goes to great lengths to speed up your code, which almost always leads to impressive speed-ups. If it were somehow able to derive a formula that calculates the result in constant time instead of using this loop, it would. A constant factor of 15 is nothing out of the ordinary.
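
To illustrate that last point with a sketch (actual behavior varies by compiler and version): recent GCC and Clang at -O3 will often recognize a simple accumulation loop and replace it with a closed-form expression, so its run time stops depending on the loop count at all. The example below is hypothetical and not taken from your code.

// Sketch only: many optimizers replace this loop with the closed form n*(n-1)/2
// ("final value replacement"), so the "loop" runs in constant time.
#include <cstdlib>
#include <iostream>

int main(int argc, char* argv[]) {
    if (argc < 2) return 1;
    long long n = std::atoll(argv[1]);  // runtime input, unknown at compile time
    long long sum = 0;
    for (long long i = 0; i < n; ++i)
        sum += i;                       // candidate for closed-form replacement
    std::cout << sum << '\n';           // printed, so the result itself cannot be dropped
}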

But this does not mean that you should profile un-optimized builds! Indeed, when using languages like C and C++, the performance of un-optimized builds is pretty much completely meaningless. You need not worry about that at all.

Of course, this can interfere with micro-benchmarks such as the one you showed above. Two points on that:

  1. More often than not, such micro-optimizations do not matter either. Prefer profiling your actual program and then removing bottlenecks.
  2. If you actually want such a micro benchmark, make it depend on some runtime input and display the result. That way, the compiler cannot remove the functionality itself, just make it reasonably fast.

Since you seem to be doing that, the code you show has a good chance of being a reasonable micro-benchmark. One thing you should watch out for is whether your compiler moves both calls to get_nano_ts() to the same side of the loop. It is allowed to do this since "run time" does not count as an observable side effect. (The standard does not even mandate that your machine operate at finite speed.) It was argued here that this usually is not a problem, though I cannot really judge whether the answer given is valid or not.
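
If you want to guard against that explicitly, one widely used (but non-standard, GCC/Clang-specific) trick is an empty asm statement that takes the accumulated value as an operand, similar in spirit to Google Benchmark's DoNotOptimize. A minimal sketch, reusing your get_nano_ts() and with a made-up load value:

// Sketch only (GCC/Clang extension, not standard C++): the empty asm statements
// force x to hold its up-to-date value at exactly these two points, so the
// compiler cannot hoist the loop above the START read or sink it below STOP.
#include <ctime>
#include <iostream>

long get_nano_ts() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000 + ts.tv_nsec;
}

int main() {
    long long x = 0;
    const int load = 1000;                      // stand-in for the runtime-supplied load

    long start = get_nano_ts();                 // START clock
    asm volatile("" : "+r"(x) : : "memory");    // barrier: pins x right after START

    for (int j = 0; j < load; ++j)
        x += j % 3;                             // the work being timed

    asm volatile("" : "+r"(x) : : "memory");    // barrier: x fully computed before STOP
    long end = get_nano_ts();                   // STOP clock

    std::cout << "elapsed: " << (end - start) << " ns, x = " << x << '\n';
}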

If your program does not do anything expensive other than the thing you want to benchmark (which it, if possible, should not do anyway), you can also move the time measurement "outside", e.g. with time.

Baum mit Augen
  • Wow! How can I make sure it is not moving my `get_nano_ts()` call? That would be a total loss! – LatencyGuy Jul 02 '15 at 23:03
  • @LatencyGuy Print it. As a rule of thumb, the compiler cannot change the visible output of the program (there are a few exceptions). This means the order of operations that produce side effects cannot be changed. – luk32 Jul 02 '15 at 23:05
  • @LatencyGuy In this case I would recommend the `time` method I mentioned at the very end of my answer. In general, [`volatile`](http://en.cppreference.com/w/cpp/language/cv) can help you to explicitly disable optimization for certain things. – Baum mit Augen Jul 02 '15 at 23:05
  • @luk32 But as long as it keeps the relative order of the prints, it can reorder the stuff that happens between the prints. Again, execution time does *not* count as visible side effect. So printing will not be enough to guarantee anything. – Baum mit Augen Jul 02 '15 at 23:07
  • @luk32 I guess I will just use the `start` time on my calculations, although that will give me a different answer every time. Now the `stop` time I am not sure what to do. Man, if the compiler moves around my `get_nano_ts()` then it can become madness to benchmark, no? – LatencyGuy Jul 02 '15 at 23:13
  • @LatencyGuy As I said, you can prevent some moving around with `volatile` (see the sketch just below). If you care about the details for that, search or post a new question. Keep in mind that benchmarking a single isolated function or loop often is not too helpful in the end. To benchmark your whole big program, you use a [profiler](https://en.wikipedia.org/wiki/Profiling_(computer_programming)). – Baum mit Augen Jul 02 '15 at 23:22
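
To show what the volatile suggestion looks like in practice, here is a minimal sketch; the sink variable is made up for the example. Because a write to a volatile object counts as observable behaviour, the compiler has to actually compute the value being stored and cannot discard the work feeding it, even at -O3.

// Sketch only: stores to a volatile object are observable side effects, so the
// loop feeding sink cannot be optimized away even if nothing else reads it.
#include <iostream>

volatile long long sink;                // hypothetical "do not optimize away" target

int main() {
    long long x = 0;
    for (int i = 0; i < 1000000; ++i)
        x += (i % 16) * (i % 32);       // same flavour of work as in the question
    sink = x;                           // forces the computation of x to survive -O3
    std::cout << "x = " << sink << '\n';
}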
1

It can be very difficult to be sure you are measuring what you think you are measuring. In the case of the inner loop:

for (int j = 0; j < load; ++j)
    if (i % 4 == 0)
        x += (i % 4) * (i % 8);
    else
        x -= (i % 16) * (i % 32);

A shrewd compiler might be able to see through that and change the code to something like:

 x = load * 174;   // example only

I know that isn't equivalent, but there is some fairly simple expression which can replace that loop.

The way to be sure is to use the gcc -S compiler option and look at the assembly code it generates.

wallyk
0

You should always benchmark with optimizations turned on. However, it is important to make sure the things you want to time do not get optimized away by the compiler.

One way to do this is by printing out calculation results after the timer has stopped:

long long x = 0;

for(int i = 0; i < iterations; i++) {

    long start = get_nano_ts(); // START clock

    for(int j = 0; j < load; j++) {
        if (i % 4 == 0) {
            x += (i % 4) * (i % 8);
        } else {
            x -= (i % 16) * (i % 32);
        }
    }

    long end = get_nano_ts(); // STOP clock

    // now print out x so the compiler doesn't just ignore it:
    std::cout << "check: " << x << '\n';

    // (omitted for clarity)
}

When comparing benchmarks for several different algorithms, that can also serve as a check that each algorithm is producing the same results.
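
For instance, a minimal sketch of that cross-check (the two functions sum_loop and sum_formula are made-up stand-ins for whatever algorithms you are comparing, and the timing reuses the get_nano_ts() from the question):

// Sketch only: time two implementations of the same computation and print both
// results; if the printed values differ, one of the benchmarked versions is wrong.
#include <ctime>
#include <iostream>

long get_nano_ts() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000000000 + ts.tv_nsec;
}

long long sum_loop(long long n) {       // algorithm A: straightforward loop
    long long s = 0;
    for (long long i = 0; i < n; ++i) s += i;
    return s;
}

long long sum_formula(long long n) {    // algorithm B: closed-form expression
    return n * (n - 1) / 2;
}

int main() {
    const long long n = 100000000;

    long t0 = get_nano_ts();
    long long a = sum_loop(n);
    long t1 = get_nano_ts();
    long long b = sum_formula(n);
    long t2 = get_nano_ts();

    // Printing both results uses them (so neither computation can be elided)
    // and doubles as a correctness check between the two algorithms.
    std::cout << "loop:    " << a << " in " << (t1 - t0) << " ns\n";
    std::cout << "formula: " << b << " in " << (t2 - t1) << " ns\n";
}

(As discussed above, a good optimizer may well fold sum_loop into the same formula; the printed values at least confirm that both paths agree.)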

Galik
  • If the compiler can get rid of code, why would you want to profile it, since it won't be in a production build anyway? Enabling optimizations for benchmarking is absolutely desirable, and rather than using timer functions, use a profiler and get an accurate picture of your entire code. – Captain Obvlious Jul 02 '15 at 22:51
  • @LatencyGuy In that case you are not losing any code that you cannot afford to lose through optimization, because optimization won't change the functionality. It's just important to remember that if you don't use the result, the optimizer may just remove the entire calculation. – Galik Jul 02 '15 at 22:52
  • 1
    @CaptainObvlious Benchmark programs are very artificial environments for code, and you often want to test algorithms that, in real code, won't get optimized away but in your artificial benchmark can be, if the compiler realizes you never use the results. – Galik Jul 02 '15 at 22:55