
I have 3 separate global functions and I want to test their speed. I'm using this code:

        // case 1
        {
            chrono::duration<double, milli> totalTime{}; // accumulated total for this case
            for (uint32_t i{ 0 }; i < REPEATS; ++i)
            {
                auto start = chrono::steady_clock::now();
                func1(); // "normal" c++ code
                auto end = chrono::steady_clock::now();

                auto diff = end - start;
                totalTime += diff;
                cout << chrono::duration<double, milli>(diff).count() << " ms" << endl;
            }
            cout << "case 1 total: " << totalTime.count() << " ms" << endl;
        }

        // case 2
        {
            chrono::duration<double, milli> totalTime{};
            for (uint32_t i{ 0 }; i < REPEATS; ++i)
            {
                auto start = chrono::steady_clock::now();
                func2(); // multithreaded c++ code
                auto end = chrono::steady_clock::now();

                auto diff = end - start;
                totalTime += diff;
                cout << chrono::duration<double, milli>(diff).count() << " ms" << endl;
            }
            cout << "case 2 total: " << totalTime.count() << " ms" << endl;
        }

        // case 3
        {
            chrono::duration<double, milli> totalTime{};
            for (uint32_t i{ 0 }; i < REPEATS; ++i)
            {
                auto start = chrono::steady_clock::now();
                func3(); // SIMD c++ code
                auto end = chrono::steady_clock::now();

                auto diff = end - start;
                totalTime += diff;
                cout << chrono::duration<double, milli>(diff).count() << " ms" << endl;
            }
            cout << "case 3 total: " << totalTime.count() << " ms" << endl;
        }

These func1(), func2(), and func3() are global functions that don't change the state of the program (I don't have any global variables).

The output depends on which cases are run. If I run case 1 and case 2 I get 100 ms and 10 ms respectively. If I run case 1 and case 3 I get 100 ms and 130 ms. If I run cases 1, 2, and 3 I get 130 ms, 10 ms, and 120 ms. The first case became 30% slower and the third one became faster! If I run the cases separately I get 100 ms, 10 ms, and 130 ms. I tried turning optimisation off - the code became (surprise, surprise) much slower, but at least the results were the same regardless of the order of the cases. So I came to the conclusion that the compiler does something special. Is it true?

I'm using Win7 and VS 2013.

nikitablack
  • Why don't you provide the functions that are called? How would anyone know what optimisations the compiler might apply without knowing the functions? – Melkon Sep 10 '15 at 09:51
  • Order matters if there is common data that needs to be loaded from memory or disk for the first call, and is then cached or buffered for the subsequent calls. – Some programmer dude Sep 10 '15 at 09:52
  • *"So I came to a conclusion that compiler do something special. Is it true?"* Yes, they do! Without more information from your side, that is all the information I can provide. – MikeMB Sep 10 '15 at 09:55
  • Also, cache misses count a lot. I would suggest warming up the cache by looping through the methods several times and computing an average of how long every function takes. A single measurement is not relevant for any performance report. – dau_sama Sep 10 '15 at 09:58
  • @dau_sama If you take a closer look you'll notice that I'm using `REPEATS` in the loop. – nikitablack Sep 10 '15 at 12:48
  • How many times do you repeat it, though? Instead of dumping the timing at every iteration, store the timings in a vector and print the average at the end. – dau_sama Sep 10 '15 at 13:15

2 Answers


A few things can happen:

  1. Your test gets preempted by the kernel. There is not much you can do to prevent that; run your tests multiple times to make sure the results are consistent.
  2. The compiler can inline your functions and optimize the inlined code (see the note after this list).
  3. There is interaction between the functions. The simplest thing I would guess is memory allocation (e.g. func1() is optimized to request a memory block sufficient for the entire program, or one of the functions brings some memory blocks into the cache).
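
On cause 2: since you're on MSVC, one way to rule inlining out is to forbid it per function. This is just a sketch of the idea, not code from your program:

    // MSVC-specific: prevent inlining so each timed call is a real call.
    // (On GCC/Clang the equivalent is __attribute__((noinline)).)
    __declspec(noinline) void func1()
    {
        // ... "normal" c++ code ...
    }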

So suggestions:

  1. First run each function once before the benchmark, to get rid of some of the memory artifacts.
  2. Run the benchmark a few times and see how the values fluctuate, to eliminate some of the OS artifacts.
  3. Shuffle the order of the functions on each run, or take the order as a parameter, to make sure you eliminate ordering artifacts.
  4. Don't do cout in your loop, because that interacts with the OS and can mess up your cache or even get your process preempted. Write the results to some vector and output everything at the end (a sketch follows below).

There can be other things affecting your performance (e.g. disk, networking, other processes, memory and CPU load), so take the values with a grain of salt.
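
Putting the four suggestions together, a minimal harness could look like the sketch below. The `cases` vector, the warm-up pass, and the averaging are illustrative additions, not code from the question; only `func1()`/`func2()`/`func3()` and `REPEATS` come from there:

    #include <algorithm>
    #include <chrono>
    #include <cstdint>
    #include <iostream>
    #include <random>
    #include <string>
    #include <utility>
    #include <vector>

    using namespace std;

    void func1(); // "normal" c++ code
    void func2(); // multithreaded c++ code
    void func3(); // SIMD c++ code

    int main()
    {
        const uint32_t REPEATS = 100; // timed iterations per case

        vector<pair<string, void(*)()>> cases{
            { "func1", func1 }, { "func2", func2 }, { "func3", func3 }
        };

        // Suggestion 1: one warm-up call each, so first-touch memory and
        // cache effects aren't billed to whichever case runs first.
        for (auto& c : cases)
            c.second();

        // Suggestion 3: shuffle the order on every run of the benchmark.
        mt19937 rng{ random_device{}() };
        shuffle(cases.begin(), cases.end(), rng);

        // Suggestion 4: keep timings in memory, print only at the end.
        vector<pair<string, double>> results;
        for (auto& c : cases)
        {
            chrono::duration<double, milli> total{};
            for (uint32_t i{ 0 }; i < REPEATS; ++i)
            {
                auto start = chrono::steady_clock::now();
                c.second();
                total += chrono::steady_clock::now() - start;
            }
            results.emplace_back(c.first, total.count() / REPEATS);
        }

        for (auto& r : results)
            cout << r.first << ": " << r.second << " ms (average)" << endl;
    }

Running the whole benchmark several times (suggestion 2) and comparing the shuffled-order averages should tell you how much of the difference is a real property of the functions and how much is an ordering artifact.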

Sorin

Compiler optimizations vary. There are numerous things a compiler can do - one optimization (at least in GNU GCC) is aggressive loop unrolling. This might create faster code, but you must be aware that it can also cause cache misses, effectively slowing down your code. That is, if we take just the compiler optimizations into consideration.
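
As a rough, hand-written illustration (not actual compiler output), unrolling a simple summation loop by a factor of four conceptually produces something like this:

    #include <cstddef>

    // Sketch of what 4x loop unrolling conceptually does (assumes n is
    // a multiple of 4): fewer loop-condition checks per element, but a
    // larger loop body that occupies more instruction cache.
    double sum_unrolled(const double* a, std::size_t n)
    {
        double sum = 0.0;
        for (std::size_t i = 0; i < n; i += 4)
        {
            sum += a[i];
            sum += a[i + 1];
            sum += a[i + 2];
            sum += a[i + 3];
        }
        return sum;
    }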

Now you have three different cases that give different output depending on how they are run. This might be caused by alignment issues - if your code is properly aligned it will be faster, and if it isn't, the additional padding might slow it down - I've seen a similar thing happen in C#, but I can't find that thread now.

And the last thing that could happen is that you ran too few tests to be sure - 10k tests is a decent set, after which you can start comparing speeds. One-off tests can be affected by the OS, so keep that in mind.
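
A minimal sketch of that idea (the helper name `average_ms` is mine, not something from the question or a library):

    #include <chrono>
    #include <cstddef>

    // Run f() `runs` times and return the mean wall-clock time in ms.
    // One-off measurements are easily skewed by the OS; averaging over
    // many runs (e.g. 10000) smooths that out.
    template <typename F>
    double average_ms(F&& f, std::size_t runs)
    {
        using clock = std::chrono::steady_clock;
        std::chrono::duration<double, std::milli> total{};
        for (std::size_t i = 0; i < runs; ++i)
        {
            auto start = clock::now();
            f();
            total += clock::now() - start;
        }
        return total.count() / runs;
    }

Then something like `cout << average_ms(func1, 10000) << " ms\n";` gives you a number that is actually worth comparing.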

Oh, and because Microsoft is brilliant at writing compilers, there are bugs in certain versions. I don't think the world of Microsoft's C++ compiler - there are many hacks and workarounds, and it's not as up-to-date as other popular compilers - but that's simply my opinion. So another option is that the compiler is simply malfunctioning. Also, see this and this beautiful typedef.

MatthewRock