C++ code execution time varies with small source change that shouldn't introduce any extra work

Question

While working on benchmarking some code, I found that its execution time would vary with even the most innocuous code changes.

I have attempted to boil down the code below to the most minimal test case, but it is still rather lengthy (for which I apologize). Changing virtually anything largely affects the benchmark results.

#include <string>
#include <vector>
#include <iostream>
#include <random>
#include <chrono>
#include <functional>

constexpr double usec_to_sec = 1000000.0;

// Simple convenience timer
class Timer
{
    std::chrono::high_resolution_clock::time_point start_time;
public:
    Timer() : start_time(std::chrono::high_resolution_clock::now()) { }
    int64_t operator()() const {
        return static_cast<int64_t>(
        std::chrono::duration_cast<std::chrono::microseconds>(
            std::chrono::high_resolution_clock::now()-start_time).count()
        );
    }
};

// Convenience random number generator
template <typename T>
class RandGen
{
    mutable std::default_random_engine generator;
    std::uniform_int_distribution<T> distribution;

    constexpr unsigned make_seed() const {
        return static_cast<unsigned>(std::chrono::system_clock::now().time_since_epoch().count());
    }
public:
    RandGen(T min, T max) : generator(make_seed()), distribution(min, max) { }
    T operator ()() { return distribution(generator); }
};

// Printer class
class Printer
{
    std::string filename;
    template <class S>    
    friend Printer &operator<<(Printer &, S &&s);
public:
    Printer(const char *filename) : filename(filename) {}
};

template <class S>
Printer &operator<<(Printer &pm, S &&s) {
    std::cout << s;
    return pm;
}

// +------------+
// | Main Stuff |
// +------------+
void runtest(size_t run_length)
{
    static RandGen<size_t> word_sz_generator(10, 20);
    static RandGen<int> rand_char_generator(0, 25);

    size_t total_char_count = 0;
    std::vector<std::string> word_list;
    word_list.reserve(run_length);

    Printer printer("benchmark.dat");
    printer << "Running test... ";

    Timer timer; // start timer
    for (auto i = 0; i < run_length; i++) {

        size_t word_sz = word_sz_generator();
        std::string word;
        for (auto sz = 0; sz < word_sz; sz++) {
            word.push_back(static_cast<char>(rand_char_generator())+'a');
        }
        word_list.emplace_back(std::move(word));
        total_char_count += word_sz;
    }
    int64_t execution_time_usec = timer(); // stop timer

    printer << /*run_length*/ word_list.size() << " words, and " 
            << total_char_count << " total characters, were built in "
            << execution_time_usec/usec_to_sec << " seconds.\n";
}

int main(int argc, char **argv)
{
    constexpr size_t iterations = 30;
    constexpr size_t run_length = 50000000;

    for (auto i = 0; i < iterations; i++)
        runtest(run_length);

    return EXIT_SUCCESS;
}

The 1^st class, Timer, is just a small convenience class (intentionally not well-featured, for brevity) for timing the code.

I tried to do without the 2^nd class RandGen (which just generates random values), but any attempt to exclude this from the test code made the problem auto-magically disappear. So, I suspect the issue has something to do with it. But I can't figure out how.

The 3^rd class Printer seems entirely unnecessary for this question, but again, including it seems to exacerbate the issue.

So, now we're down to main() (which just runs the test) and runtest().

runtest() is hideous, so please don't look at it from a "clean code" point-of-view. Changing it in any way (ex. moving the inner for loop to its own function) results in a change in benchmark results. The simplest, and most perplexing example is the last line:

printer << /*run_length*/ word_list.size() << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

In the line above, run_length and word_list.size() are the same. The size of vector word_list is defined by run_length. But, if I run the code as-is, I get an average execution time of 9.8 seconds, whereas if I uncomment run_length and comment-out word_list.size(), the execution time actually increases to an average of 10.6 seconds. I can't fathom how such an insignificant code change could affect the timing of the whole program to such an extent.

In other words...

9.8 seconds:

printer << /*run_length*/ word_list.size() << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

10.6 seconds:

printer << run_length /*word_list.size()*/ << " words, and " 
        << total_char_count << " total characters, were built in "
        << execution_time_usec/usec_to_sec << " seconds.\n";

I have repeated the exercise of commenting and uncommenting the variables noted above, and re-running the benchmarks, many times. The benchmarks are repeatable and consistent - i.e. they are consistently 9.8 seconds and 10.6 seconds, respectively.

The code output looks like this, for the two cases:

Running test... 50000000 words, and 750000798 total characters, were built in 9.83379 seconds.
Running test... 50000000 words, and 749978210 total characters, were built in 9.84541 seconds.
Running test... 50000000 words, and 749996688 total characters, were built in 9.87418 seconds.
Running test... 50000000 words, and 749995415 total characters, were built in 9.85704 seconds.
Running test... 50000000 words, and 750017699 total characters, were built in 9.86186 seconds.
Running test... 50000000 words, and 749998680 total characters, were built in 9.83395 seconds.
...

Running test... 50000000 words, and 749988517 total characters, were built in 10.604 seconds.
Running test... 50000000 words, and 749958011 total characters, were built in 10.6283 seconds.
Running test... 50000000 words, and 749994387 total characters, were built in 10.6374 seconds.
Running test... 50000000 words, and 749995242 total characters, were built in 10.6445 seconds.
Running test... 50000000 words, and 749988379 total characters, were built in 10.6543 seconds.
Running test... 50000000 words, and 749969532 total characters, were built in 10.6722 seconds.
...

Any information on what would cause this discrepancy would be greatly appreciated.

Notes:

Even removing the unused std::string filename member object from the Printer class yields different benchmark results - where doing so, eliminates (or reduces to insignificant proportions) the difference between the two benchmarks provided above.
This does not seem to be an issue when compiling with g++ (on Ubuntu). Although, I can’t say this definitively; my tests with Ubuntu were in a VM on the same Windows machine, where the VM perhaps did not have access to all of the resources and processor enhancements.
I am using Visual Studio Community 2017 (version 15.7.4)
- Compiler version: 19.14.26431
- All tests and reported results are Release Build, 64-bit
System: Win10, i7-6700K @ 4.00 GHz, 32 GB RAM

If there's an MSVC STL expert looking in, how big does a string need to be before exceeding the limits of their small string optimization? It's probably greater than 20, but there's no harm in making sure. — dgnuff, Aug 03 '18 at 01:19
I should probably add a MSVC tag. I just tried the code in g++ on Ubuntu and it seems to work as one would expect (no difference between the tests). — Blair Fonville, Aug 03 '18 at 01:31
@RetiredNinja thank you. @BlairFonville, try it again with `static RandGen word_sz_generator(10, 20);` changed to `static RandGen word_sz_generator(8, 15);`. — dgnuff, Aug 03 '18 at 01:34
@dgnuff Great thought; same results. The execution time for each test dropped, of course, but the delta between them remained. — Blair Fonville, Aug 03 '18 at 01:38
This is probably not a `[tag:benchmarking]` problem; it's likely that you are correctly measuring the execution time. The question is why a minor source change results in the compiler making slower asm. Maybe using `run_length` tricks the compiler into keeping that in a register, instead of optimizing it away and just using pointer increments? MSVC is not a very good compiler, often emitting clunkier asm than `g++ -O3` or `clang++ -O3`. (g++ and clang++ are rarely optimal either, for small to medium loops.) — Peter Cordes, Aug 03 '18 at 03:52
If you remove the RNG, then probably too much optimizes away. Did you try using a simple/fast RNG? Or maybe that + a dummy function-call to a non-inline function that just returns 0? (Maybe disable link-time optimization so the compiler can't inline it.) Non-inline function calls clobber some registers and the compiler has to assume they modify any globals, so they have an effect on code-gen. — Peter Cordes, Aug 03 '18 at 04:07
@PeterCordes No, I did not try any of the above, so I look forward to doing so tomorrow when I’m back in the office. All of your comments have been very insightful - much appreciated. I’ll let you know how it goes. — Blair Fonville, Aug 03 '18 at 04:16
What is the exact version of your compiler? I cannot reproduce your results with 19.11.25545. — geza, Aug 03 '18 at 06:08
What CPU did you test on, exactly? It's probably different from @geza's (who also only listed compiler version, not hardware). Different CPUs will be sensitive to different subtle things. — Peter Cordes, Aug 03 '18 at 06:50

score 4 · Accepted Answer · answered Aug 03 '18 at 05:26

You're probably running into some kind of code-alignment effect. Modern x86-64 CPUs are fairly robust with respect to alignment most of the time, but alignment can affect branches aliasing each other in the branch predictors (like @rcgldr mentioned), and various front-end effects.

See https://agner.org/optimize/, and performance links in the x86 tag wiki. But honestly I don't think there's any useful explanation here, other than that you've discovered your loop is sensitive to alignment effects, either from the front-end or from branch prediction. This means that even identical machine code at different alignments in your main program could have different performance.

This is a known phenomenon. An answer on Code alignment in one object file is affecting the performance of a function in another object file has some general comments about how alignment can matter, and see also Why would introducing useless MOV instructions speed up a tight loop in x86_64 assembly? There's an article somewhere about how linking object files in a different order can affect performance (and that this is an unexpected effect from the toolchain), but I couldn't find it.

You can use HW performance counters to measure branch misprediction rates to see if that explains why one version is slower than the other. Or if there's some other front-end effect.

But unfortunately there's not much you can do; trivial source differences, if they affect the asm at all, will change alignment for everything.

You can sometimes redesign things to be less sensitive to branch-prediction by replacing branches with branchless code. e.g. always generate 16 bytes of random letters, and truncate that to a random length. (Some branching on size when copying it is probably inevitable, unless creating a 16-byte std::string and then truncating it can be branchless.)

You might speed that up with SIMD, e.g. use a vectorized PRNG like with an SSE2 or AVX2 xorshift+ to generate 16 bytes of random letters at a time. (efficiently getting a uniform 0..25 distribution with packed-byte operations may be tricky, but maybe the same technique as the 0..9 distribution I used to generate 1GiB of space-separated random ASCII digits per ~0.03 seconds on a 3.9GHz Skylake would be useful. It's not perfectly uniformly distributed, though, because 65536 % 10 has a remainder (like 65536/25), but you can probably change the quality vs. speed tradeoff and still run fast.)

Comparing compiler output from the two versions

The asm for both versions of the inner loop in the runtest function are essentially identical, at least if the compiler asm output we see on the Godbolt compiler explorer matches what you're actually getting in the executable from MSVC. (Unlike with gcc/clang, its asm output can't necessarily be assembled into a working object file.) If your real release build does any link-time optimization that could inline some library code, it might make different optimization choices in the final executable.

I put in a #ifdef so I could use -DUSE_RL to have two MSVC 2017 outputs that built the same source different ways, and feed those asm outputs to a diff pane. (The diff pane is at the bottom in the messy layout that I linked; click the fullscreen box on it to show just that.)

The only differences in the whole function are:

ordering and register choice for a few instructions like mov edx, DWORD PTR _tls_index and mov QWORD PTR run_length$GSCopy$1$[rbp-121], rcx at the top of the function which only run once. (But not in code-size so they won't affect alignment later). This should have no effect on later code, and they end up making the same changes to architectural state, just using a different scratch reg which is not used again.
stack layout (position of local variables relative to RBP). But all the offsets are under +127, so they can all still use a [rbp + disp8] addressing mode.

Different code-gen from the actual source difference:

      mov     rdx, QWORD PTR word_list$[rbp-113]
      sub     rdx, QWORD PTR word_list$[rbp-121]  ; word_list.size() = end - start 
      ...
      sar     rdx, 5               ; >> 5   arithmetic right shift

vs.

      mov     rdx, rsi             ; copy run_length from another register

And no, these instructions alone can't possibly explain the speed difference. They're only run once per timing interval, before some I/O.

An extra npad 7 for alignment before a branch target near the bottom of the function (after a call _Xtime_get_ticks), after the above code difference.

There's a big block of red/green differences, but those are only from different numbering of labels, except for those three instructions at the start of the function.

But before runtest, the word_list.size() version includes code for a ??$?6_K@@YAAEAVPrinter@@AEAV0@$QEA_K@Z PROC function which doesn't appear anywhere for the version using run_length. (C++ name-mangling turns types into funky characters in the asm names of functions.) This is doing something for class Printer.

You said removing the unused std::string filename from Printer removed the code-gen difference. Well that function probably goes away with that change. IDK why MSVC decided to emit it at all, let alone only in one version vs. another.

Probably g++ -O3 doesn't have that code-gen difference, and that's why you don't see a difference. (Assuming your VM is hardware virtualization, g++-generated machine code is still running natively on the CPU. Getting a new page of memory from the OS might take a tiny bit longer in the VM, but the main time spent in the loop is probably in user-space in this code.)

BTW, gcc warns

<source>:72:24: warning: comparison of integer expressions of different signedness: 'int' and 'size_t' {aka 'long unsigned int'} [-Wsign-compare]

     for (auto i = 0; i < run_length; i++) {
                      ~~^~~~~~~~~~~~

I didn't look closely at the asm output to see if that led to worse code-gen with gcc or MSVC, or if it's just going to be unsafe if you pass large inputs.

This is fantastic. I don't recall ever having learned so much from a single answer. Learning about Godbolt alone would have made it worth all the time I spent on this yesterday (I spent all day, literally), but for your analysis, you should have charged me. Superb answer. — Blair Fonville, Aug 03 '18 at 14:30
@BlairFonville: Since you mention it, I do freelance for pay as well as for fun on SO. If you want some expert help tuning your software, let me know (peter@cordes.ca). I don't have MSVC or a Windows dev setup at all, though (just gcc/clang on Linux). — Peter Cordes, Aug 03 '18 at 20:37
Copy that, and thanks. Actually, I only use MSVC for the IDE/debugger. I then compile everything on Linux to make sure the code is cross-platform. — Blair Fonville, Aug 03 '18 at 21:42

rcgldr · Answer 2 · 2018-08-03T07:46:02.543

I ran into a similar situation, minor changes to code were having significant effects on run time. After converting the code to assembly to control location of the code, I found a significant difference on an Intel 3770K 3.5ghz processor, depending on where calls and tight loops are located in memory. The most significant difference I found was a 36.5% difference in time, mentioned in this question I posted that was originally about using indexed branching into fall through code versus a tight loop. Even more strange is that it depended on a combination of locations as noted in the comments in the assembly code (perhaps a conflict in the instruction cache?), with the loop version time ranging from 1.465 seconds to 2.000 seconds, identical code, with the only difference being nops between functions used to align the code to specific boundaries.

Indexed branch overhead on X86 64 bit mode

I don't know if other processors would experience this much difference in performance due to location of code.

Interesting. I made a late note to my post that the issue didn’t seem to occur when compiling with g++ on Ubuntu, but I’m not sure how true that is. I was running Ubuntu in a VM on the same machine. Maybe it only behaved differently there because the VM doesn’t utilize all of the processor execution enhancements. Your answer might support that caveat. — Blair Fonville, Aug 03 '18 at 02:31

dgnuff · Answer 3 · 2018-08-03T07:03:44.103

I cannot explain this, you'd probably need to take it up with Microsoft to get to the bottom of the mystery. I took your code and compiled one program that had both versions of runtest(), one of which used this line:

printer << /*run_length*/ word_list.size() << " words, and "

and the other using

printer << run_length /* word_list.size() */ << " words, and "

For the record, I was building x64 under Visual Studio 2017 community, I don't have the ability to build x86, since I've erased the x86 toolchain and libraries (along with some unnecessary ARM stuff) to gain a gigabyte or so of space back.

My benchmark results are consistent with yours, I ran a smaller set of tests but still enough to consistently show the difference:

Slower version:

Running test... 50000000 words, and 749981638 total characters, were built in 16.3966 seconds.
Running test... 50000000 words, and 750037396 total characters, were built in 15.9712 seconds.
Running test... 50000000 words, and 749999562 total characters, were built in 16.0094 seconds.
Running test... 50000000 words, and 749990566 total characters, were built in 15.8863 seconds.
Running test... 50000000 words, and 749998381 total characters, were built in 15.8728 seconds.
Running test... 50000000 words, and 749997199 total characters, were built in 15.8799 seconds.

Faster version:

Running test... 50000000 words, and 750000053 total characters, were built in 15.3437 seconds.
Running test... 50000000 words, and 750014937 total characters, were built in 15.4479 seconds.
Running test... 50000000 words, and 750054238 total characters, were built in 15.2631 seconds.
Running test... 50000000 words, and 750012691 total characters, were built in 15.5289 seconds.
Running test... 50000000 words, and 750013435 total characters, were built in 15.3742 seconds.
Running test... 50000000 words, and 749969960 total characters, were built in 15.3682 seconds.

That said, the resulting assembler for the two routines is different. Not by much, but there are differences. Comparing the two side by size, one notable difference is that one of them uses r14 where the other uses rdi, plus there are a few other minor differences.

Here's a weird one. The "word_list.size()" version has this for the iteration of the main outer loop:

    for (auto i = 0; i < run_length; i++)
00007FF7C77D2CF9  inc         r13d
00007FF7C77D2CFC  mov         dword ptr [rbp-79h],r13d
00007FF7C77D2D00  movsxd      rax,r13d
00007FF7C77D2D03  cmp         rax,qword ptr [rbp-31h]
00007FF7C77D2D07  mov         r14d,0FFFFFFFFh
00007FF7C77D2D0D  lea         rcx,[word_sz_generator (07FF7C77D70F0h)]
00007FF7C77D2D14  jb          runtest+130h (07FF7C77D2B40h)
    int64_t execution_time_usec = timer(); // stop timer

while the "run_length" version does this:

    for (auto i = 0; i < run_length; i++)
00007FF7C77D270B  inc         r13d
00007FF7C77D270E  mov         dword ptr [rbp-79h],r13d
00007FF7C77D2712  movsxd      rax,r13d
00007FF7C77D2715  mov         r14,qword ptr [rbp-31h]
00007FF7C77D2719  cmp         rax,r14
00007FF7C77D271C  mov         edi,0FFFFFFFFh
00007FF7C77D2721  lea         rcx,[word_sz_generator (07FF7C77D9820h)]
00007FF7C77D2728  jb          runtest2+130h (07FF7C77D2550h)
    int64_t execution_time_usec = timer(); // stop timer

Notice how the faster version explicitly loads [rbp-31h] into r14 before comparing it to rax. Presumably so it can use it later. And it then puts 0FFFFFFFFh into edi. Meanwhile the slower version directly compares rax to memory and then loads that same constant into r14d.

Enough to create a 3% performance difference? Apparently so.

TL;DR The differences are there. I'm at a complete loss to explain them.

What CPU did you test on? This is optimized code, right? Surprising that both variables are spilled to memory. Both versions kept `run_length` in a register when I looked at the asm output on Godbolt for x86-64 MSVC 19 2017 in my answer. But it did still spill `i` after `inc`, though, and sign-extend before the compare. This code-gen makes sense: using a separate load to keep `run_length` will leave it in a reg for when the we leave the loop. Separate mov vs. cmp with a memory operand doesn't directly explain the perf difference for any CPU, though. Gotta be alignment effects. — Peter Cordes, Aug 03 '18 at 07:03
CPU is a Core i7-7820HK, 2.90 GHz. Alienware 15 R3 laptop, if you're insanely curious. Build is standard Visual Studio 2017 Community "Release", whatever optimization settings that equates to. Agree with the rest of your analysis - good call on that. — dgnuff, Aug 03 '18 at 07:11
@BeeOnRope the intention with that remark is to imply that we will never know. "Taking it up with Microsoft" is supposed to mean a pointless task, because I think just about everyone here will agree that Microsoft aren't likely to say much in a public forum about the inner details of how their optimizer works. — dgnuff, Aug 03 '18 at 18:12
Agreed. And thanks for the answer. It was very helpful to have some corroborating results. — Blair Fonville, Aug 04 '18 at 05:33

C++ code execution time varies with small source change that shouldn't introduce any extra work

3 Answers3

Comparing compiler output from the two versions

Linked