
I have two pieces of code which produced the following assembly line instructions from a gdb dump.

# faster on my CPU

# Dump of assembler code for function main():
# This was produced when I declared increment inside the loop
# <snipped> I can put back the removed portions if requested.
0x00000000004007ee <+17>:   movq   $0x0,-0x8(%rbp)
0x00000000004007f6 <+25>:   movl   $0x0,-0xc(%rbp)
0x00000000004007fd <+32>:   jmp    0x400813 <main()+54>
0x00000000004007ff <+34>:   movl   $0xa,-0x1c(%rbp)
0x0000000000400806 <+41>:   mov    -0x1c(%rbp),%eax
0x0000000000400809 <+44>:   cltq   
0x000000000040080b <+46>:   add    %rax,-0x8(%rbp)
0x000000000040080f <+50>:   addl   $0x1,-0xc(%rbp)
0x0000000000400813 <+54>:   cmpl   $0x773593ff,-0xc(%rbp)
0x000000000040081a <+61>:   jle    0x4007ff <main()+34>
# <snipped>
# End of assembler dump.

and then this piece of code.

# slower on my CPU

# Dump of assembler code for function main():
# This was produced when I declared increment outside the loop.
# <snipped>
0x00000000004007ee <+17>:   movq   $0x0,-0x8(%rbp)
0x00000000004007f6 <+25>:   movl   $0xa,-0x1c(%rbp)
0x00000000004007fd <+32>:   movl   $0x0,-0xc(%rbp)
0x0000000000400804 <+39>:   jmp    0x400813 <main()+54>
0x0000000000400806 <+41>:   mov    -0x1c(%rbp),%eax
0x0000000000400809 <+44>:   cltq   
0x000000000040080b <+46>:   add    %rax,-0x8(%rbp)
0x000000000040080f <+50>:   addl   $0x1,-0xc(%rbp)
0x0000000000400813 <+54>:   cmpl   $0x773593ff,-0xc(%rbp)
0x000000000040081a <+61>:   jle    0x400806 <main()+41>
# <snipped>
# End of assembler dump.

As can be seen, the only difference is the position of this line:

0x00000000004007f6 <+25>:   movl   $0xa,-0x1c(%rbp)

In one version it is inside the loop; in the other, it is outside it. I would expect the version with less inside the loop to run faster, yet it runs slower.

Why is this?

Extra Info

If relevant, here are the details of my own experiments and the C++ code that produced them.

I tested this across multiple computers running either Red Hat Enterprise Linux Workstation 7.5 or Windows 10. The computers in question had either a Xeon processor (Linux) or an i7-4510U (Windows 10). I compiled with g++ without any flags, or with Visual Studio Community 2017. All the results agreed: declaring the variable inside the loop resulted in a speedup.

Multiple runs had a runtime of ~5.00s (very little variance) when increment was declared inside the loop on a 64-bit Linux machine.

Multiple runs had a runtime of ~5.40s (Again, very little variance) when increment was declared outside the loop on the same machine.

Declaring the variable inside the loop:

#include <ctime>
#include <iostream>

using namespace std;

int main()
{
    clock_t begin, end;

    begin = clock();

    long int sum = 0;

    for(int i = 0; i < 2000000000; i++)
    {
        int increment = 10;
        sum += increment;
    }
    end = clock();
    double elapsed = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed << endl;
}

Declaring the variable outside the loop:

#include <ctime>
#include <iostream>

using namespace std;

int main()
{
    clock_t begin, end;

    begin = clock();

    long int sum = 0;
    int increment = 10;
    for(int i = 0; i < 2000000000; i++)
    {
        sum += increment;
    }
    end = clock();
    double elapsed = double(end - begin) / CLOCKS_PER_SEC;
    cout << elapsed << endl;
}

I have heavily edited this question because of feedback from the comments. It is much better now, thank you to those who helped refine it! My apologies to those who already put in the effort of answering my unclear question, if the answers and comments seem irrelevant it is because of my inability to communicate.

Peter Cordes
Asa

3 Answers


While it's generally true that a value we don't need to keep around can be dumped in a register and not pulled into main memory, as long as a register is available, the quote is at best over-simplified (or out of date) and at worst nonsense.

A compiler in 2018 knows whether you're going to re-use that value, whether the declaration is found within the loop body or not. Okay, so declaring the variable inside the loop makes the compiler's job a little easier, but your compiler is smart.

Moving the declaration in a trivial example like this just isn't going to have any effect on a program compiled by a modern toolchain. A C++ program is not a one-to-one mapping of machine instructions; it is a description of a program. The reason that people say "the difference is only academic" is that the difference is only academic. Like, literally.

Lightness Races in Orbit
  • I think I explained that I understand that it is only academic, and that a modern machine would happily get rid of it. I know that if I put on the optimization flag -O3 on g++ compiler that it will very merrily just multiply the whole thing and be done in next to no time. I am interested in the academic reason, that is the whole reason for the question. – Asa Sep 04 '18 at 17:05
  • 3
    @Asa But that's my point - it's academic because the source code is purely a theoretical representation of the resulting program, so changes you make in it need not have any direct physical result. This one doesn't. I find it helpful to stop thinking about "the optimiser" and bear in mind that there is a whole translation process going on that is not line-by-line. It's just that setting a low "optimisation level" prompts the compiler to produce more terrible code that matches the original input a little more directly. – Lightness Races in Orbit Sep 04 '18 at 17:06
  • @Asa And if you're looking for a practical explanation of the phenomenon referred to by the quoted claim, I can't do much better than telling you that said phenomenon does not exist ;) – Lightness Races in Orbit Sep 04 '18 at 17:08
  • I know what the optimizer is doing, (basically, I'm hardly an expert). I think my question is more on the hardware level than on the "What does this C++ code do". Something about this code, and the resulting hardware level instructions, triggers a speedup that doesn't make sense; even when viewed from a microcode level. I'm wondering what causes that effect. I'm just curious about strange results. – Asa Sep 04 '18 at 17:14
  • @Asa Then it sounds like this is more an assembly question than a c++ question. – François Andrieux Sep 04 '18 at 17:22
  • @Lightness Races in Orbit I noticed the results and didn't see the claim until after I started looking for an explanation. That was the closest thing to an explanation I could find, which is why I included it. (Pretty much to show that I wasn't just posting a question without doing some of my own work before hand.) Thanks for refuting the quote, it didn't make much sense to me and I doubted it's validity. Good to have a second opinion. – Asa Sep 04 '18 at 17:22
  • Thank you Fran(fancy c)ios Andrieux. I'll change the tags. (Sorry, I'm a newbie at putting in special characters) – Asa Sep 04 '18 at 17:24
  • @Asa Okay, perhaps through that editing (and my own tiredness) I've missed the core of your real question then – Lightness Races in Orbit Sep 04 '18 at 17:24
  • @LightnessRacesinOrbit Thanks for answering even though you are tired! Do you have any suggestions as how I can improve my question so it is less about the quote? – Asa Sep 04 '18 at 17:38
  • @Asa Oh I'm always tired ^_^ Looks like you've solved it with an edit already – Lightness Races in Orbit Sep 05 '18 at 10:31

Some Advice

First, you aren’t compiling with optimization. That’s a mistake. It’s not even a good idea to do when you’re debugging, unless you need to catch a logic error by single-stepping through some code. The code that will be emitted is so different from the final optimized version that it won’t have the same bugs. You want the compiler to expose faulty assumptions in your code!

Second, a much better way to see what assembly code you're generating is to compile with the -S flag and inspect the resulting file with a .s extension.

You should generally be compiling with optimization and warnings turned on, perhaps -g -O -Wall -Wextra -Wpedantic -Wconversion plus -std=c++17 or whatever version of the language you were writing in. You might want to set your CFLAGS/CXXFLAGS environment variables, or create a makefile.
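For example, a minimal makefile along those lines (the target and file names here are made up for illustration, and the flags are just the ones suggested above) might look like:

```make
# Hypothetical makefile for a single-file C++ benchmark.
CXX      = g++
CXXFLAGS = -g -O -Wall -Wextra -Wpedantic -Wconversion -std=c++17

# Build the optimized binary.
bench: bench.cpp
	$(CXX) $(CXXFLAGS) -o $@ $<

# Emit the generated assembly instead of an object file.
bench.s: bench.cpp
	$(CXX) $(CXXFLAGS) -S -o $@ $<
```

Running `make bench.s` then lets you read the compiler's actual output directly, rather than reconstructing it from a debugger dump.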

What Happened Here

Without optimization, the compiler is too brain-damaged even to keep increment in a register or fold it into a constant. The line in the code dump corresponding to int increment = 10; is movl $0xa,-0x1c(%rbp), which keeps the variable on the stack and stores the constant 10 into that memory location.

In the code fragment

long int sum = 0;

for(int i = 0; i < 2000000000; i++)
{
    int increment = 10;
    sum += increment;
}

the compiler can easily see that increment cannot be altered or used outside the body of the loop. It's only declared within the scope of the loop body and is always set to 10 at the start of each iteration. The compiler only needs to statically analyze the body of the loop to determine that increment is just a constant that can be folded.

Now compare:

long int sum = 0;
int increment = 10;
for(int i = 0; i < 2000000000; i++)
{
    sum += increment;
}

In this fragment, increment is just like sum: both variables are declared outside the loop, and neither is declared as a constant. In theory, its value could change between iterations of the loop, just like sum's. A human who knows C can easily see that increment won't change while the loop is running, and a decent compiler ought to be able to as well, but with optimization turned off completely, this one couldn't.

The unoptimized code does not even preserve this variable in a register between iterations of the loop! Looking at the code dump, the first instruction it executes on every iteration is mov -0x1c(%rbp),%eax, which re-loads the value of increment from memory. This is the proximate cause of the slowdown.

Some More Advice

Since increment is a constant known at compile time, it would be a good idea to declare it as constexpr in C++, or as static const in C. In such a simple example, a modern compiler should not need the hint, but in more complicated situations, it might still make a difference.

The real advantage is for a human maintainer. It tells the compiler to stop you from shooting yourself in the foot. I tend to write most of my code in static single-assignment style, which is what most C compilers transform the program into anyway, because it is easier for both computers and humans to understand and reason about. That is, whenever possible, all variables are declared as constants and set only once, so each name only ever has one meaning. Bugs where you think you're using the old value after you update, or the new value before you update, cannot occur. The optimizing compiler takes care of moving values into and out of registers for you.

Davislor
  • Thank you for the advice :). Due to feedback I realized that I had written my question in such a way that it was totally unclear. I have completely rewritten it (Sorry!). Both versions have `mov -0x1c(%rbp),%eax`, however only one version has `movl $0xa, -0x1c(%rbp)` declared inside the loop, and that is the faster version. Hopefully the current version of my question makes clear that I'm wondering why having an extra instruction increases execution speed. – Asa Sep 04 '18 at 20:10
  • The answer probably relates to the microcode of your CPU. When the store and the load are consecutive like that, it might realize that the value being moved to another register is the same as the immediate constant being stored, and fuse them. Just a guess. The real-world answer is, don’t turn your optimizer off, even when debugging. – Davislor Sep 04 '18 at 20:17
  • 1
    @Davislor: The modern meaning of "microcode" in this context is multiple uops read from a ROM when decoding a complex instruction. You're right that there's a weird micro-architectural effect, but it's not fusion between instructions. It's Sandybridge-family's funky store-forwarding behaviour where latency improves if the reload doesn't try to execute right away. My answer on [Adding a redundant assignment speeds up code when compiled without optimization](https://stackoverflow.com/q/49189685) covers it, including why gcc doesn't optimize across C statements at `-O0` (for debugging) – Peter Cordes Sep 04 '18 at 20:47
  • 1
    Anyway, this is an exact duplicate; but my dup-hammer didn't work because the C++ tag was removed and I added back the x86 and performance tags myself :/ Please help close it as a duplicate. (Or @asa you could click close as duplicate on your own question. Note that the other question is also asking about an extra asm store inside the loop causing the same effect you're seeing, because of similar C++ source.) – Peter Cordes Sep 04 '18 at 20:49
  • @PeterCordes Thanks! And done. – Davislor Sep 04 '18 at 20:58
  • @PeterCordes That is exactly what I was looking for! Thank you! I agree that it is a duplicate. Sorry my Google-Fu wasn't good enough to stumble across that answer in the first place. – Asa Sep 04 '18 at 21:16
  • 1
    @Asa: no worries; if you don't know the *cause*, it's not going to be easy to google on the things that will turn up that Q&A. It's not a bad thing to post a good question that experts can see is a duplicate, because SO is set up to handle it nicely. And now this is another signpost that someone in the future with the same question might find and be guided to the dup target. – Peter Cordes Sep 04 '18 at 21:23

This is clearly unoptimized code. First and foremost, that's just dead code, so with optimization it would go away. The compiler did what you asked: you put an additional assignment inside the loop and it performed it, simple as that. Unoptimized, it's no surprise that you see a difference. If you are interested in performance, then you want to work with optimized code. There are measurement problems with your experiment. This has nothing to do with registers and preservation of variables. You have two operations in one loop and one in the other. You need to work on and understand a whole lot more; these kinds of simple tests run into other issues, like alignment.

I can take two instructions, a subtract and a jump-if-not-zero back to the subtract. Depending on the architecture, the implementation, where the code is placed, and its alignment, those two instructions can have VASTLY different performance: same exact machine code, same computer/processor, even with very accurate time measurement, which you are not doing here either. Basically, loops like this are used to demonstrate how bad benchmarking is, not how good/useful it is.

You can't reliably get the compiler to both keep these variables in registers and optimize without removing the whole thing as dead code. So using a high-level language like this, you are causing memory accesses in a typical implementation, which, if they go to DRAM on a shared computer, one hopes are cached, but they might not be. Even if cached, the first pass through the loop can cost hundreds of cycles or more, and that can be noticeable depending on the number of iterations and the accuracy of the measurement.

Your begin/end variables are not volatile. Not that volatile is a magic solution, but I have seen optimizers place both reads before or after the loop, since neither is relevant to the other, making the measurement meaningless.

Abrash's Zen of Assembly Language and other books are good for learning about performance and its pitfalls: the importance of a good measurement, and taking care not to go down wrong assumption paths about what is going on.

Note that the prior question, like this one, should be closed as primarily opinion-based, but that prior question did have an accurate and complete answer, and that is the selected answer. And, like this one, the answer is: you have to measure it. With the code as written and the test as designed, your results can and will vary; yes, you can have more instructions execute faster than fewer, and it is often not difficult to demonstrate that. There is no universal better/faster/cheaper; asking questions with those words generally produces something unanswerable. With two operations in one loop versus one in the other, and no optimization (this code was not designed to be optimized, and using volatile won't necessarily save it), a compiler is most likely going to just emit the two operations in one loop and the one operation in the other, plus whatever overhead each needs. But I could choose a platform and probably show the loop with more operations as being faster. And so can you, with experience.

So despite the two operations and no optimization, it is still possible for the one-operation loop to be slower, but I wouldn't be surprised if the two-operation loop is slower in the majority of experiments.

old_timer
  • one vs two assumes the add and the store of the result count as one operation, but not all instruction sets (yours is not specified, though your disassembly implies one that can) can do this in one instruction. – old_timer Sep 04 '18 at 18:18