1

When it comes to temporarily storage for an existing value in a register, all modern compilers(at least the ones I experienced) do PUSH and POP instructions. But why not store the data in another register if it's available?

So, where should the temporarily storage for an existing value goes? Stack Or Register?

Consider the following 1st Code:

MOV ECX,16
LOOP:
PUSH ECX    ;Value saved to stack       
...     ;Assume that here's some code that must uses ECX register
POP ECX     ;Value released from stack  
SUB ECX,1
JNZ LOOP

Now consider the 2st Code:

MOV ECX,16
LOOP:
MOV ESI,ECX ;Value saved to ESI register    
...     ;Assume that here's some code that must uses ECX register
MOV ECX,ESI ;Value returned to ECX register
SUB ECX,1
JNZ LOOP

After all, which one of the above code is better and why?

Personally I think the first code is better on size since PUSH and POP only takes 1 bytes while MOV takes 2; and second code is better on speed because data moving between registers is faster than memory access.

dave
  • 4,812
  • 4
  • 25
  • 38
J.Smith
  • 111
  • 8
  • 5
    You usuall `push` values on the stack so you can use the registers they occupy. Why don't they move them to other registers? Probably because the other registers are needed for some values, too. – fuz Jan 10 '17 at 01:21
  • 3
    If the ESI register is free in the loop, you'd be better off to put the counter in ESI and simply not shuffle it around. If your compiler is smart, it will know that. Conclusion: either you have dumb compiler, or it knows that ESI isn't free in the loop, either, and there are no other free registers. In that case, the PUSH/POP combination isn't terrible. – Ira Baxter Jan 10 '17 at 02:23
  • 1
    *"all modern compilers(at least the ones I experienced) do PUSH and POP instructions"* ... this is quite bogus claim, try `gcc` or `clang`, they don't do that (unless they ran out of spare registers inside the loop, then I would bet they would rather use `[ebp-ofs]/[esp+ofs]` local variable). I would love to see some C/C++ source producing PUSH/POP with those two. Then again those two are also basically the only modern compilers, so I'm not sure what did you check. – Ped7g Jan 10 '17 at 04:05
  • Actually I tried to produce some source which would put counter for loop out of registers in `gcc`, and it's quite difficult with full optimizations on (to create some calculation which is not optimized out to some constant and requires many registers). I didn't manage to achieve clean "loop vars in regs, counter out" state, but when I exhausted registers enough (already some calculation variables done as `[esp-ofs]` local stack memory), the counter did land into `[esp-ofs]` local too, so the loop code was: `sub [esp-0x34],1` `jnz .L2` ... not even slightest sign of PUSH/POP possibility. – Ped7g Jan 10 '17 at 04:52
  • *"After all, which one of the above code is better and why?"* .. both are sub-optimal enough to change your compiler. – Ped7g Jan 10 '17 at 04:54
  • all modern compilers try to use registers as much as possible, and also try to do [omit](http://stackoverflow.com/q/14666665/995714) [frame pointer](http://stackoverflow.com/q/13006371/995714) to get another free register. However when the number is still not enough then obviously it'll need to do [register spilling](https://en.wikipedia.org/wiki/Register_allocation#Spilling) – phuclv Jan 10 '17 at 05:48
  • @IraBaxter,Fuz Thanks for posing out this. I actually came up with that second scenario myself, but yes, you're right, why not put counter into ESI in the first place? Ah, I was brainless:P. The question is pointless. However, addtionally, **I wonder if the two MOV in 2nd code is faster than the PUSH/POP in 1st code**. – J.Smith Jan 12 '17 at 01:57
  • The two register-to-register moves should be faster than the PUSH/POP, because the PUSH/POP has to access the cache, which is slower than accessing registers. – Ira Baxter Jan 12 '17 at 03:26

5 Answers5

2

It does make a lot of sense to do that. But I think the simplest answer is all the other registers are being used. In order to use some other register you would need to push it on the stack.

Compilers are smart enough. Keeping track of what is in a register for a compiler is somewhat trivial, that is not a problem. Speaking generically not necessarily x86 specific, esp when you have more registers (than an x86), you are going to have some registers that are used for input (in your calling convention), some you can trash, that may be the same as the input ones or not, some you cant trash you have to preserve them first. Some instruction sets have special registers, must use this one for auto increment, that one for register indirect, etc.

You will most definitely if not trivial to get the compiler to produce code for an arm for example where the input and the trashable registers are the same set, but that means that if you call another function and create the calling function right it needs to save something to use after the return:

unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
    return(more_fun(x)+x);
}
00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   e1a04000    mov r4, r0
   8:   ebfffffe    bl  0 <more_fun>
   c:   e0840000    add r0, r4, r0
  10:   e8bd4010    pop {r4, lr}
  14:   e12fff1e    bx  lr

I told you it was trivial. Now to use your argument backward, why didnt they just push r0 on the stack and pop it off later, why push r4? Not r0-r3 are used for input and are volatile, r0 is the return register when it fits, r4 almost all the way up you have to preserve (one exception I think).

So r4 is assumed to be used by the caller or some caller up the line, the calling convention dictates you cannot trash it you must preserve it so you have to assume it is used. You can trash r0-r3, but you cant use one of those as the callee can trash them too, so in this case we need to take the incoming value x and both use it (pass it on) and preserve it for after the return so they did both, the "used another register with a move" but in order to do that they preserved that other register.

Why save r4 to the stack in this case is very obvious, you can save it up front with the return address, in particular arm wants you to always use the stack in 64 bit chunks so two registers at a time ideally or at least keep it aligned on a 64 bit boundary, so you have to save lr anyway, so they are going to push something else too even if they dont have, to in this case the saving of r4 is a freebie, and since they need to save r0 and at the same time use it. r4 or r5 or something above is a good choice.

BTW looks like an x86 compiler did with above.

0000000000000000 <fun>:
   0:   53                      push   %rbx
   1:   89 fb                   mov    %edi,%ebx
   3:   e8 00 00 00 00          callq  8 <fun+0x8>
   8:   01 d8                   add    %ebx,%eax
   a:   5b                      pop    %rbx
   b:   c3                      retq 

demonstration of them pushing something that they dont need to preserve:

unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
    return(more_fun(x)+1);
}
00000000 <fun>:
   0:   e92d4010    push    {r4, lr}
   4:   ebfffffe    bl  0 <more_fun>
   8:   e8bd4010    pop {r4, lr}
   c:   e2800001    add r0, r0, #1
  10:   e12fff1e    bx  lr

No reason to save r4, they just needed some register to make the stack aligned, so in this case r4 was chosen, some versions of this compiler you will see r3 or some other register used.

Remember humans (still) write compilers and the optimizers, etc. So they why this and why that is really a question for that human or those humans, and we cant really tell you what they were thinking. It is not a simple task for sure, but it is not hard to take a reasonable sized function and/or project and find opportunities to hand tune compiler output, to improve it. Of course beauty is in the eye of the beholder, one definition of improve is another's definition of make worse. One instruction mix might use less total instruction bytes, so that is "better" by program size standards, another may or may not use more instructions or bytes, but execute faster, one might have less memory accesses at the cost of instructions to ideally execute faster, etc.

There are architectures with hundreds of general purpose registers, but most of the ones we touch products with daily dont have that many, so you can generally make a function or some code that has so many variables in flight in a function that you have to start saving off to the stack mid function. So you cant always just save a few registers at the beginning and the end of the function to give you more working registers mid function, if the number of working registers you need mid function is more registers than you have. It actually takes some practice to be able to write code that doesnt optimize to the point of not needing too many registers, but once you start to see how the compilers work by examining their output, you can write trivial functions like the ones above to prevent optimizations or force preservation of registers mid function, etc.

At the end of the day for the compiler to be somewhat sane it needs a calling convention, it keeps the authors from going crazy and the compiler from being a nightmare to code and manage. And the calling convention is very clearly going to define the input and output register(s) any volatile registers, and the ones that have to be preserved.

unsigned int fun ( unsigned int x, unsigned int y, unsigned int z )
{
    unsigned int a;

    a=x<<y;
    a+=(y<<z);
    a+=x+y+z;
    return(a);
}
00000000 <fun>:
   0:   e0813002    add r3, r1, r2
   4:   e0833000    add r3, r3, r0
   8:   e0832211    add r2, r3, r1, lsl r2
   c:   e0820110    add r0, r2, r0, lsl r1
  10:   e12fff1e    bx  lr

Only spent a few seconds on that but could have worked harder on it. I didnt push past four registers total, granted I had four variables. And I didnt call any functions so the compiler was free to just trash r0-r3 as needed as the dependencies worked out. So I didnt have to save r4 in order to create a temporary storage, it didnt have to use the stack it just optimized the order of execution to for example free up r2, the z variable so that later it could use r2 as an intermediate variable, one of the instances of a equals something. Keeping it down to four registers instead of burning a fifth one.

If I was more creative with my code and I added in calls to functions, I could get it to burn a lot more registers, you would see as even in this last case, the compiler has no problem whatsoever keeping track of what is where, and you will see when you play with the compilers there is no reason that they have to keep your high level language variables intact in the same register throughout much less execute in the same order you wrote your code (so long as it is legal), but they are still at the mercy of the calling convention, if any only some of the registers are considered volatile, and if you call a function from your function at a certain time in the code, then you have to preserve that content so you cant use them as long term storage, and the ones that are not volatile are already considered to be consumed so they have to be preserved to use them, then it becomes in part a performance question, does it cost more (size, speed, etc) to save to the stack on the fly or can I preserve up front in a way that possibly reduces instructions or can be invisible and/or consume less clocks with a larger transfer rather than separate, less efficient transfers mid function?

I have said this seven times now but the bottom line is the calling convention for that compiler (version) and target (and command line options/defaults). If you have volatile registers (arbitrary calling convention thing for general purpose registers, not a hardware/ISA thing) and you are not calling any other functions, then they are easy to use and save you expensive stack (memory) transactions. If you are calling someone then they can be trashed by them so they may no longer be free, depends on your code. The non-volatile registers are considered consumed by callers so you have to burn stack operations in order to use them, they are not free to use. And then it becomes performance as to when and where to use the stack, pushes and pops and movs. No two compilers are expected to generate the same code even if they use the same convention, but you can see above it is somewhat trivial to make test functions, compile them and examine the output, tweak here and there to navigate through and around that (compiler, version and target and convention and command line options) optimizer.

old_timer
  • 69,149
  • 8
  • 89
  • 168
  • "Remember humans write the optimizers": right. As they do it via algorithms (such as register allocators) that look for optimal configurations, probably even themselves cannot predict the outcomes. –  Jan 10 '17 at 09:51
  • What I meant by that is without a human writing the code to choose to copy to a register, the compiler cant/wont copy to a register. Neither of the optimizations above would have happened without a human knowing/thinking of that and implementing it. Now sure the implementation is likely an algorithm that chooses based on some parameters if it should try it, and with what registers. – old_timer Jan 10 '17 at 13:44
  • You will find that different compilers written by different humans tend to use different instruction mixes, partly due to different algorithms, partly due to different humans. When testing a processor I found one compiler didnt have code to use some of the instructions, would have never generated them other than perhaps inline assembly. Where others did have cases to use them. Some sequences would never happen in one compiler that would in others (found a chip bug by simply switching compilers with the same test code). – old_timer Jan 10 '17 at 13:46
1

Using a register is a bit faster, but requires you to keep track of which registers are available, and you can run out of registers. Also, this method cannot be use recursively. In addition, some registers will get trashed if you use INT or CALL to invoke a subroutine.

Use of the stack (POP and PUSH) can be used as many times as needed (so long as you don't run out of stack space), and in addition it supports recursive logic. You can use the stack safely with INT or CALL because by convention any subroutine should reserve its own portion of the stack, and must restore it to its previous state (or else the RET instruction would fail).

John Wu
  • 50,556
  • 8
  • 44
  • 80
1

Do trust the work of the optimizing compiler, based on the work of decades of code generation specialists.

They fill as much registers as available and extend to the stack when needed, comparing different options. And they also care about tradeoffs between storing a value for later reuse vs. recomputation of the value.

There is no single rule "register vs. stack", it's a matter of global optimization, taking into account the processor's peculiarities. And in general, there is no single "best solution" as it will depend on your "bestness" criteria.

Except when very creative workarounds can be found (or when exploiting data properties known of you only), you can't beat a compiler.

0

When thinking about speed, you always have to keep in mind a sense of proportion.

If the function being compiled calls other functions, those push and pop instructions may be insignificant, compared to the number of instructions executed in between them.

Compiler writers know, in that kind of case, which is very common, one shouldn't be penny-wise and pound-foolish.

Mike Dunlavey
  • 40,059
  • 14
  • 91
  • 135
0

By using PUSH and POP, you can save at least one registers. This will be significant if you working with limited available registers. On the other hand, yes, sometimes using MOV is better in speed, but you also have to keep in mind which register is used as a temporary storage. This will be hard if you want to store several values that needed to be processed later

duck
  • 369
  • 3
  • 17
  • If you have a spare register (like ESI), you don't need `mov` at all, just use `dec esi` / `jnz` like a normal person, not copying it to ECX and back. Both code blocks in the question are inefficient. If you did want to use stack memory, you'd use `sub dword [esp + 4], 1` / `jnz` or something, like a compiler would. – Peter Cordes Jan 22 '23 at 16:02