It does make a lot of sense to do that. But I think the simplest answer is all the other registers are being used. In order to use some other register you would need to push it on the stack.
Compilers are smart enough. Keeping track of what is in a register for a compiler is somewhat trivial, that is not a problem. Speaking generically not necessarily x86 specific, esp when you have more registers (than an x86), you are going to have some registers that are used for input (in your calling convention), some you can trash, that may be the same as the input ones or not, some you cant trash you have to preserve them first. Some instruction sets have special registers, must use this one for auto increment, that one for register indirect, etc.
You will most definitely if not trivial to get the compiler to produce code for an arm for example where the input and the trashable registers are the same set, but that means that if you call another function and create the calling function right it needs to save something to use after the return:
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
return(more_fun(x)+x);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: e1a04000 mov r4, r0
8: ebfffffe bl 0 <more_fun>
c: e0840000 add r0, r4, r0
10: e8bd4010 pop {r4, lr}
14: e12fff1e bx lr
I told you it was trivial. Now to use your argument backward, why didnt they just push r0 on the stack and pop it off later, why push r4? Not r0-r3 are used for input and are volatile, r0 is the return register when it fits, r4 almost all the way up you have to preserve (one exception I think).
So r4 is assumed to be used by the caller or some caller up the line, the calling convention dictates you cannot trash it you must preserve it so you have to assume it is used. You can trash r0-r3, but you cant use one of those as the callee can trash them too, so in this case we need to take the incoming value x and both use it (pass it on) and preserve it for after the return so they did both, the "used another register with a move" but in order to do that they preserved that other register.
Why save r4 to the stack in this case is very obvious, you can save it up front with the return address, in particular arm wants you to always use the stack in 64 bit chunks so two registers at a time ideally or at least keep it aligned on a 64 bit boundary, so you have to save lr anyway, so they are going to push something else too even if they dont have, to in this case the saving of r4 is a freebie, and since they need to save r0 and at the same time use it. r4 or r5 or something above is a good choice.
BTW looks like an x86 compiler did with above.
0000000000000000 <fun>:
0: 53 push %rbx
1: 89 fb mov %edi,%ebx
3: e8 00 00 00 00 callq 8 <fun+0x8>
8: 01 d8 add %ebx,%eax
a: 5b pop %rbx
b: c3 retq
demonstration of them pushing something that they dont need to preserve:
unsigned int more_fun ( unsigned int );
unsigned int fun ( unsigned int x )
{
return(more_fun(x)+1);
}
00000000 <fun>:
0: e92d4010 push {r4, lr}
4: ebfffffe bl 0 <more_fun>
8: e8bd4010 pop {r4, lr}
c: e2800001 add r0, r0, #1
10: e12fff1e bx lr
No reason to save r4, they just needed some register to make the stack aligned, so in this case r4 was chosen, some versions of this compiler you will see r3 or some other register used.
Remember humans (still) write compilers and the optimizers, etc. So they why this and why that is really a question for that human or those humans, and we cant really tell you what they were thinking. It is not a simple task for sure, but it is not hard to take a reasonable sized function and/or project and find opportunities to hand tune compiler output, to improve it. Of course beauty is in the eye of the beholder, one definition of improve is another's definition of make worse. One instruction mix might use less total instruction bytes, so that is "better" by program size standards, another may or may not use more instructions or bytes, but execute faster, one might have less memory accesses at the cost of instructions to ideally execute faster, etc.
There are architectures with hundreds of general purpose registers, but most of the ones we touch products with daily dont have that many, so you can generally make a function or some code that has so many variables in flight in a function that you have to start saving off to the stack mid function. So you cant always just save a few registers at the beginning and the end of the function to give you more working registers mid function, if the number of working registers you need mid function is more registers than you have. It actually takes some practice to be able to write code that doesnt optimize to the point of not needing too many registers, but once you start to see how the compilers work by examining their output, you can write trivial functions like the ones above to prevent optimizations or force preservation of registers mid function, etc.
At the end of the day for the compiler to be somewhat sane it needs a calling convention, it keeps the authors from going crazy and the compiler from being a nightmare to code and manage. And the calling convention is very clearly going to define the input and output register(s) any volatile registers, and the ones that have to be preserved.
unsigned int fun ( unsigned int x, unsigned int y, unsigned int z )
{
unsigned int a;
a=x<<y;
a+=(y<<z);
a+=x+y+z;
return(a);
}
00000000 <fun>:
0: e0813002 add r3, r1, r2
4: e0833000 add r3, r3, r0
8: e0832211 add r2, r3, r1, lsl r2
c: e0820110 add r0, r2, r0, lsl r1
10: e12fff1e bx lr
Only spent a few seconds on that but could have worked harder on it. I didnt push past four registers total, granted I had four variables. And I didnt call any functions so the compiler was free to just trash r0-r3 as needed as the dependencies worked out. So I didnt have to save r4 in order to create a temporary storage, it didnt have to use the stack it just optimized the order of execution to for example free up r2, the z variable so that later it could use r2 as an intermediate variable, one of the instances of a equals something. Keeping it down to four registers instead of burning a fifth one.
If I was more creative with my code and I added in calls to functions, I could get it to burn a lot more registers, you would see as even in this last case, the compiler has no problem whatsoever keeping track of what is where, and you will see when you play with the compilers there is no reason that they have to keep your high level language variables intact in the same register throughout much less execute in the same order you wrote your code (so long as it is legal), but they are still at the mercy of the calling convention, if any only some of the registers are considered volatile, and if you call a function from your function at a certain time in the code, then you have to preserve that content so you cant use them as long term storage, and the ones that are not volatile are already considered to be consumed so they have to be preserved to use them, then it becomes in part a performance question, does it cost more (size, speed, etc) to save to the stack on the fly or can I preserve up front in a way that possibly reduces instructions or can be invisible and/or consume less clocks with a larger transfer rather than separate, less efficient transfers mid function?
I have said this seven times now but the bottom line is the calling convention for that compiler (version) and target (and command line options/defaults). If you have volatile registers (arbitrary calling convention thing for general purpose registers, not a hardware/ISA thing) and you are not calling any other functions, then they are easy to use and save you expensive stack (memory) transactions. If you are calling someone then they can be trashed by them so they may no longer be free, depends on your code. The non-volatile registers are considered consumed by callers so you have to burn stack operations in order to use them, they are not free to use. And then it becomes performance as to when and where to use the stack, pushes and pops and movs. No two compilers are expected to generate the same code even if they use the same convention, but you can see above it is somewhat trivial to make test functions, compile them and examine the output, tweak here and there to navigate through and around that (compiler, version and target and convention and command line options) optimizer.