I'd [still!] like the full disassembly of case 2. But, I'll take a guess.
(1) The compiler fills rdi with a value [the correct one]. It is the address of src [probably from the new and/or malloc].
In the MS ABI, rdi is considered "non-volatile". It must be preserved by a callee.
(2) Case 2 then calls init_method. But, init_method does not preserve rdi [as it must]. It uses it for its own purpose (e.g. edi). So, upon return, rdi has been trashed!
(3) When the program returns from init_method, the compiler expects that rdi will have the same value it had after step (1). (i.e.) The compiler has no knowledge that init_method corrupted rdi, so it uses its value to set rcx [the first argument to ASM_Method]. This should be the src value but it's actually whatever value init_method set it to (i.e. a junk value, relatively speaking).
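To make steps (1) through (3) concrete, here is a toy C++ model of the failure. It is only a sketch: the Regs struct, init_method_broken, init_method_fixed, and caller are hypothetical names invented for illustration, and the "registers" are plain struct fields, not real machine state.

```cpp
#include <cassert>
#include <cstdint>

// Toy register file: only the two registers the explanation mentions.
struct Regs {
    std::uint64_t rdi = 0;
    std::uint64_t rcx = 0;
};

// Hypothetical callee that breaks the MS ABI: it uses rdi as scratch
// and returns without restoring it.
void init_method_broken(Regs& r) {
    r.rdi = 64;  // e.g. a counter left behind in edi
}

// Hypothetical callee that obeys the MS ABI: it saves rdi on entry
// (the "push rdi") and restores it before returning (the "pop rdi").
void init_method_fixed(Regs& r) {
    const std::uint64_t saved_rdi = r.rdi;
    r.rdi = 64;         // free to use rdi in between
    r.rdi = saved_rdi;  // restored before returning
}

// Models the compiler-generated caller in case 2. Returns the value that
// ends up in rcx, i.e. the first argument handed to ASM_Method.
std::uint64_t caller(std::uint64_t src_address, void (*init_method)(Regs&)) {
    Regs r;
    r.rdi = src_address;  // step (1): the compiler parks src in rdi
    init_method(r);       // step (2): call init_method
    r.rcx = r.rdi;        // step (3): the compiler trusts rdi is unchanged
    return r.rcx;
}

// Self-check of the model: only the ABI-conforming callee keeps src intact.
void check_model() {
    const std::uint64_t src = 0x1000;
    assert(caller(src, init_method_fixed) == src);
    assert(caller(src, init_method_broken) != src);
}
```

With the broken callee, ASM_Method would receive 64 instead of the address of src, which is the kind of junk pointer that produces the crash.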
UPDATE:
The ABI is different for various platforms [usually, just the compiler]. gcc and clang [when targeting Linux and other System V platforms] use a different calling convention than MS (i.e. MS is the odd duck or usual suspect). For example, under the System V convention that gcc/clang use there, rdi holds the first argument and is volatile.
Here's the wiki link that should highlight most of the ABIs: https://en.wikipedia.org/wiki/X86_calling_conventions
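If the same source has to reason about both conventions, the difference can be tabulated at compile time. The following is a rough sketch, not an exhaustive detector: the Abi enum and the helper functions are made-up names, and the check relies only on the predefined macros _WIN64, _M_X64, and __x86_64__, so less common toolchains may land in the Unknown branch.

```cpp
#include <cassert>
#include <cstring>

// Which x86-64 calling convention this translation unit is compiled for.
enum class Abi { MicrosoftX64, SystemV, Unknown };

constexpr Abi current_abi() {
#if defined(_WIN64) && (defined(_M_X64) || defined(__x86_64__))
    return Abi::MicrosoftX64;  // MSVC, or gcc/clang targeting 64-bit Windows
#elif defined(__x86_64__)
    return Abi::SystemV;       // gcc/clang on Linux, macOS, the BSDs, ...
#else
    return Abi::Unknown;       // not x86-64, or a toolchain not covered here
#endif
}

// Register that carries the first integer/pointer argument.
const char* first_int_arg_register(Abi abi) {
    switch (abi) {
        case Abi::MicrosoftX64: return "rcx";
        case Abi::SystemV:      return "rdi";
        default:                return "unknown";
    }
}

// Whether a callee must preserve rdi under the given convention.
const char* rdi_status(Abi abi) {
    switch (abi) {
        case Abi::MicrosoftX64: return "non-volatile";
        case Abi::SystemV:      return "volatile";
        default:                return "unknown";
    }
}

// Self-check of the table against the two conventions discussed above.
void check_abi_table() {
    assert(std::strcmp(first_int_arg_register(Abi::MicrosoftX64), "rcx") == 0);
    assert(std::strcmp(first_int_arg_register(Abi::SystemV), "rdi") == 0);
    assert(std::strcmp(rdi_status(Abi::MicrosoftX64), "non-volatile") == 0);
}
```

This is why hand-written assembly that trashes rdi can work when built for the System V convention and still break under the MS one.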
UPDATE #2:
But why does one refer to the stack (i.e. float src[64]) yet the other refers to registers (new float[64]) before calling?
Because of compiler optimization. To explain, we'll "turn off" optimization for a bit.
All function-scoped variables have a "reserved slot" in the function's stack frame. All these "slots" have a fixed offset within the stack frame that is known to [is computed by] the compiler. If the function has a stack frame at all [some leaf functions can elide it], then all variables have their slots, regardless of whether optimization is being used. Hold that thought ...
When you have a fixed size array as in case 1, the entire space (i.e. data) for that array is within the frame. So, the address of the given array is the frame pointer + the array's offset. Hence, the lea rcx,[rbp + offset_of_src]
Scalar variables have slots, too. That includes things like "pointers to arrays", which is what we have in case 2.
[Remember, optimization is off for the moment] Part of the missing code in case 2 was something like [simplified]:
// allocate src
call malloc
mov [rbp + offset_of_src],rax
// allocate dest
call malloc
mov [rbp + offset_of_dest],rax
// push arguments for init_method and call it
call init_method
// call ASM_Method
mov r8d,64
mov rdx,[rbp + offset_of_dest]
mov rcx,[rbp + offset_of_src]
call ASM_Method
Notice, here, we don't want to "push" the address of the pointer variable, we want to "push" the contents of the pointer variable.
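The address-versus-contents distinction between the two cases can be checked from C++ itself. This is a small sketch; SlotInfo and inspect_slots are names made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>

// What the two kinds of "src" look like from inside the function.
struct SlotInfo {
    std::size_t array_slot_bytes;    // frame space taken by float src[64]
    std::size_t pointer_slot_bytes;  // frame space taken by float* src
    bool array_address_is_data;      // is &src where the floats live?
    bool pointer_address_is_data;    // same question for the pointer
};

SlotInfo inspect_slots() {
    float src_array[64] = {};              // case 1: the data itself sits in the frame
    float* src_pointer = new float[64]();  // case 2: only the pointer sits in the frame

    SlotInfo info;
    info.array_slot_bytes = sizeof(src_array);
    info.pointer_slot_bytes = sizeof(src_pointer);
    // Case 1: the address of the variable IS the address of the data (hence lea).
    info.array_address_is_data =
        static_cast<void*>(&src_array) == static_cast<void*>(&src_array[0]);
    // Case 2: the variable merely holds the address of the data (hence mov).
    info.pointer_address_is_data =
        static_cast<void*>(&src_pointer) == static_cast<void*>(src_pointer);

    delete[] src_pointer;
    return info;
}

// Self-check: the array occupies 64 floats of frame space, the pointer one slot.
void check_slots() {
    const SlotInfo info = inspect_slots();
    assert(info.array_slot_bytes == 64 * sizeof(float));
    assert(info.pointer_slot_bytes == sizeof(void*));
}
```

So for the array, passing "the address of the variable" is correct, while for the pointer the compiler has to load what the variable contains.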
Now, let's turn the optimizer back on. Just because a function variable has a slot on the stack frame doesn't mean that the generated code is obligated to use it. For a simple function as in case 2, the optimizer realizes that it can use non-volatile registers to store the src and dest values and can eliminate stack access/storage for them.
So, with optimization, case 2 looks like:
// allocate src
call malloc
mov rdi,rax
// allocate dest
call malloc
mov rsi,rax
// push arguments for init_method and call it
call init_method
// call ASM_Method
mov r8d,64
mov rdx,rsi
mov rcx,rdi
call ASM_Method
The particular non-volatiles selected by the compiler are arbitrary. In this instance, they just happened to be rsi and rdi but there are others to choose from.
The compiler/optimizer is quite clever about selecting these registers and others to hold data values. It can see when a given function no longer needs the value in the register and can reassign it to hold another [unrelated] value if it chooses.
Okay, remember the "hold that thought"? Time to exhale. Normally, once a variable is given a register assignment, the compiler tries to leave it alone until it's no longer needed. But, sometimes, there aren't enough registers to hold all active variables at one time.
For example, if a function has [say] four nested for loops and uses 20 different variables, there aren't enough registers to go around. So, the compiler may have to generate code that "dumps" a value in a register back to the stack frame slot for the corresponding variable. This is a "register spill".
That's why there's always a slot in the stack frame for a scalar, even if it's never used [due to optimizing the value to a register]. It keeps the compilation process simpler and the offsets the same.
Also, we were talking about callee saved registers. But, what about caller saved registers? Most functions push non-volatiles upon entry and pop them at exit (i.e. they are preserving the non-volatiles for their caller). Volatile registers get no such protection.
A given function (e.g. A) may use a volatile register (e.g. r10) to hold a variable (e.g. sludge). If it calls another function (e.g. B), B might trash A's value.
So, if A wishes to preserve a value in r10 across a call to B, A must save it, call B, and then restore it:
mov [rbp + offset_of_sludge],r10
call B
mov r10,[rbp + offset_of_sludge]
So, it's handy to have a stack frame slot available.
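The same save/call/restore pattern can be mimicked in a toy C++ model. Again, this is only an illustrative sketch: Machine, B, A_without_save, and A_with_save are hypothetical, and the fields merely stand in for r10 and the stack frame slot of sludge.

```cpp
#include <cassert>
#include <cstdint>

// Toy machine state: one volatile register and one stack frame slot.
struct Machine {
    std::uint64_t r10 = 0;          // volatile (caller-saved) register
    std::uint64_t sludge_slot = 0;  // [rbp + offset_of_sludge]
};

// B is allowed to clobber r10 because r10 is volatile.
void B(Machine& m) {
    m.r10 = 0xDEAD;
}

// A keeps sludge in r10 and does NOT save it around the call.
std::uint64_t A_without_save(Machine& m, std::uint64_t sludge) {
    m.r10 = sludge;
    B(m);
    return m.r10;  // whatever B left behind
}

// A spills r10 to sludge's slot, calls B, then reloads it.
std::uint64_t A_with_save(Machine& m, std::uint64_t sludge) {
    m.r10 = sludge;
    m.sludge_slot = m.r10;  // mov [rbp + offset_of_sludge],r10
    B(m);                   // call B
    m.r10 = m.sludge_slot;  // mov r10,[rbp + offset_of_sludge]
    return m.r10;
}

// Self-check: only the save/restore version still sees sludge after B.
void check_caller_saved() {
    Machine m;
    assert(A_with_save(m, 42) == 42);
    assert(A_without_save(m, 42) != 42);
}
```

Note that the responsibility is reversed compared to rdi in the MS ABI: here B is within its rights, and it is A that must do the saving.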
Sometimes, the function has so many variables that the code generated for some of them looks like the non-optimized version:
mov rax,[rbp + offset_of_foo]
add rax,rdx
sub rax,rdi
mov [rbp + offset_of_foo],rax
because foo access/usage is too infrequent to merit a non-volatile register assignment.