Calling Convention Discrepancy in x86_64 Assembly

Question

So I have a assembly routine with 3 parameters ASM_Method(void*, void*, int) and init_method(float, int*). The ones of interest are the void pointers at the former.

When I call the method from the C++ file with the parameters as:

float src[64];
float dest[64];
int radius[3];

init_method(1.5, radius);
ASM_Method(src, dest, 64);

Disassembly of this calling process:

mov         r8d,100h  
lea         rdx,[rbp+0A0h]  
lea         rcx,[rbp-60h]  
call        ASM_Method

Initialized or not, the program works fine. HOWEVER, when I do:

float* src = new float[64];
float* dest = new float[64];
int radius[3];

init_method(1.5, radius);
ASM_Method(src, dest, 64);

When called, RCX is set to a value that is NOT the correct address, but RDX is correct. The program crashes as a result.

Disassembly of this calling process:

mov         r8d,100h  
mov         rdx,rbx  
mov         rcx,rdi  
call        ASM_Method

Unless I initialize src to some values, RCX changes to an invalid address (in this case, 1) when called.

Assembly code for ASM_Method:

mov rax, rdx
add rax, r8
shr r8, 4
inc r8
xor r9, r9
movdqu xmm1, [rax]

MainLoop:
movdqu xmm0, [rcx + r9]
movdqu [rdx + r9], xmm0
add r9, 16
dec r8
jnz MainLoop

movdqu [rax], xmm1

ret

Assembly code for init_method:

mulss xmm0, xmm0
mov ecx, 4
cvtsi2ss xmm1, ecx
mulss xmm0, xmm1

shr ecx, 2
cvtsi2ss xmm2, ecx
addss xmm2, xmm0
sqrtss xmm2, xmm2

stmxcsr roundFlags
or roundFlags, 2000h
ldmxcsr roundFlags

cvtss2si ecx, xmm2

stmxcsr roundFlags
and roundFlags, 0DFFFh
ldmxcsr roundFlags

mov eax, ecx
dec eax
bt ecx, 0
cmovnc ecx, eax

mov eax, 3
cvtsi2ss xmm1, eax
mulss xmm0, xmm1

cvtsi2ss xmm3, ecx
movss xmm2, xmm3
movss xmm4, xmm3

mulss xmm2, xmm2
mulss xmm2, xmm1

mov eax, 12
cvtsi2ss xmm1, eax
mulss xmm3, xmm1

mov eax, -4
cvtsi2ss xmm1, eax
mulss xmm4, xmm1
addss xmm4, xmm1

mov eax, 9
cvtsi2ss xmm1, eax

subss xmm0, xmm2
addss xmm3, xmm1
subss xmm0, xmm3
divss xmm0, xmm4

cvtss2si eax, xmm0

mov esi, ecx
add esi, 2

mov edi, ecx
cmp eax, 0
cmovle edi, esi
shr edi, 1
mov dword ptr [edx], edi

mov edi, ecx
cmp eax, 1
cmovle edi, esi
shr edi, 1
mov dword ptr [edx + 4], edi

mov edi, ecx
cmp eax, 2
cmovle edi, esi
shr edi, 1
mov dword ptr [edx + 8], edi

ret

What is going on?

I only see 3 parameters in `ASM_Method(void*, void*, int)` You should show your assembler code for `ASM_Method` — Michael Petch, Apr 05 '16 at 03:48
Would also help to know the target platforms. Windows? Linux? OSX? — Michael Petch, Apr 05 '16 at 03:50
On windows first 3 parameters are _RCX_, _RDX_, _R8_. Linux first 3 parameters are in _RDI_, _RSI_, _RDX_ . I will hope you are targeting 64-bit Windows. — Michael Petch, Apr 05 '16 at 03:52
The platform is windows, and the Assembly Method is posted above. — user2741371, Apr 05 '16 at 03:53
I think that because RCX when the method is called has a value of 1. — user2741371, Apr 05 '16 at 04:00
I've sort of paused here because I don't see a reason why _RCX_ would have 1 in it. Can you show us your entire assembler file and your entire C++ program that demonstrates the issue? There has to be more to this than what is shown. I am still also confused about the fact you say `ASM_Method(void*, void*, int)` has 5 parameters. Did you mean 3? — Michael Petch, Apr 05 '16 at 04:18
`mov rax, rdx` / `add rax, r8` should be `lea rax, [rdx+r8]`. Using indexed addressing modes in your loop [stops the store from micro-fusing](http://stackoverflow.com/a/31027695/224132), so you should ideally use a single-reg addressing mode for the dest. You can still use a 2-reg addressing mode for the src. (e.g. `rcx=src-dst`, `rdx=dst`. Load from `[rcx+rdx]`, store to `[rdx]`. Increment `rdx` in the loop.) That gets the loop down to 4 uops, so it can run at one iteration per clock. — Peter Cordes, Apr 05 '16 at 04:24
Yes, I meant 3, I'll change it. I'll also add some additional code. I don't know if it's relevant to the problem, but i'll post it anyway. — user2741371, Apr 05 '16 at 04:25
Additionally, try posting the disassembly of the callers of ASM_Method (i.e. the C++ code). — Craig Estey, Apr 05 '16 at 04:28
I'm also going to make sure you are aware, `float` on 64-bit windows are still 4 bytes wide. — Michael Petch, Apr 05 '16 at 04:31
In ASM_Method, you're adding `r8` [has 64] to `rax` [has `&dest`]. Thus, rax is `&dest[16]`. You're loading that into `xmm1` and writing it back at the end. So, you're saving/restoring `dest[16..19]`. But, `dest` is uninitialized. Is that what you want? Otherwise, ASM_Method seems like a fast copy version of `memcpy` using 16 byte memory operations. — Craig Estey, Apr 05 '16 at 04:52
`init_method` I am suprised that works if you are showing complete code `mov dword ptr [edx], edi` your using 32-bit memory addresses. Why not 64-bit? And where is _EDX_ even initialized? That is don a few places in that code. — Michael Petch, Apr 05 '16 at 04:54
In `init_method` I almost think you intended to use _RCX_ instead of _EDX_ since the first parameter (float) gets passed in xmm0, second parameter in _RCX_ (which is a pointer to the `radius` buffer), and you destroy _RCX_ at the start (over writing the pointer) — Michael Petch, Apr 05 '16 at 05:01
@Michael Petch That's not part of the problem. Whether EDX/RDX is initialized or not, it still points to the correct address. ECX/RCX does have 1 in most cases, and i'm trying to avoid that. — user2741371, Apr 05 '16 at 05:02
You still didn't do as I asked really. I said to post your **full** C++ and assembler files, and turn this into a [Minimal Complete Verifiable Example](http://stackoverflow.com/help/mcve). The reason I even mentioned the problems in `init_method` is because I don't even know how it makes it to your `ASM_method` since I'd expect `init_method` to segfault trying to access bogus memory addresses long before it even gets to `ASM_Method` — Michael Petch, Apr 05 '16 at 05:05
It doesn't segfault, as RDX is a valid memory address (the pointer to the radius array). That's why I'm confused as to why the current information I provides is not enough. — user2741371, Apr 05 '16 at 05:18
I hate to repeat myself. Disassemble the _caller_ of `ASM_Method` in the two cases. This is your two top posted code blocks. This is what shows the values that get put into `rcx, rdx, r8` that are coming up funky. This is _not_ `init_method`. It _is_ the code that _calls_ `init_method` and then _calls_ `ASM_Method`. You have two versions: one that works and one that doesn't. Post both. — Craig Estey, Apr 05 '16 at 05:18
@user2741371 : It is a very bad idea to rely on the value of a register that is not actually used to pass parameters, but let us say you want to do something boneheaded like that, then _RDX_ and _EDX_ are not the same. If you use _EDX_ for memory operations like `mov dword ptr [edx], edi` you have literally chopped the top 32 bit of the memory address in _RDX_ and then used it to access memory. Your `init_code` is so broken that I am amazed it doesn't crash. — Michael Petch, Apr 05 '16 at 05:33
From the case 1/case 2 asm blocks: Your 3rd argument is `64` but the asm inst is `mov r8d,100h`, which is `256` decimal--not what you specified. This should be `mov r8d,40h` in _both_ cases. For case 2, we need more disassembly from above as we don't have the code that sets `rbx` and `rdi`. In particular, since `rcx` is coming up bad in ASM_Method, it is set from `rdi`, so that value [and its history] is of particular importance — Craig Estey, Apr 05 '16 at 05:33

Craig Estey · Accepted Answer · 2016-04-05T07:39:20.250

I'd [still!] like the full disassembly of case 2. But, I'll take a guess.

(1) The compiler fills rdi with a value [the correct one]. It is the address of src [probably from the new and/or malloc].

In the MS ABI, rdi is considered "non-volatile". It must be preserved by a callee

(2) Case 2 then calls init_method. But, init_method does not preserve rdi [as it must]. It uses it for its own purpose (e.g. edi). So, upon return, rdi has been trashed!

(3) When the program returns from init_method, the compiler expects that rdi will have the same value it had after step (1). (i.e.) The compiler has no knowledge that init_method corrupted rdi, so it uses its value to set rcx [the first argument to ASM_Method]. This should be the src value but it's actually whatever value init_method set it to (i.e. a junk value, relatively speaking)

UPDATE:

The ABI is different for various platforms [usually, just the compiler]. gcc and clang have a different calling convention than MS (i.e. MS is the odd duck or usual suspect). For example, with gcc/clang, rdi holds the first argument and is volatile

Here's the wiki link that should highlight most of the ABIs: https://en.wikipedia.org/wiki/X86_calling_conventions

UPDATE #2:

But why does one refer to the stack (i.e float src[64]) yet the other refers to registers (new float[64])before calling?

Because of compiler optimization. To explain, we'll "turn off" optimization for a bit.

All function scoped variables have a "reserved slot" in the function's stack frame. All these "slots" have a fixed offset within the stack frame that is known to [is computed by] the compiler. If the function has a stack frame at all [some leaf functions can elide it], then all variables have their slots, regardless if optimization is being used or not. Hold that thought ...

When you have a fixed size array as in case 1, the entire space (i.e. data) for that array is within the frame. So, the address of the given array is the frame pointer + the array's offset. Hence, the lea rcx,[rbp + offset_of_src]

Scalar variables have slots, too. That includes things like "pointers to arrays", which is what we have in case 2.

[Remember, optimization is off for the moment] Part of the missing code in case 2 was something like [simplified]:

// allocate src
call malloc
mov [ebp + offset_of_src],rax

// allocate dest
call malloc
mov [ebp + offset_of_dest],rax

// push arguments for init_method and call it
call init_method

// call ASM_Method
mov r8d,64
mov edx,[ebp + offset_of_dest]
mov ecx,[ebp + offset_of_src]
call ASM_Method

Notice, here, we don't want to "push" the address of the pointer variable, we want to "push" the contents of the pointer variable.

Now, let's turn the optimizer back on. Just because a function variable has a slot on the stack frame doesn't mean that the generated code is obligated to use it. For a simple function as in case 2, the optimizer realizes that it can use non-volatile registers to store the src and dest values and can eliminate stack access/storage for them.

So, with optimization, case 2 looks like:

// allocate src
call malloc
mov rdi,rax

// allocate dest
call malloc
mov rsi,rax

// push arguments for init_method and call it
call init_method

// call ASM_Method
mov r8d,64
mov edx,rsi
mov ecx,rdi
call ASM_Method

The particular non-volatiles selected by the compiler are arbitrary. In this instance, they just happened to be rsi and rdi but there are others to choose from.

The compiler/optimizer is quite clever about selecting these registers and others to hold data values. It can see when a given function no longer needs the value in the register and can reassign it to hold another [unrelated] value if it chooses.

Okay, remember the "hold that thought"? Time to exhale. Normally, once a variable is given a register assignment, the compiler tries to leave it alone until it's no longer needed. But, sometimes, there aren't enough registers to hold all active variables at one time.

For example, if a function has [say] four nested for loops and uses 20 different variables, there aren't enough registers to go around. So, the compiler may have to generate code that "dumps" a value in a register back to the stack frame slot for the corresponding variable. This is a "register spill".

That's why there's always a slot in the stack frame for a scalar, even if it's never used [due to optimizing the value to a register]. It keeps the compilation process simpler and the offsets the same.

Also, we were talking about callee saved registers. But, what about caller saved registers. While most functions push non-volatiles upon entry and pop them at exit (i.e. they are preserving the non-volatiles for their caller).

A given function (e.g. A) may use a volatile register to hold something (e.g. r10) for a variable (e.g.) sludge. If it calls another function (e.g. B), B might trash A's value.

So, if A wishes to preserve a value in r10 across a call to B, A must save it, call B, and then restore it:

mov [rbp + offset_of_sludge],r10
call B
mov r10,[rbp + offset_of_sludge]

So, it's handy to have a stack frame slot available.

Sometimes, the function has so many variables that the code generated for some of them looks like the non-optimized version:

mov rax,[rbp + offset_of_foo]
add rax,rdx
sub rax,rdi
mov [rbp + offset_of_foo],rax

because foo access/usage is too infrequent to merit a non-volatile register assignment

I was going to post this [link](https://msdn.microsoft.com/en-us/library/6t169e9c.aspx?f=255&MSPPError=-2147217396) as a comment earlier. I figure _RDI_ being trashed is why he is seeing that one particular behavior. _RSI_ will suffer the same fate. — Michael Petch, Apr 05 '16 at 06:02
@MichaelPetch It took me a while longer because I'm so used to the non-MS ABI [that I had to look it up when nothing else made sense]. Well, problem solved [at last!]. And, I think you're needed elsewhere as I espied a MikeOS question :-) — Craig Estey, Apr 05 '16 at 06:18
Yep, that was the problem. But why does one refer to the stack (i.e float src[64]) yet the other refers to registers (new float[64])before calling? — user2741371, Apr 05 '16 at 06:19

Calling Convention Discrepancy in x86_64 Assembly

1 Answers1