Why does a function double dereference arguments stored on stack and how is that possible?

Question

I am trying to understand how `lfunc` loads its stack arguments into `flist` in the following assembly code, which I found in a book. (The book doesn't explain it. The code compiles and runs without errors and gives the intended output, "The string is: ABCDEFGHIJ".) However, I can't grasp the legality or logic of the code. What I don't understand is listed below.

In `lfunc`:

The non-volatile register RBX (per the Microsoft x64 calling convention) is not saved before it is zeroed with XOR. (But that is not what bugs me most.)
In the portion marked ";arguments on stack":
```
mov rax, qword [rbp+8+8+32]
mov bl,[rax]
```
Here [rbp+8+8+32] dereferences the corresponding address stored on the stack, so as I understand it RAX should be loaded with the value of `fourth`, which is the char 'D' (0x44). (Why qword?) And if so, what can dereferencing the char 'D' in the second line possibly mean? (There should be a memory address to dereference, but 'D' is a char.)

Original code is listed below:

    %include "io64.inc"
    ; stack.asm
    extern printf
    section .data
        first db "A"
        second db "B"
        third db "C"
        fourth db "D"
        fifth db "E"
        sixth db "F"
        seventh db "G"
        eighth db "H"
        ninth db "I"
        tenth db "J"
        fmt db "The string is: %s",10,0
    section .bss
        flist resb 14 ;length of string plus end 0
    section .text
    global main
    main:
        push rbp
        mov rbp,rsp
        sub rsp, 8
        mov rcx, flist
        mov rdx, first
        mov r8, second
        mov r9, third
        push tenth ; now start pushing in
        push ninth ; reverse order
        push eighth
        push seventh
        push sixth
        push fifth
        push fourth
        sub rsp,32 ; shadow
        call lfunc
        add rsp,32+8
        ; print the result
        mov rcx, fmt
        mov rdx, flist
        sub rsp,32+8
        call printf
        add rsp,32+8
        leave
        ret
    ;-------------------------
    lfunc:
        push rbp
        mov rbp,rsp
        xor rax,rax ;clear rax (especially higher bits)
        ;arguments in registers
        mov al,byte[rdx] ; move content argument to al
        mov [rcx], al ; store al to memory (reserved at section .bss)
        mov al, byte[r8]
        mov [rcx+1], al
        mov al, byte[r9]
        mov [rcx+2], al
        ;arguments on stack
        xor rbx,rbx
        mov rax, qword [rbp+8+8+32] ; rsp + rbp + return address + shadow
        mov bl,[rax]
        mov [rcx+3], bl
        mov rax, qword [rbp+48+8]
        mov bl,[rax]
        mov [rcx+4], bl
        mov rax, qword [rbp+48+16]
        mov bl,[rax]
        mov [rcx+5], bl
        mov rax, qword [rbp+48+24]
        mov bl,[rax]
        mov [rcx+6], bl
        mov rax, qword [rbp+48+32]
        mov bl,[rax]
        mov [rcx+7], bl
        mov rax, qword [rbp+48+40]
        mov bl,[rax]
        mov [rcx+8], bl
        mov rax, qword [rbp+48+48]
        mov bl,[rax]
        mov [rcx+9], bl
        mov bl,0 ; terminating zero
        mov [rcx+10], bl
        leave
        ret

Additional info:

I cannot look at the register values just after line 50, which corresponds to `xor rax,rax` in `lfunc`, because the debugger stops single-stepping there and skips ahead to line 37 of the main function, which corresponds to `add rsp,32+8`. Even if I set breakpoints between those lines in `lfunc`, the debugger simply hangs, so I have to abort debugging manually.

In portion ";arguments on stack"

     mov rax, qword [rbp+8+8+32]
     mov bl,[rax]

I am mentioning this again to be more precise about what I am asking, because the question was marked as a duplicate and the linked answers don't address my specific issue. As I see it, [rbp+8+8+32] == 0x44, because mov with square brackets clearly dereferences the address rbp+30h (which I assume is 64 bits wide). So, the size of 0x44 is byte. That is why I ask "Why qword?", because qword implies "lea [rbp+8+8+32]", which gives a qword reference, not mov. So if [rbp+8+8+32] equals 0x44, then [rax] == [0x0000000000000044], which is a garbage address (not relevant to our code here).

A `char*` in stack memory has to get loaded before the C dereference can happen to load the `char`. — Peter Cordes, Jul 31 '22 at 04:23
You're right that the function is buggy, destroying RBX without saving/restoring it, but apparently the CRT start code doesn't care. It probably only needs RSP and maybe RBP to be preserved by main. It's not a good example, though. They could easily have used RDX / DL for a 3rd temporary. — Peter Cordes, Jul 31 '22 at 04:24
And they could have used `movzx edx, byte [rax]` instead of spending an extra instruction to xor-zero RDX first. If they're on a CPU that renames DL separately from RDX (like Sandy Bridge or older, including P6 family), then there's no false dependency and no benefit to xor-zeroing first (because they don't later read the full RDX or in their case RBX). But on later Intel CPUs, or AMD, there's a false dependency of each `char` load on the previous. I guess xor-zeroing once does break any false dependency with some other previous use, and this dep chain is shortish. — Peter Cordes, Jul 31 '22 at 04:28
But clearly they don't care about performance *or* code size or general efficiency if they're doing stuff like `mov bl,0` / `mov [rcx+10], bl` instead of `mov byte [rcx+10], 0`, so IDK why they chose to xor-zero before writing BL. Oh, maybe so a debugger that shows you the value of the full register would be showing just the character value without high garbage, but they didn't want to use `movzx` in their first load for some reason. — Peter Cordes, Jul 31 '22 at 04:29
Maybe you didn't realize that `push tenth` is pushing a pointer (as an absolute address that has to fit in a 32-bit sign-extended immediate, so this can only build as a non-largeaddressaware program). This is NASM syntax, so `tenth` is the address, not a memory source operand. **Use a debugger to single step it and look at values in registers.** Or look at C compiler output for a function that would compile to similar asm, a function taking multiple `char*` args. — Peter Cordes, Jul 31 '22 at 04:35
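
The suggestion to compare against a C function taking multiple `char*` arguments can be sketched as follows. This is only an illustration, not the book's code: the name `lfunc_c` and the parameter names are assumptions, and the register assignments in the comments assume the Microsoft x64 calling convention mentioned in the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical C counterpart of lfunc (name and parameter names are
   assumptions). Every parameter is a pointer, not a char. Under the
   Microsoft x64 convention, out/a/b/c travel in RCX/RDX/R8/R9 and
   d..j are placed on the stack by the caller. */
void lfunc_c(char *out,
             const char *a, const char *b, const char *c, const char *d,
             const char *e, const char *f, const char *g, const char *h,
             const char *i, const char *j)
{
    assert(out != NULL);
    out[0] = *a;     /* mov al, byte [rdx]  then  mov [rcx], al */
    out[1] = *b;
    out[2] = *c;
    out[3] = *d;     /* stack argument: load the pointer first, then the byte */
    out[4] = *e;
    out[5] = *f;
    out[6] = *g;
    out[7] = *h;
    out[8] = *i;
    out[9] = *j;
    out[10] = '\0';  /* terminating zero */
}
```

Each `*d`-style expression for a stack argument needs two loads in asm: one to fetch the pointer from the stack slot, and one to fetch the byte it points to.
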
The problem is that the debugger stops single-stepping just after line 50, which corresponds to `xor rax, rax`. Anything after that resumes just after the function, at `add rsp, 32+8`. So, I can't know what happens in between. — Duke William, Jul 31 '22 at 08:03
32 bit sign extended? I thought x64 assembly has 8byte memory addresses. — Duke William, Jul 31 '22 at 08:05
It does, that's why it doesn't always work to use an address as the operand for push-immediate, thus why I commented on it. The normal way would be to `lea rax, [rel tenth]` / `push rax`, but your tutorial is simplifying with ways that work in some cases, like when symbol addresses are known to be in the low 2 GiB and you don't need position-independent code (usually you do want that for ASLR). See [How to load address of function or label into register](https://stackoverflow.com/q/57212012) / [32-bit absolute addresses no longer allowed in x86-64 Linux?](https://stackoverflow.com/q/43367427) — Peter Cordes, Jul 31 '22 at 08:08
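
The constraint described above (an address used as a push immediate must survive sign-extension from 32 to 64 bits) can be checked with a small C sketch. The function names are hypothetical, and the sample address in the usage note is simply the R8 value quoted later in this thread.

```c
#include <assert.h>   /* only for the checks in the usage note */
#include <stdbool.h>
#include <stdint.h>

/* Value that `push imm32` writes to the stack in 64-bit mode:
   the 32-bit immediate, sign-extended to 64 bits. */
uint64_t push_imm32_result(int32_t imm)
{
    return (uint64_t)(int64_t)imm;
}

/* True if a 64-bit absolute address can be encoded as a sign-extended
   32-bit immediate: either the low 2 GiB or the top 2 GiB. */
bool fits_push_imm32(uint64_t addr)
{
    return addr <= UINT64_C(0x000000007FFFFFFF) ||
           addr >= UINT64_C(0xFFFFFFFF80000000);
}
```

For instance, `assert(fits_push_imm32(0x403011));` passes, while an address such as `0x80000000` does not fit, because sign-extending its low 32 bits would set all the upper bits.
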
Re: A char* in stack memory has to get loaded before the C dereference can happen to load the char. But the chars are all loaded into memory in section .data so they can be dereferenced, aren't they? — Duke William, Jul 31 '22 at 09:27
You don't dereference a `char`, you dereference a pointer. The args to this function are all pointers, that's why the caller passed the address, not the char itself by value. That's why this is a duplicate of [x86 Nasm assembly - push'ing db vars on stack - how is the size known?](https://stackoverflow.com/q/22531645) which explains your misconception: `push tenth` pushes the address, not the character, so `mov rax, [rbp+stuff]` loads that address into a register, not the character. **Again, single-step your code with a debugger, and look at register values.** — Peter Cordes, Jul 31 '22 at 09:40
And if your debugger is skipping instructions, make sure you're telling it to single-step by instructions, not high-level statements or something. If your debugger can't do that, get a usable debugger. This is essential. You're wasting your time if you can't properly single-step, so fix your debug setup first. — Peter Cordes, Jul 31 '22 at 09:44
*So, the size of 0x44 is byte.* - No, that's a position in the stack where the argument is located. It's not the *size* of anything. It's not a `0x44` byte load, it's an 8-byte load of a pointer, from the address `rbp + 8+8+32`, i.e. `rbp + 48`. That's the address pushed by `push fourth`. i.e. it's like you did `push fourth` / `pop rax`, except it's just a `mov` load so reading it doesn't change RSP. — Peter Cordes, Jul 31 '22 at 09:47
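
The arithmetic behind `rbp + 8+8+32` can be written out as a small C sketch. It assumes the frame set up by `push rbp` / `mov rbp, rsp` in `lfunc` and 8-byte slots as in the Microsoft x64 convention; the function name is hypothetical.

```c
#include <assert.h>

/* Offset from RBP of the n-th argument slot (n is 1-based), assuming the
   callee ran `push rbp` / `mov rbp, rsp`:
     [rbp+0]  saved RBP
     [rbp+8]  return address
     [rbp+16] shadow (home) slots for arguments 1..4, 32 bytes
     [rbp+48] argument 5, then one 8-byte slot per further argument */
unsigned stack_arg_offset(unsigned n)
{
    assert(n >= 1);
    return 8u          /* saved RBP */
         + 8u          /* return address */
         + 8u * (n - 1u);
}
```

In `lfunc` the pointer to `fourth` is the fifth argument (after `flist`, `first`, `second`, `third`), so it sits at offset 48, and the pointer to `tenth` is the eleventh, at offset 96, which matches `rbp+48+48` in the listing.
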
Debugger is alright and I am clearly "stepping in" not "stepping over". Debugger hasn't had a problem with all the assembly programs I have previously tested. It is just with this code that I have observed this strange behaviour. It single-steps every line in main and in lfunc up to "xor rax,rax", then it tries to debug the line "mov al,byte[rdx]" and after a few seconds skips lfunc altogether, returns to main and resumes single-stepping the rest of the code normally. (I am manually stepping.) — Duke William, Jul 31 '22 at 10:03
Then either your debugger is *not* alright, or you're using it wrong. I don't even use Windows so I'm not sure which one to recommend. — Peter Cordes, Jul 31 '22 at 10:09
I clearly understand it is an address that is pushed onto the stack, but that is not what I am asking. It is dereferencing that address "mov rax, qword [rbp+8+8+32]" (mov with square brackets dereferences into load value) and then mov bl, [rax] tries again to dereference the already dereferenced value at [rbp+8+8+32]. Again to be clear at rbp+8+8+32 there is an address, but [rbp+8+8+32] is the load begins at address stored in rbp+8+8+32, because square brackets dereference what is between them. — Duke William, Jul 31 '22 at 10:16
Ok, now I see what you're asking. You think `mov rax, qword [rbp+8+8+32]` loads from stack memory and then dereferences, a memory-indirect addressing mode. But x86 doesn't have that, only register-indirect addressing for data. **It's just reloading what was pushed, not also dereferencing it.** Like I said in a previous comment, similar to `push fourth` / `pop rax`, but separated across caller/callee. If you use a working debugger, you'll see the value in RAX matches the value on the stack, it's not an ASCII `'D'` — Peter Cordes, Jul 31 '22 at 10:38
Or to put it another way: *but [rbp+8+8+32] is the load begins at address stored in rbp+8+8+32* - No, the address `mov` loads from *is* `rbp+8+8+32`, not an extra level of dereference. `mov rax, rbp+8+8+32` won't assemble; that's not a load, it's nothing. If it had any meaning, it would be an add operation on a register value with no memory access. — Peter Cordes, Jul 31 '22 at 10:42
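
In C terms, the two instructions correspond to two separate loads through a pointer-to-pointer. The following is a minimal sketch under that reading; `load_via_slot` and its parameter are hypothetical names, with `slot` standing in for the address `rbp+48`.

```c
#include <assert.h>
#include <stddef.h>

/* `slot` plays the role of the stack address rbp+48: it points at an
   8-byte cell that itself holds a pointer to a char. */
char load_via_slot(const char *const *slot)
{
    assert(slot != NULL && *slot != NULL);
    const char *ptr = *slot;  /* mov rax, qword [rbp+48] : fetch the stored pointer */
    char value = *ptr;        /* mov bl, [rax]           : fetch the byte it points to */
    return value;
}
```

After the first load, `ptr` holds the address of the character (the value that was pushed), not the character itself; only the second load produces `'D'`.
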
So, what "mov rax, [rbp+8+8+32]" does indeed is dereference RBP, then do math on it and load the resulting memory address into RAX. Am I right here? If so, what mov bl, [RAX] does is dereference RAX and then dereference the dereferenced value (the memory address stored in RAX). Isn't that extra dereferencing? Or are you suggesting (implying) that [Reg] is different from [Reg + integer] because there is additional math there, and the x86 or later x86-64 circuitry is not designed for that, hence it can't dereference [reg+int] like it does [reg]? — Duke William, Jul 31 '22 at 11:47
`[reg]` is exactly the same as `[reg + 0]`. `[reg + constant]` is adding to the register to get an address to access memory. The qword load is just reloading what you pushed earlier, which was itself an address. Each `mov reg, [something]` is one level of dereferencing, but note that in C terms, the values stored on the stack are still values. It's only an asm detail that it has to get loaded into a register to be dereferenced, because x86 doesn't have memory-indirect addressing. VAX did, for example, and could have accessed the byte holding `'D'` in one instruction, given a pointer in mem — Peter Cordes, Jul 31 '22 at 12:03
So no, `[rbp+8+8+32]` does not deref RBP alone and then do math on the load result. `[rbp]` would give you the caller's saved RBP value, not your stack arg. If you haven't looked at the more recent duplicates I added, see [A couple of questions about \[base + index\*scale + disp\] and AT&T disp(base, index, scale)](https://stackoverflow.com/q/27936196) and [Referencing the contents of a memory location. (x86 addressing modes)](https://stackoverflow.com/q/34058101) which describe what a single addressing mode can and can't do. Again, use a debugger to see what's going on. — Peter Cordes, Jul 31 '22 at 12:07
`mov rax, [rbp+48]` is like `add rbp, 48` / `mov rax, [rbp]`, but without changing RBP. Try it with a debugger. — Peter Cordes, Jul 31 '22 at 12:07
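
The equivalence stated above can be mimicked in C with a byte buffer standing in for the stack frame. This is a sketch only: `load_qword` is a hypothetical helper, and the buffer layout and the stored value are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* One 8-byte load from the address base + disp, like
   `mov rax, [rbp + disp]`. The addition is part of the address
   calculation; `base` itself is neither loaded from nor modified. */
uint64_t load_qword(const unsigned char *base, ptrdiff_t disp)
{
    assert(base != NULL);
    uint64_t value;
    memcpy(&value, base + disp, sizeof value);
    return value;
}
```

Calling `load_qword(rbp, 48)` and `load_qword(rbp + 48, 0)` on the same buffer returns the same qword, which corresponds to `[reg + 48]` versus adding 48 first and then using `[reg]`.
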
Alright. I think I understand. "mov rax, [rbp+48]" dereferences the stack address, which is a pointer to the pointer to 'fourth'. "mov bl, [rax]" dereferences the pointer to the symbol fourth ('0x44'). Are the qword and byte prefixes necessary? Doesn't the destination size imply the source size? Why does the code seem to use 3-byte addresses? Is it to save 5 bytes of memory per address because the program uses less than 16 MB? — Duke William, Jul 31 '22 at 15:19
Yes, `rbp+48` is a pointer to a pointer. [When do I need to specify the size of the operand in Assembly?](https://stackoverflow.com/q/64324864) - only when nothing implies the size, e.g. for a shift or movzx. Yes, size is redundant for `mov` to or from a register, that's why `mov bl,[rax]` in your code assembles. NASM doesn't allow ambiguity. — Peter Cordes, Jul 31 '22 at 18:52
No idea what you're talking about with 3-byte addresses. x86 machine code doesn't ever use 3-byte addresses, that's not saving size. Addressing modes are either `[reg + disp8]` with a 1-byte displacement, or `[reg + disp32]` with a 4-byte displacement. Either way sign-extended to 64-bit. If you mean `8+8+32`, that's one constant written as an assemble-time expression. Look at the machine code with a disassembler. — Peter Cordes, Jul 31 '22 at 18:55
I mean what I see in the debugger: rip 0x401015, rsp 0x60fe20, r8 0x403011, which are 24 bits. If I disassemble the built PE, it is all zero-extended 48-bit addresses, which would imply I am dealing with hundreds of terabytes of memory. — Duke William, Aug 01 '22 at 03:06
According to this article (https://www.ibm.com/docs/en/zos/2.4.0?topic=space-64-bit-addressing-mode-amode) there is only AMODE 24, AMODE 31 and AMODE 64 which is truncating first 40bits, 33bits and 0bits(no truncation) respectively. There is no 48bit which truncate 16bits. — Duke William, Aug 01 '22 at 03:16
But I have seen on some websites that, since 64-bit addressing covers an insane 16 exabytes of memory, hardware is designed to use only a 48-bit address space in x86-64 mode by default for current configurations. — Duke William, Aug 01 '22 at 03:20
Also, why sign extend? There is no 'minus' memory hence shouldn't addresses zero extend? — Duke William, Aug 01 '22 at 03:21
Then there is this source which is indicating middle portion of x64 address space is invalidated, which doesn't make sense because it doesn't save memory. — Duke William, Aug 01 '22 at 03:39
Original excerpt from source "Thus most architectures define an unimplemented region of the address space which the processor will consider invalid for use. x86-64 and Itanium both define the most-significant valid bit of an address, which must then be sign-extended to create a valid address. The result of this is that the total address space is effectively divided into two parts, an upper and a lower portion, with the addresses in-between considered invalid. Valid addresses are termed canonical addresses (invalid addresses being non-canonical)." — Duke William, Aug 01 '22 at 03:39
Oh, yeah, `ld` defaults to linking Linux non-PIE executables at a low address. Being in the low 24 bits isn't significant, it's just a nice low address far below the 2GiB boundary, allowing programs to use large static arrays almost that big. I ended up writing an answer on [Why Linux/gnu linker chose address 0x400000?](https://stackoverflow.com/a/73189318) with these reasons that existing answers didn't mention. (I included some explanation about x86-64 machine-code details that impose these limits; your comments here were helpful in realizing what a beginner might be missing.) — Peter Cordes, Aug 01 '22 at 06:00
https://www.ibm.com/docs/en/zos/2.4.0?topic=space-64-bit-addressing-mode-amode is about z/OS on the S/390x architecture, IBM's mainframes. AMODE 24 and so on are nothing to do with x86-64. — Peter Cordes, Aug 01 '22 at 06:03
x86-64 uses 64-bit virtual addresses, but current hardware requires them to be 48-bit or 57-bit (sign-extended), so the kernel can be in the high half (near the very top of 64-bit address space), and user-space at the bottom, regardless of how many bits the hardware actually supports. https://wiki.osdev.org/Higher_Half_Kernel. Limiting to 48-bit virtual lets the hardware use fewer tag bits in cache, and the program-counter register can be narrower, etc. Related: [ASLR and memory layout on 64 bits: Is it limited to the canonical part (128 TiB)?](https://stackoverflow.com/q/70137612) — Peter Cordes, Aug 01 '22 at 06:13
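
The "48-bit or 57-bit (sign-extended)" requirement can be expressed as a short C check. This is a sketch with a hypothetical function name; `bits` is the number of implemented virtual-address bits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* An address is canonical for a given width if bit (bits-1) and all
   higher bits are identical, i.e. the address is the sign-extension of
   its low `bits` bits. */
bool is_canonical(uint64_t addr, unsigned bits)
{
    assert(bits >= 2 && bits <= 63);
    uint64_t top = addr >> (bits - 1);             /* sign bit and everything above it */
    uint64_t all_ones = UINT64_MAX >> (bits - 1);  /* the same field with every bit set */
    return top == 0 || top == all_ones;
}
```

With `bits` set to 48, low addresses such as `0x60fe20` and high-half addresses starting at `0xFFFF800000000000` pass, while anything in the gap between `0x0000800000000000` and `0xFFFF7FFFFFFFFFFF` is rejected. That gap is the invalid middle region described in the quoted excerpt.
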
Also related: [Should pointer comparisons be signed or unsigned in 64-bit x86?](https://stackoverflow.com/q/47687805) - depends what you're trying to discover. (This is basically unrelated to your question about why `[reg + disp8]` and `[reg + disp32]` sign-extend their displacement; normally addresses themselves are considered unsigned, but 32-bit constant offsets to 64-bit pointers can be negative.) — Peter Cordes, Aug 01 '22 at 06:15
