Data alignment:
It is typical to think of memory as a flat array of bytes:
Data: | 0x50 | 0x69 | 0x70 | 0x43 | 0x68 | 0x69 | 0x70 | 0x73 |
Address: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
ASCII: | P | i | p | C | h | i | p | s |
However, the CPU itself does not read from, or write to, memory one byte at a time. For the sake of efficiency, it accesses memory a fixed number of bytes at a time. The size of the chunks in which the processor accesses memory is known as its memory access granularity (MAG).
Memory access granularity varies across architectures. As a general rule, MAG is equal to the native word size of the processor in question; e.g. IA-32 will have a 4-byte granularity.
If the CPU were to read only one byte at a time from memory, it would need to access memory 8 times in order to read the entirety of the above array. Compare this with a CPU that accesses memory 4 bytes at a time, i.e. a 4-byte granularity. In this case, the CPU would need to access memory only twice: 1 = bytes 0-3, 2 = bytes 4-7.
Where does memory alignment come into play:
Well, let’s assume a 4-byte MAG. As we have seen, in order to read the string “PipChips” from memory, the CPU would need to access memory twice. Now, let’s assume that the data was placed in memory slightly differently, starting one byte later:
Data: | 0x6B | 0x50 | 0x69 | 0x70 | 0x43 | 0x68 | 0x69 | 0x70 | 0x73 |
Address: | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
ASCII: | k | P | i | p | C | h | i | p | s |
In this example, to access the same data, the CPU would need to access memory a total of 3 times: 1 = bytes 0-3, 2 = bytes 4-7, and a third time to access “s” at memory address 8. Furthermore, the processor would have to perform additional work in order to shift out the unwanted bytes that were read from memory along with the string, because the data is stored at an unaligned address.
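The two access counts can be checked with a little assemble-time arithmetic. This is only a sketch: it assumes the 4-byte MAG used above, and the names MAG, LEN, FROM_ADDR_0, FROM_ADDR_1 and Accesses are made up for illustration.

```nasm
; Counting MAG-sized accesses with the NASM preprocessor (illustrative only).
%assign MAG  4     ; assumed memory access granularity, in bytes
%assign LEN  8     ; length of "PipChips"

; Accesses needed = number of MAG-sized blocks the string touches
;                 = ((start % MAG) + LEN + MAG - 1) / MAG   (integer division)
%assign FROM_ADDR_0  ((0 % MAG) + LEN + MAG - 1) / MAG    ; string starts at address 0 -> 2
%assign FROM_ADDR_1  ((1 % MAG) + LEN + MAG - 1) / MAG    ; string starts at address 1 -> 3

SECTION .data
Accesses: db FROM_ADDR_0, FROM_ADDR_1    ; emits the bytes 02 03
```

Note the whitespace after each % operator; without it the preprocessor may mistake the modulo for a macro token.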
This is where memory alignment comes into play. The CPU has a MAG, the main purpose of which is to increase machine efficiency. Therefore, aligning data in memory to match the machine's memory access boundaries creates more efficient code.
This is an (overly) simplistic explanation of memory alignment; however, it answers the questions:
1) What is causing this empty address space, as it is not included in my source code?
The ‘empty address space’ is generated by the alignment requirements of the SECTION data. NASM defaults are assumed if you do not specify values for the section properties. Please see the manual.
2) What are the specific reasons for this empty address space?
The overriding reason for aligning data in memory is software efficiency and robustness. As discussed, the processor will access memory at the granularity of its word size.
3) How is the size of the space calculated?
The assembler will pad out the section so that the data immediately following it is automatically aligned to the specified memory access boundary. In the original question, in the absence of the necessary padding, section .data would have ended at address 0x… 60012a, with section .bss starting at address 60012b. Here, the data would not have been properly aligned with the memory access boundary defined by the CPU's access granularity. Consequently, NASM, in its wisdom, adds a padding of one nul character in order to round the memory address up to the next address that is divisible by 4 (60012c), and hence properly align the data.
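The same rounding can be written out as assemble-time arithmetic. This is a sketch only: ADDR uses just the low-order digits quoted above, the 4-byte boundary is the one assumed throughout this answer, and the names ADDR, BOUNDARY, PAD, ROUNDED, PadBytes and RoundedAddr are made up for illustration.

```nasm
; Padding needed to reach the next 4-byte boundary (illustrative only).
%assign ADDR      0x60012b   ; where .bss would start without padding
%assign BOUNDARY  4          ; assumed alignment boundary, in bytes

; Bytes needed to reach the next multiple of BOUNDARY (0 if already aligned)
%assign PAD       (BOUNDARY - (ADDR % BOUNDARY)) % BOUNDARY
%assign ROUNDED   ADDR + PAD

SECTION .data
PadBytes:    db PAD          ; 1
RoundedAddr: dd ROUNDED      ; 0x60012c
```

0x60012b leaves a remainder of 3 when divided by 4, so one byte of padding is enough to reach 0x60012c.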
The subtleties of memory access are many; for a more in-depth explanation, please see the wiki, and numerous on-line articles, e.g. here; and for the masochistic among you, there are always the manuals!
Generally, data alignment is handled automatically by the compiler/assembler, although programmer control is an option and in some cases desirable.
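As a rough sketch of what that programmer control can look like in NASM: the align macro pads within a section, and the align= qualifier on a SECTION line requests a section alignment (this assumes an ELF output format such as -f elf64). The labels Msg, Value and Buffer are made up for illustration.

```nasm
SECTION .data
Msg:    db "abc"          ; 3 bytes, at offsets 0-2 of the section
        align 4, db 0     ; pad with zero bytes up to the next multiple of 4
Value:  dd 0x12345678     ; now placed at offset 4 rather than offset 3

SECTION .bss align=16     ; request 16-byte alignment for this section
Buffer: resb 64
```

Without the second argument, align pads with NOP instructions, which is rarely what you want in a data section.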
…………………………………………………………………………………………………………................................
Solving the original problem:
We are still left with the question of how to concatenate our two strings for output. We know now that the implementation of concatenating two strings across sections is not ideal, to say the least. Generally, we will not know where these sections are placed, in relation to each other, during runtime.
It is preferable, therefore, to concatenate these strings in a region of memory before making the syscall, as opposed to relying on the system call to provide the concatenation, based on assumptions of where the strings ought to be in memory.
We have several options:
1) Make two sys_write calls in succession, in order to print both strings and give the illusion in the output that they are one: Although straightforward, this makes little sense, as system calls are expensive.
2) Directly read the user input into place: This seems the logical and most efficient thing to do, at least at first glance, as we can write the string without moving any data around, and with only one syscall. However, we face the problem of inadvertently overwriting data, as we have not reserved the space in memory. Also, it seems ‘wrong’ to read user input into the initialized .data section; initialized data is data that has a value before the program begins!
3) Move ‘EncodedName’ in memory, so that it is contiguous with ‘OutputMsg’: This seems clean and simple. However, in reality it is not really any different from option 2, and suffers the same drawbacks.
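For comparison, the two-write option could look roughly like the sketch below (Linux x86-64). The text of OutputMsg is a placeholder, not the string from the original question, and the .exit label and the clean exit are my own additions.

```nasm
; Sketch: print 'OutputMsg' and 'EncodedName' with two sys_write calls.
ENCODELEN   equ 1024

SECTION .data
OutputMsg:  db "Your name is: "     ; placeholder text
OUTPUTLEN   equ $-OutputMsg

SECTION .bss
EncodedName: resb ENCODELEN

SECTION .text
global _start

_start:
    mov rax,0               ; sys_read
    mov rdi,0               ; stdin
    mov rsi,EncodedName     ; Buffer to read into
    mov rdx,ENCODELEN       ; Size of the buffer
    syscall
    mov r8,rax              ; Bytes read (0 at end of input, negative on error)

    mov rax,1               ; sys_write, first call: the fixed message
    mov rdi,1               ; stdout
    mov rsi,OutputMsg
    mov rdx,OUTPUTLEN
    syscall

    test r8,r8
    jle .exit               ; Nothing was read, so skip the second write
    mov rax,1               ; sys_write, second call: the user's input
    mov rdi,1               ; stdout
    mov rsi,EncodedName
    mov rdx,r8
    syscall

.exit:
    mov rax,60              ; sys_exit
    xor edi,edi             ; Exit status 0
    syscall
```

It works, and r8 survives the system calls, but every line of output costs two trips into the kernel instead of one.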
The solution: Create a memory buffer and concatenate the strings into this memory buffer, prior to the sys_write system call.
SECTION .bss
ENCODELEN: equ 1024 ; Defined before use: resb needs a size that is known on the first pass
EncodedName: resb ENCODELEN
COMPLETELEN: equ 2048
CompleteOutput: resb COMPLETELEN
User input will be read to ‘EncodedName’. We then concatenate ‘OutputMsg’ and ‘EncodedName’ at ‘CompleteOutput’, ready for writing to stdout:
; Read user input from stdin:
mov rax,0 ; sys_read
mov rdi,0 ; stdin
mov rsi,EncodedName ; Memory address of the buffer to read input into
mov rdx,ENCODELEN ; Length of memory buffer
syscall ; Kernel call
mov r8,rax ; Save the number of bytes read from stdin
; Move string 'OutputMsg' to memory address 'CompleteOutput':
mov rdi,CompleteOutput ; Destination memory address
mov rsi,OutputMsg ; Address of the string to move to the destination
mov rcx,OUTPUTLEN ; Length of string being moved
rep movsb ; Copy the string, one byte per iteration
; Concatenate 'OutputMsg' with 'EncodedName' in memory:
mov rdi,CompleteOutput ; Destination memory address
add rdi,OUTPUTLEN ; Skip past the string already moved, so that we append rather than overwrite
mov rsi,EncodedName ; Address of the string being appended
mov rcx,r8 ; String length: the number of bytes read by sys_read, saved in r8
rep movsb ; Copy the string into place
; Write string to stdout:
mov rdx,OUTPUTLEN ; Length of 'OutputMsg'
add rdx,r8 ; Add length of 'EncodedName'
mov rax,1 ; sys_write
mov rdi,1 ; stdout
mov rsi,CompleteOutput ; Memory address of the concatenated string
syscall ; Make system call
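One caveat: the code above trusts whatever sys_read returns. A minimal sketch of a guard is shown below, assuming it is slotted in directly after the sys_read call in place of the plain 'mov r8,rax'. The Exit label and the exit sequence are my own additions, not part of the original program.

```nasm
    ; Directly after the sys_read syscall:
    mov r8,rax              ; Bytes read, or a negative error code
    test r8,r8
    jle Exit                ; 0 = end of input, negative = error: nothing to append

    ; The two copies and the sys_write call follow here, unchanged.

Exit:                       ; Hypothetical label
    mov rax,60              ; sys_exit
    xor edi,edi             ; Exit status 0
    syscall
```

Without the check, a negative return value would end up in rcx as a huge repeat count for rep movsb.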
*Credit due to the comments in the original question, for pointing me in the right direction.