it appears to work... but the terminal seems to mimic the input I type in again.
No, the 5 + newline that bash reads is the one you typed. Your program waited for input but didn't actually read the input, leaving it in the kernel's terminal input buffer for bash to read after your program exited. (And bash does its own echoing of terminal input because it puts the terminal in no-echo mode before reading; the normal mechanism for characters to appear on the command line as you type is for bash to print what it reads.)
How did your program manage to wait for input without reading any? mov rsi, [rsp-8] loads 8 bytes from that address. You should have used lea to set rsi to point to that location instead of loading what was in that buffer. So read fails with -EFAULT instead of reading anything, but interestingly it doesn't check this until after waiting for there to be some terminal input.
I used strace ./foo to trace system calls made by your program:
execve("./foo", ["./foo"], 0x7ffe90b8e850 /* 51 vars */) = 0
read(0, 5
NULL, 1) = -1 EFAULT (Bad address)
write(1, "Hello, World\n", 13Hello, World
) = 13
exit(0) = ?
+++ exited with 0 +++
Normal terminal input/output is mixed with the strace output; I could have used -o foo.trace or whatever. The cleaned-up version of the read system call trace (without the 5\n mixed in) is:
read(0, NULL, 1) = -1 EFAULT (Bad address)
So (as expected for _start in a static executable under Linux), the memory below RSP was zeroed. But anything that isn't a pointer to writeable memory would have produced the same result.
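If strace isn't handy, the same failure can be observed through the exit status. This is only a sketch under assumptions: the file name efault.asm is hypothetical, and it passes a NULL buffer explicitly (which is what the buggy mov effectively loaded) and exits with the negated return value of read.

```nasm
; efault.asm (hypothetical name): pass a NULL buffer to read and report errno.
; Assumed build: nasm -felf64 efault.asm && ld -o efault efault.o
global _start

section .text
_start:
    xor   eax, eax      ; rax = __NR_read = 0
    xor   edi, edi      ; edi = fd = 0 (stdin)
    xor   esi, esi      ; rsi = buf = NULL, mimicking the zero loaded by the mov
    mov   edx, 1        ; rdx = count = 1
    syscall             ; on failure, rax = -errno

    neg   eax           ; eax = errno if read failed (0 if it returned 0 at EOF)
    mov   edi, eax      ; exit status = that value
    mov   eax, 60       ; rax = __NR_exit = 60
    syscall
```

Run from a terminal as ./efault; echo "status=$?" and type 5 + Enter. If read fails with -EFAULT as in the trace above, the status printed is 14, which is the value of EFAULT on Linux, and bash then still sees the unread 5.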
zx485's answer is correct but inefficient (large code-size and an extra instruction). You don't need to worry about efficiency right away, but it's one of the main reasons for doing anything with asm and there's interesting stuff to say about this case.
You don't need to modify RSP; you can use the red-zone (memory below RSP) because you don't need to make any function calls. This is what you were trying to do with rsp-8, I think. (Or else you didn't realize that it was only safe because of special circumstances...)
The read system call's signature is
ssize_t read(int fd, void *buf, size_t count);
so fd is an integer arg, so the kernel only looks at edi, not rdi. You don't need to write the full rdi, just the regular 32-bit edi. (32-bit operand-size is usually the most efficient thing on x86-64.)
But for zero or positive integers, just setting edi also sets rdi anyway. (Anything you write to edi is zero-extended into the full rdi.) And of course zeroing a register is best done with xor same,same; this is probably the best-known x86 peephole optimization trick.
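The zero-extension rule can be checked with a tiny standalone program. This is a sketch, not part of the original code: the file name zext.asm and the test value 5 are made up for illustration.

```nasm
; zext.asm (hypothetical name): writing edi clears the upper 32 bits of rdi.
; Assumed build: nasm -felf64 zext.asm && ld -o zext zext.o
global _start

section .text
_start:
    mov   rdi, -1       ; rdi = 0xFFFFFFFFFFFFFFFF (all bits set)
    mov   edi, 5        ; 32-bit write: rdi becomes 0x0000000000000005

    xor   eax, eax      ; eax = 0 (done before cmp, because xor writes flags)
    cmp   rdi, 5        ; 64-bit compare against the expected value
    setne al            ; al = 1 only if the upper bits were NOT cleared

    mov   edi, eax      ; exit status: 0 means zero-extension happened
    mov   eax, 60       ; rax = __NR_exit = 60
    syscall
```

Running ./zext; echo $? prints 0, since the 32-bit write leaves no stale upper bits in rdi.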
As the OP later commented, reading only 1 byte will leave the newline unread when the input is 5\n, and that would make bash read it and print an extra prompt. We can bump up the size of the read and the space for the buffer to 2 bytes. (There'd be no downside to using lea rsi, [rsp-8] and leaving a gap; I'm using lea rsi, [rsp-2] to pack the buffer right below argc on the stack, or below the return address if this was a function instead of a process entry point. Mostly to show exactly how much space is needed.)
; One read of up to 2 characters
; giving the user room to type a digit + newline
_start:
;mov eax, 0 ; set SYS_READ as SYS_CALL value
xor eax, eax ; rax = __NR_read = 0 from unistd_64.h
lea rsi, [rsp-2] ; rsi = buf = rsp-2
xor edi, edi ; edi = fd = 0 (stdin)
mov edx, 2 ; rdx = count = 2 char
syscall ; sys_read(0, rsp-2, 2)
; total = 16 bytes
This assembles like so:
+ yasm -felf64 -Worphan-labels -gdwarf2 foo.asm
+ ld -o foo foo.o
ld: warning: cannot find entry symbol _start; defaulting to 0000000000400080
$ objdump -drwC -Mintel
0000000000400080 <_start>:
400080: 31 c0 xor eax,eax
400082: 48 8d 74 24 fe lea rsi,[rsp-0x2]
400087: 31 ff xor edi,edi
400089: ba 02 00 00 00 mov edx,0x2
40008e: 0f 05 syscall
; next address = ...90
; I left out the rest of the program so you can't actually *run* foo
; but I used a script that assembles + links, and disassembles the result
; The linking step is irrelevant for just looking at the code here.
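For anyone who wants something they can actually run, here is a sketch of a complete program around that read. It assumes the rest of the original program is just the write of "Hello, World\n" and the exit(0) seen in the strace output; the labels msg and msg_len are my own names, and declaring _start global avoids the ld warning.

```nasm
; foo.asm: sketch of a complete program around the 2-byte read.
; Assumed build: nasm -felf64 foo.asm && ld -o foo foo.o
global _start

section .rodata
msg:     db "Hello, World", 10      ; 10 = newline
msg_len  equ $ - msg                ; 13 bytes

section .text
_start:
    xor   eax, eax          ; rax = __NR_read = 0
    lea   rsi, [rsp-2]      ; rsi = buf: 2 bytes in the red zone
    xor   edi, edi          ; edi = fd = 0 (stdin)
    mov   edx, 2            ; rdx = count = 2 (digit + newline)
    syscall                 ; read(0, rsp-2, 2)

    mov   eax, 1            ; rax = __NR_write = 1
    mov   edi, 1            ; edi = fd = 1 (stdout)
    lea   rsi, [rel msg]    ; rsi = address of the message
    mov   edx, msg_len      ; rdx = 13
    syscall                 ; write(1, msg, 13)

    mov   eax, 60           ; rax = __NR_exit = 60
    xor   edi, edi          ; edi = exit status 0
    syscall                 ; exit(0)
```

With input of a single digit plus Enter, both bytes are consumed, so nothing is left in the terminal buffer for bash to pick up afterwards.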
By comparison, zx485's answer assembles to 31 bytes. Code size is not the most important thing, but when all else is equal, smaller is better for L1i cache density, and sometimes decode efficiency. (And my version has fewer instructions, too.)
0000000000400080 <_start>:
400080: 48 c7 c0 00 00 00 00 mov rax,0x0
400087: 48 83 ec 08 sub rsp,0x8
40008b: 48 c7 c7 00 00 00 00 mov rdi,0x0
400092: 48 8d 34 24 lea rsi,[rsp]
400096: 48 c7 c2 01 00 00 00 mov rdx,0x1
40009d: 0f 05 syscall
; total = 31 bytes
Note how those mov reg,constant instructions use the 7-byte mov r64, sign_extended_imm32 encoding. (NASM optimizes those to 5-byte mov r32, imm32 for a total of 25 bytes, but it can't optimize mov to xor because xor affects flags; you have to do that optimization yourself.)
Also, if you are going to modify RSP to reserve space, you only need mov rsi, rsp, not lea. Only use lea reg1, [rsp] (with no displacement) if you're padding your code with longer instructions instead of using a NOP for alignment. For source registers other than rsp or rbp, lea won't be longer but it is still slower than mov. (But by all means use lea to copy-and-add. I'm just saying it's pointless when you can replace it with a mov.)
You could save even more space by using lea edx, [rax+1] instead of mov edx,1 at essentially no performance cost, but that's not something compilers normally do. (Although perhaps they should.)
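Applied to the 2-byte read, that trick might look like the sketch below. It only works because rax is known to be 0 at that point, and the count is 2 here rather than 1 to match the 2-byte buffer. The exit at the end is added just to make the snippet runnable, and the file name small.asm is hypothetical.

```nasm
; small.asm (hypothetical name): same read, with lea replacing mov edx, 2.
; Assumed build: nasm -felf64 small.asm && ld -o small small.o
global _start

section .text
_start:
    xor   eax, eax          ; 2 bytes: rax = __NR_read = 0
    lea   rsi, [rsp-2]      ; 5 bytes: rsi = buf in the red zone
    xor   edi, edi          ; 2 bytes: edi = fd = 0 (stdin)
    lea   edx, [rax+2]      ; 3 bytes: edx = 0 + 2, instead of 5-byte mov edx, 2
    syscall                 ; 2 bytes: read(0, rsp-2, 2); total = 14 bytes

    mov   eax, 60           ; rax = __NR_exit = 60
    xor   edi, edi          ; edi = exit status 0
    syscall
```

The lea has to come after the xor eax, eax that it depends on, so it adds a small dependency between the two instructions; that is the "essentially no performance cost" trade-off.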