
I wrote the following simplified implementation of cat in assembly. It uses linux syscalls because I am running linux. Here's the code:

.section .data
.set MAX_READ_BYTES, 0xffff

.section .text
.globl _start

_start:
    movq (%rsp), %r10 # save the value of argc somewhere else
    movq 16(%rsp), %r9 # save the value of argv[1] somewhere else

    movl $12, %eax # syscall 12 is brk. see brk(2)
    xorq %rdi, %rdi # call with 0 as first arg to get current end of memory
    syscall
    movq %rax, %r8 # this is the address of the current end of memory

    leaq MAX_READ_BYTES(%rax), %rdi # let this be the new end of memory
    movl $12, %eax # syscall 12, brk
    syscall
    cmp %r8, %rax # compare the two; if the allocation failed, these will be equal
    je exit

    leaq -MAX_READ_BYTES(%rax), %r13 # store the start of the free area in %r13

    movq %r10, %rdi # retrieve the value of argc
    cmpq $0x01, %rdi # if there are no cli args, process stdin instead
    je stdin

    # open the file
    movl $0x02, %eax # syscall #2 = open.
    movq %r9, %rdi
    movl $0, %esi # second argument: flags. 0 means read-only.
    xorq %rdx, %rdx # this argument isn't used here, but zero it out for peace of mind.
    syscall # returns the file descriptor number in %rax
    movl %eax, %edi
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

stdin:
    movl $0x0000, %edi # first argument: file descriptor.
    movl %edi, %r12d # first argument: file descriptor.
    call read_and_write
    jmp cleanup

read_and_write:
    # read the file.
    movl $0, %eax # syscall #0 = read.
    movl %r12d, %edi
    movq %r13 /* pointer to allocated memory */, %rsi # second argument: address of a writeable buffer.
    movl $MAX_READ_BYTES, %edx # third argument: number of bytes to read.
    syscall # num bytes read in %rax
    movl %eax, %r15d

    # print the file
    movl $1, %eax # syscall #1 = write.
    movl $1, %edi # first argument: file descriptor. 1 is stdout.
    movq %r13, %rsi # second argument: address of data to write.
    movl %r15d, %edx # third argument: number of bytes to write.
    syscall # result ignored.
    cmpq $MAX_READ_BYTES, %r15
    je read_and_write
    ret

cleanup:
    # close the file
    movl $0x03, %eax # syscall #3 = close.
    movl %r14d, %edi # first arg: file descriptor number.
    syscall # result ignored.

exit:
    # set the exit code
    movl $60, %eax # syscall #60 = exit.
    movq $0, %rdi # exit 0 = success.
    syscall

I have assembled this into an ELF binary called asmcat. To test this program, I've got the file /tmp/random:

$ wc -c /tmp/random
94870 /tmp/random

When I run the following, the results are consistent:

$ ./asmcat /tmp/random | wc -c
94870

Here are two separate runs of the same command:

$ cat /tmp/random | ./asmcat | wc -c
65536

$ cat /tmp/random | ./asmcat | wc -c
94870

Redirecting the output to a file consistently generates files of the same size:

for i in {0..25}; do
    cat /tmp/random | ./asmcat > /tmp/asmcat-output-$i
done
for i in {0..25}; do
    wc -c /tmp/asmcat-output-$i
done

All of the resulting files have the same size, 94870. This leads me to believe that the pipe to wc is what is causing the inconsistent behavior. All my program should be doing is reading stdin, 65535 bytes at a time, and writing to stdout. It's possible that there's a bug in the program, but then why would redirecting to a file consistently produce the full 94870 bytes? So my strong suspicion is that something about the piping is causing an inconsistent measure of my assembly program's output size.

Any feedback is welcome, including the approach taken in the assembly program (which I just wrote for fun/practice).

  • Did you try running this under `strace` to see the system-calls you're making? Looks like you're running into a limit on the max size of a single write into a pipe. Your code doesn't check the return value of `write` to see how many bytes each write actually copied to the output FD. Also looks like it will exit on any short read, so signals or TTY input will be a problem. Also, it would be a good idea to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. – Peter Cordes Apr 18 '21 at 03:04
  • Thanks @PeterCordes, that's a lot of good feedback. I've never used `strace` before, but I ran though it step by step with `gdb`. I'll play around with `strace` now. When using the stack as a buffer, what is the best practice for deciding where to start? Should I pop off argc and argv into registers, and then just use `%rsp` as the starting address for the buffer? Or should I use some kind of offset? I'll come back later and edit my post to clean it up based on your suggestions. – Peter Engelbert Apr 18 '21 at 03:34
  • @PeterCordes Oddly enough, I can't seem to reproduce the bad behavior when running `cat /tmp/random | strace ./asmcat | wc -c`. It always ends up with the right number of bytes. – Peter Engelbert Apr 18 '21 at 03:42
  • Yeah, I noticed the same thing while writing my answer. Fortunately I was able to explain why :) My initial comment had some guesses that weren't totally accurate. – Peter Cordes Apr 18 '21 at 03:43

1 Answer


TL:DR: If your program does two reads before cat can refill the pipe buffer,
the 2nd read gets only 1 byte. That makes your program decide to exit prematurely.

That's the real bug. The other design choices that make this possible are performance problems, not correctness.


Your program stops after any short-read (one where the return value is less than the requested size), instead of waiting for EOF (read() == 0). This is a simplification that's sometimes safe for regular files, but not safe for anything else, especially not a TTY (terminal input), but also not for pipes or sockets. e.g. try running ./asmcat; it exits after you press return on one line, instead of waiting for control-D EOF.

Linux pipe buffers are by default only 64kiB (pipe(7) man page), 1 byte larger than the weird odd-sized buffer you're using. After cat's write fills the pipe buffer, your 65535-byte read leaves 1 byte remaining. If your program wins the race to read the pipe before cat can write again, it reads only 1 byte.

Unfortunately, running under strace ./asmcat slows down the reads too much to observe a short-read, unless you also slow down cat or whatever other program to rate-limit the write side of your input pipe.

Rate-limited input for testing:

pv(1), the pipe-viewer, is handy for this, with its -L rate-limit option and a -B buffer-size limit so you can make sure its writes are smaller than 64k. (Doing a larger 64k write very infrequently might not always lead to short reads.) But if we just want short reads every time, running strace ./asmcat interactively, reading from a terminal, is even easier.

$ pv -L8K -B16K /tmp/random | strace ./orig_asmcat | wc -c
execve("./orig_asmcat", ["./orig_asmcat"], 0x7ffcd441f750 /* 55 vars */) = 0
brk(NULL)                               = 0x61c000
brk(0x62bfff)                           = 0x62bfff
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65535) = 819
write(1, "=head1 NAME\n\n=for comment  Gener"..., 819) = 819
close(0)                                = 0
exit(0)                                 = ?
+++ exited with 0 +++   # end of strace output
819                     # wc output
 819 B 0:00:00 [4.43KiB/s] [>              ]  0%        # pv's progress bar

vs. with a bugfixed asmcat, we get the expected sequence of short-reads and equal-sized writes. (See below for my version)

execve("./asmcat", ["./asmcat"], 0x7ffd8c58f600 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65536) = 819
write(1, "=head1 NAME\n\n=for comment  Gener"..., 819) = 819
read(0, "check if a\nnamed variable exists"..., 65536) = 819
write(1, "check if a\nnamed variable exists"..., 819) = 819

Code review

There are multiple wasted instructions, e.g. a mov that writes a register you never read again, like setting EDI before a call even though the function actually takes its arg in R12D instead of following the standard calling convention.

Reading argc, argv early instead of just leaving them on the stack until they're needed is similarly redundant.

.data is pointless: .set is an assemble-time constant. It doesn't matter what the current section is when you define it. You could also write it as MAX_READ_BYTES = 0xffff, more natural syntax for assemble-time constants.

You could allocate your buffer on the stack instead of with brk (it's only 64K - 1, and x86-64 Linux allows 8MiB stacks by default), in which case popping argc/argv early could make sense. Or just use the BSS, e.g. .lcomm buf, 1<<16

It would be a good idea to make your buffer a power of 2, or at least a multiple of the page size (4k), for efficiency. If you use it to copy files, every read after the first one will start near the end of a page, instead of copying a whole number of 4k pages, so the kernel's copy_to_user (read) and copy_from_user (write) will be touching 17 pages of kernel memory per read/write instead of 16. The pagecache for the file data may not be in contiguous kernel addresses, so each separate 4k page takes some overhead to find, and start a separate memcpy for (rep movsb on modern CPUs with the ERMSB feature). Also for disk I/O, the kernel will have to buffer your writes back into aligned chunks of some multiple of the HW sector size and/or filesystem block size.

64KiB is clearly a good choice when reading from pipes, for the same reason this race was possible. Leaving 1 byte is obviously inefficient. Also, 64k is smaller than L2 cache sizes, so the copy to/from user-space (inside the kernel in your system calls) can re-read from L2 cache when you write again. But smaller sizes mean more system calls, and each system-call has significant overhead (especially with Meltdown and Spectre mitigation in modern kernels.)

64KiB to 128KiB is about a sweet spot for buffer size, given 256KiB L2 caches being typical. (Related: code golf: Fastest yes in the West tunes a program that just makes write system-calls, with x86-64 Linux, with profiling / benchmark results on my Skylake desktop.)

Nothing in the machine code benefits from the size fitting in a uint16_t like 0xFFFF does; either int8_t or int32_t are relevant for immediate operand sizes in 64-bit code. (Or uint32_t if you're zero-extending like mov $imm32, %edx to zero-extend into RDX.)

Don't close stdin; you run close unconditionally. Closing stdin doesn't affect the parent process's stdin, so it shouldn't be a problem in this program, but the whole point of close seems to be to make this more like a function you could use from a larger program. So you should separate your copying-an-fd-to-stdout from the file handling.

Use #include <asm/unistd.h> to get call numbers instead of hardcoding them. They're guaranteed stable, but it's more human readable / self-documenting to just use the named constants, and avoids any risk of copying errors. (Build with gcc -nostdlib -static asmcat.S -o asmcat; GCC runs .S files through the C preprocessor before assembling, unlike .s)

Style: I like to indent operands to a consistent column so they're not crowding mnemonics. Similarly, comments should be comfortably to the right of operands so you can scan down the column for instructions accessing any given register without getting distracted by comments on shorter instructions.

Comment content: The instruction itself already says what it does, the comment should describe the semantic meaning. (I don't need comments to remind me of calling conventions, like that system calls leave a result in RAX, but even if you do, summarizing the system call with a C version of it can be a good reminder of which arg is which. Like open(argv[1], O_RDONLY).)

I also like to remove redundant operand-size suffixes; the register sizes imply operand-size (just like Intel-syntax). Note that zeroing a 64-bit register only requires xorl; writing a 32-bit register implicitly zero-extends to 64-bit. Your code is sometimes inconsistent about whether things should be 32 or 64-bit. In my rewrite, I used 32-bit everywhere I could. (Except cmp %rax, %rdx return value from write, which seemed like a good idea to make 64-bit, although I don't think there's any real reason.)


My rewrite:

I removed the call/ret stuff, and just let it fall through into cleanup/exit instead of trying to separate it into "functions".

I also changed the buffer size to 64KiB exactly, allocated on the stack with 4k page alignment, and rearranged things to simplify and save instructions everywhere.

Also added a # TODO comment about short writes. That doesn't seem to happen for pipe writes up to 64k; Linux just blocks the write until the buffer has room, but could be a problem writing to a socket maybe? Or maybe only with a larger size, or if a signal like SIGTSTP or SIGSTOP interrupts write()

#include <asm/unistd.h>
BUFSIZE = 1<<16

.section .text
.globl _start
_start:
    pop  %rax      # argc
    pop  %rdi
    pop  %rdi      # argv[1]
     # you'd only ever want to read args this way in _start, which isn't a function

    and  $-4096, %rsp           # round RSP down to a page boundary.
    sub  $BUFSIZE, %rsp         # reserve 64K buffer aligned by 4k

    dec  %eax      # if argc == 1,  then run with input fd = 0   (stdin)
    jz  .Luse_stdin

    # open argv[1]
    mov     $__NR_open, %eax 
    xor     %esi, %esi     # flags: 0 means read-only.
    xor     %edx, %edx     # mode unused without O_CREAT, but zero it out for peace of mind.
    syscall       # fd = open(argv[1], O_RDONLY)

.Luse_stdin:           # don't use stdin as a symbol name; stdio.h / libc also has one of type FILE*
    mov  %eax, %ebx     # save FD
    mov  %rsp, %rsi     # always read and write the same buffer
    jmp  .Lentry        # start with a read then EOF-check as loop condition
              # since we're now error-checking the write,
              # rotating the loop maybe wasn't helpful after all
              # and perhaps just read at the top so we can fall into it would work equally well

read_and_write:              # do {
    # print the file
    mov     %eax, %edx             # size = read_size
    mov     $__NR_write, %eax      # syscall #1 = write.
    mov     $1, %edi               # output fd always stdout
    #mov     %rsp, %rsi             # buf, done once outside loop
    syscall                        # write(1, buf, read_size)

    cmp     %rax, %rdx             # written size should match request
    jne     cleanup                 # TODO: handle short writes by calling again for the unwritten part of the buffer, e.g. add %rax, %rsi
                                    # but also check for write errors.
.Lentry:
     # read the file.
    mov    $__NR_read, %eax     # xor  %eax, %eax
    mov    %ebx, %edi           # input FD
   # mov    %rsp, %rsi           # done once outside loop
    mov    $BUFSIZE, %edx
    syscall                     # size = read(fd, buf, BUFSIZE)

    test   %eax, %eax
    jg     read_and_write    # }while(read_size > 0);   // until EOF or error
# any negative can be assumed to be an error, since we pass a size smaller than INT_MAX

cleanup:
# fd might be stdin which we don't want to close.
# just exit and let kernel take care of it, or check for fd==0
#    movl $__NR_close, %eax
#    movl %ebx, %edi 
#    syscall          # close (fd)  // return value ignored

exit:
    mov  %eax, %edi             # exit status = last syscall return value. read() = 0 means EOF, success.
    mov  $__NR_exit_group, %eax
    syscall                     # exit_group(status);

For instruction counts, perf stat --all-user ./asmcat /tmp/random > /dev/null shows it runs about 47 instructions in user-space, vs. 57 for yours. (IIRC, perf over-counts by 1, so I've subtracted that from the measured result.) And that's with more error-checking, e.g. for short writes.

This is only 84 bytes of machine code in the .text section (vs. 174 bytes for your original), and I didn't optimize for size over speed with stuff like lea 1(%rsi), %eax (after zeroing RSI) instead of mov $1, %eax. (Or with mov %eax, %edi to take advantage of __NR_write == STDOUT_FILENO, both being 1.)

I mostly avoided R8..R15 because they need REX prefixes to access in the machine code.

Tests of error handling:

$ gcc -nostdlib -static asmcat.S -o asmcat            # build
$ cat /tmp/random | strace ./asmcat > /dev/full

execve("./asmcat", ["./asmcat"], 0x7ffde5e369d0 /* 55 vars */) = 0
read(0, "=head1 NAME\n\n=for comment  Gener"..., 65536) = 65536
write(1, "=head1 NAME\n\n=for comment  Gener"..., 65536) = -1 ENOSPC (No space left on device)
exit_group(-28)                         = ?
+++ exited with 228 +++
$ strace ./asmcat <&-      # close stdin
execve("./asmcat", ["./asmcat"], 0x7ffd0f5048c0 /* 55 vars */) = 0
read(0, 0x7ffc1b3ca000, 65536)          = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++
$ strace ./asmcat /noexist
execve("./asmcat", ["./asmcat", "/noexist"], 0x7ffd429f1158 /* 55 vars */) = 0
open("/noexist", O_RDONLY)              = -1 ENOENT (No such file or directory)
read(-2, 0x7ffd4f296000, 65536)         = -1 EBADF (Bad file descriptor)
exit_group(-9)                          = ?
+++ exited with 247 +++

Hmm, should probably test/jl on the fd after open, if you wanted to do error handling.

Peter Cordes
  • Really excellent detail here. I'm rewriting the program now. Thanks so much for taking the time to explain this so clearly. Any recommendations on where I can read more about memory paging in linux? – Peter Engelbert Apr 18 '21 at 04:35
  • @PeterEngelbert: I just updated with my rewrite, if you want to compare my choices against your rewrite. Re: kernel memory paging: https://www.kernel.org/doc/html/latest/admin-guide/mm/index.html is an overview / top-level of kernel docs for that very broad subject, including https://www.kernel.org/doc/html/latest/admin-guide/mm/concepts.html for the concepts overview. https://www.oreilly.com/library/view/system-performance-tuning/059600284X/ch04.html is a chapter of an *old* sysadmin book on tuning Linux systems – Peter Cordes Apr 18 '21 at 04:55
  • @PeterEngelbert: Probably also a good idea to have a look at [What Every Programmer Should Know About Memory?](https://stackoverflow.com/q/8126311) for more about cache. But really, there's a lot of background info / concepts that you'd get from an operating-systems textbook. (and/or taking an undergrad CS class in operating systems if you're already a student.) – Peter Cordes Apr 18 '21 at 04:56
  • I've been working my way through the Bryant/O'Hallaron book. I'm trying to get better at asm through practice before moving on, but I'm assuming the later chapters will have more to say about memory management. Since it's a lazy Sunday I'm going to check out those resources you sent :) – Peter Engelbert Apr 18 '21 at 17:35
  • @PeterEngelbert: Forgot to mention, other than an OS textbook, https://en.wikipedia.org/wiki/Virtual_memory#Paged_virtual_memory is pretty important background knowledge about the basics of memory pages. Most modern CPUs follow that model, and Linux doesn't do anything super weird with it. IDK what level of basics you were ready to start with. – Peter Cordes Apr 18 '21 at 21:31
  • I am trying to `#include <asm/unistd.h>` like you have, but I get errors during the linking stage. I am assembling using the GNU `as` and linking using `ld`. I'm guessing I have to provide some option to `ld` to find those constants (such as `__NR_read`). For the time being I worked around this by defining the constants myself at the top of the file, but it would be nice to know how to include header files properly. – Peter Engelbert Apr 19 '21 at 01:32
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/231297/discussion-between-peter-engelbert-and-peter-cordes). – Peter Engelbert Apr 19 '21 at 01:34
  • @PeterEngelbert: Like I said in one paragraph buried in the middle of my answer, `gcc ... foo.S` runs it through the C preprocessor *before* assembling; that paragraph also included the exact GCC options you want. `as` itself can't do anything with `#include` / `#define`. You can use `gcc -v` to see the actual commands it runs for that, and to assemble + link. – Peter Cordes Apr 19 '21 at 01:36