7

Recently I was playing with FreeBSD system calls. I had no problem with the i386 part, since it's well documented here. But I can't find the same documentation for x86_64.

I saw that people use the same approach as on Linux, but they use only assembly, not C. I suppose that in my case the system call is actually changing some register that the compiler uses at a high optimization level, so it gives different behaviour.

/* for SYS_* constants */
#include <sys/syscall.h>

/* for types like size_t */
#include <unistd.h>

ssize_t sys_write(int fd, const void *data, size_t size){
    register long res __asm__("rax");
    register long arg0 __asm__("rdi") = fd;
    register long arg1 __asm__("rsi") = (long)data;
    register long arg2 __asm__("rdx") = size;
    __asm__ __volatile__(
        "syscall"
        : "=r" (res)
        : "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
        : "rcx", "r11", "memory"
    );
    return res;
}

int main(){
    for(int i = 0; i < 1000; i++){
        char a = 0;
        int some_invalid_fd = -1;
        sys_write(some_invalid_fd, &a, 1);
    }
    return 0;
}

In the above code I just expect it to call sys_write 1000 times and then return from main. I use truss to check the system calls and their parameters. Everything works fine with -O0, but with -O3 the for loop gets stuck forever. I believe the system call is changing the i variable or the 1000 to something weird.

Dump of assembler code for function main:

0x0000000000201900 <+0>:     push   %rbp
0x0000000000201901 <+1>:     mov    %rsp,%rbp
0x0000000000201904 <+4>:     mov    $0x3e8,%r8d
0x000000000020190a <+10>:    lea    -0x1(%rbp),%rsi
0x000000000020190e <+14>:    mov    $0x1,%edx
0x0000000000201913 <+19>:    mov    $0xffffffffffffffff,%rdi
0x000000000020191a <+26>:    nopw   0x0(%rax,%rax,1)
0x0000000000201920 <+32>:    movb   $0x0,-0x1(%rbp)
0x0000000000201924 <+36>:    mov    $0x4,%eax
0x0000000000201929 <+41>:    syscall 
0x000000000020192b <+43>:    add    $0xffffffff,%r8d
0x000000000020192f <+47>:    jne    0x201920 <main+32>
0x0000000000201931 <+49>:    xor    %eax,%eax
0x0000000000201933 <+51>:    pop    %rbp
0x0000000000201934 <+52>:    ret

What is wrong with sys_write()? Why does the for loop get stuck?

Peter Cordes
fsdfhdsjkhfjkds
  • 1
    You should see what -O3 do deeply. – TZof Mar 30 '21 at 20:47
  • 1
    On `x86_64`, per the ABI, the first 6 arguments are passed in [specific] registers: https://en.wikipedia.org/wiki/X86_calling_conventions By using (e.g.) `__asm__("rdi")` you are breaking that. Also, you need to look at the _syscall_ ABI for which arg goes in which register, so see: https://stackoverflow.com/questions/2535989/what-are-the-calling-conventions-for-unix-linux-system-calls-and-user-space-f Doing inline asm can be tricky. Unless you need to, you could use the `syscall` library function. Also, leave off `register`--the optimizer will do better. – Craig Estey Mar 30 '21 at 20:52
  • 1
    If you don't get an answer here, I strongly suggest you ask on the mailing lists or irc where the developers hang out. – Rob Mar 30 '21 at 20:53
  • Is using intrinsics instead of `asm` an option here? – tadman Mar 30 '21 at 21:00
  • @CraigEstey isn't the register order I use right? rdi is the first parameter; I assign it to fd, which is also the first parameter of the SYS_write system call. Also, we can't remove the register keyword at the beginning of the variable because we want to bind it to the rdi register; if we remove it, the compiler will ignore it. – fsdfhdsjkhfjkds Mar 30 '21 at 21:01
  • @tadman I would like to learn how it works but it also would be nice to learn intrinsics way too. – fsdfhdsjkhfjkds Mar 30 '21 at 21:07
  • 3
    @CraigEstey: `register T foo asm("regname")` *is* a valid way of making an `"r"` constraint pick the register you want. (And in fact the only supported use-case for [GNU C register-asm local vars](https://gcc.gnu.org/onlinedocs/gcc/Local-Register-Variables.html)) It's a lot more clunky than `"D"(fd)`, but it's the only option for r8..r15, and on other ISAs that don't have specific-register constraints. I don't see a problem (yet), assuming x86-64 FreeBSD uses the same system-call ABI as x86-64 Linux, even using a "memory" clobber so the compiler knows the pointed-to memory must be in sync. – Peter Cordes Mar 30 '21 at 21:15
  • @CraigEstey: e.g. [How to invoke a system call via syscall or sysenter in inline assembly?](https://stackoverflow.com/a/9508738) uses basically the same definition for Linux. But without that pointless casting and extension of the args to signed `long`; if you're writing a wrapper for a specific syscall, you can use the exact types instead of trying to make it generic. – Peter Cordes Mar 30 '21 at 21:19
  • What register does your compiler keep `i` in, with optimization? (And what compiler / version are you using? GCC and clang should both respect the constraints and clobbers, so it should be fine unless FreeBSD system calls destroy some registers beyond RAX, RCX, and R11. Single-stepping the asm with a debugger should tell you whether any other regs are getting modified.) – Peter Cordes Mar 30 '21 at 21:31
  • @PeterCordes I edited thread to add disassembled code. I suppose it's `r8d`. `FreeBSD clang version 10.0.1 (git@github.com:llvm/llvm-project.git llvmorg-10.0.1-0-gef32c611aa2) Target: x86_64-unknown-freebsd12.2` I tried `catch syscall write` `info registers` but that problem not happening when I use gdb. – fsdfhdsjkhfjkds Mar 30 '21 at 21:50
  • When you do `-O3`, `gcc` will _inline_ the call to `sys_write` into `main`, so disassemble `main` as well. – Craig Estey Mar 30 '21 at 21:52
  • Yeah, it's using R8D as the loop counter, obeying the constraints you used. So the it's not a compiler bug. You say it works differently when you run under GDB, and the `syscall` doesn't modify R8 in that case? That's certainly plausible; the details of the system-call ABI may vary between OSes, unlike the function-calling convention for systems using x86-64 System V. – Peter Cordes Mar 30 '21 at 21:52
  • @CraigEstey: That is disassembly for `main`, note the movb store, then `syscall`, in the `add r8d, -1` / `jnz` loop. (With [a dummy read-only memory input operand](https://stackoverflow.com/questions/56432259/how-can-i-indicate-that-the-memory-pointed-to-by-an-inline-asm-argument-may-be), the compiler would know that the asm statement only reads memory, not modifies, so it wouldn't have to redo the store every iteration. e.g. `"m"(*(const char (*)[size]) data)`. But a "memory" clobber is more of a blunt instrument.) – Peter Cordes Mar 30 '21 at 21:53
  • Do a `b main` and then `run`. Then, do: `x/i $rip`, `info registers`, `si` [lather rinse repeat] – Craig Estey Mar 30 '21 at 21:59
  • @PeterCordes I checked r8 register it decreasing one by one from 1000 to 0 without any problem when I run with gdb. – fsdfhdsjkhfjkds Mar 30 '21 at 22:00
  • What happens if you decrease 1000 to something smaller (e.g. 10)? With `si`, the issue may happen sooner – Craig Estey Mar 30 '21 at 22:02
  • @CraigEstey It's broken at `i < 43` and above, but works at `i < 42` and below. – fsdfhdsjkhfjkds Mar 30 '21 at 22:06
  • Do a disassembly of your `libc`'s `write` function. What does it do that your syscall doesn't? (e.g. save registers to stack, etc.) BTW, I'm doing it under linux and it runs fine. You could also run under `strace` to see if the "interference" is `gdb` or `ptrace` (i.e. if doing `ptrace` on your app "fixes" the issue (e.g. both gdb and strace work), then `ptrace` does something. If [only] `gdb` "fixes" it, that is a clue). – Craig Estey Mar 30 '21 at 22:11
  • @CraigEstey I don't think it's good idea to try get knowledge from disassemble something that has source code. But thanks for idea I'll try dig libc code. Also I use kinda same code in linux too it works. I use `truss` as `strace` alternative and like gdb it also fixes problem. – fsdfhdsjkhfjkds Mar 30 '21 at 22:21
  • Do you have linux emulation installed? See: https://docs.freebsd.org/en/books/developers-handbook/x86-system-calls.html – Craig Estey Mar 30 '21 at 22:23
  • Disassembly will save you tons of reading if macros are in the libc source. BTW, from the link, BSD syscalls with `int 0x80` use different registers. So, you need linux emulation installed??? From the link, it does _not_ specify use of `syscall` inst, but _only_ `int 0x80`. – Craig Estey Mar 30 '21 at 22:29
  • @CraigEstey: From the POV of the compiler, a libc wrapper is already a function, so it already has to be assumed to clobber all the call-clobbered registers (which includes the arg-passing registers like r8). So you wouldn't see explicit save/restore of any registers except possibly RBX/RBP, and R12-R15. I'd be surprised about that, but I could imagine that FreeBSD kernels clobber R8-R11 on `syscall`, and maybe also other arg-passing regs like RDI, RSI, and RDX. One way to test for that is to set all regs to a "unique" value like 0xdeadbeefdeadbeef and `si` past `syscall`. – Peter Cordes Mar 30 '21 at 22:39
  • @CraigEstey I queried packages `linux_base-c7` `linux-c7` with `pkg query %n` I dont have these installed. Also I checked clang output file with binwalk it says `ELF, 64-bit LSB executable, AMD x86-64, version 1 (FreeBSD)` which means its not linux branded. – fsdfhdsjkhfjkds Mar 30 '21 at 22:41
  • Hmm, I guess it's possible that having tracing / single-stepping enabled makes the kernel do something different in its entry-point, like save *all* the registers so the process using ptrace can see all of them before the system call. But without tracing, maybe just dispatches to the right function-pointer in the system-call table, letting the C calling convention take care of saving the call-preserved regs. (That could leak kernel data back to user-space if it's exactly like that, but some variation maybe.) To test that, set a breakpoint *after* `syscall` and use GDB `c` instead of `si`. – Peter Cordes Mar 30 '21 at 22:43
  • @CraigEstey Also I noted in top of thread that page shows i386 system calls not x86_64. – fsdfhdsjkhfjkds Mar 30 '21 at 22:45
  • @CraigEstey: I think FreeBSD, like MacOS, uses `syscall` on x86-64. They'd have to be insane to only use the slow `int $0x80` as their system-call ABI in 64-bit mode. Anyway, https://github.com/freebsd/freebsd-src/blob/098dbd7ff7f3da9dda03802cdb2d8755f816eada/sys/amd64/amd64/exception.S#L516 is the source for the `syscall` entry-point into an AMD64 kernel. I haven't read the source yet to see what it does, but it looks like it's saving all the regs to the stack. (Like Linux does). There's a mention in a later comment of r10 being possibly clobbered (when "profiling"?) but not R8. – Peter Cordes Mar 30 '21 at 22:54
  • @PeterCordes I did `break *0x000000000020192b` then `info registers` when break happened. `r8` is zero. Program still gets stuck in this case. – fsdfhdsjkhfjkds Mar 30 '21 at 22:59
  • 2
    I've seen the likes of this. There was a change to the Linux kernel some years ago where some system calls started trashing call-clobbered registers and broke a lot of my code. The for loop is in r8d, which is call-clobbered. Did the same thing happen to you? – Joshua Mar 31 '21 at 02:36
  • @PeterCordes I cobbled together a register save mechanism similar to a logic analyzer to help in diagnosing the problem. It's a bit crude but similar to what I've done in sticky circumstances in the past. – Craig Estey Mar 31 '21 at 02:38
  • @Joshua: I wasn't really messing around with asm 10 years ago, and other than experiments, I don't have any code that inlines `syscall`. So I don't have personal experience with that. – Peter Cordes Mar 31 '21 at 02:53
  • 1
    @PeterCordes: It saves the registers to the stack all right, but check out [the return code](https://github.com/freebsd/freebsd-src/blob/098dbd7ff7f3da9dda03802cdb2d8755f816eada/sys/amd64/amd64/exception.S#L605) where it *doesn't* restore them. `rsi, rdi` and the call-preserved registers are saved, but `r8, r9, r10` are explicitly zeroed. – Nate Eldredge Mar 31 '21 at 03:14
  • @NateEldredge: Thanks, I'd just posted an answer with some guesses when I saw your comment; updated now that we have the real answer. (RDX is also restored, so the call-clobbered "non-legacy" regs are clobbered). – Peter Cordes Mar 31 '21 at 04:11
  • @PeterCordes: I wasn't quite sure about `rdx` given the comment "return value 2". Could there be some system call that returns a 128-bit return value in `rdx:rax`? – Nate Eldredge Mar 31 '21 at 04:16
  • @NateEldredge: Oh! I didn't notice that comment. Hmm. – Peter Cordes Mar 31 '21 at 04:17
  • @NateEldredge: I think that comment is bogus. It's reloading the saved `TF_RDX` slot, not using the RDX:RAX return value from `call amd64_syscall` like it does for RAX. So unless `amd64_syscall` itself modifies RDX for specific system calls with wide return values, it's restoring the incoming value. – Peter Cordes Mar 31 '21 at 04:21
  • @PeterCordes: It also reloads `rax` from `TF_RAX`, so I think there must be somewhere in the C code that writes the system call return value over the saved `rax`. (It doesn't come from whatever `amd64_syscall` leaves in `rax` because the following instruction at line 585 overwrites it.) If I could find it, it might be clear whether the saved `rdx` is also overwritten there. – Nate Eldredge Mar 31 '21 at 04:24
  • @NateEldredge: oh yes, I see now. Yeah, OP didn't report anything about `size` changing, but maybe it is. – Peter Cordes Mar 31 '21 at 04:27
  • Since the problem doesn't happen with trace/single-step, I put a breakpoint at the same address (after syscall) and checked `info registers`: `rdx` has the right value. Can "return value 2" be related to [Linux behaviour note 6?](https://man7.org/linux/man-pages/man2/syscall.2.html#NOTES) It's only used on specific archs / specific system calls. – fsdfhdsjkhfjkds Mar 31 '21 at 14:03

2 Answers

6

Optimization level determines where clang decides to keep its loop counter: in memory (unoptimized) or in a register, in this case r8d (optimized). R8D is a logical choice for the compiler: it's a call-clobbered reg it can use without saving at the start/end of main, and you've told it all the registers it could use without a REX prefix (like ECX) are either inputs / outputs or clobbers for the asm statement.

Note: if FreeBSD is like MacOS, system call error / no-error status is returned in CF (the carry flag), not via RAX being in the -4095..-1 range. In that case, you'd want a GCC6 flag-output operand like "=@ccc" (err) for int err (check for support with #ifdef __GCC_ASM_FLAG_OUTPUTS__), or a setc %cl in the template to materialize a boolean manually. (CL is a good choice because you can just use it as an output instead of a clobber.)
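
As a minimal sketch of the flag-output mechanism on its own (not FreeBSD-specific, and not part of the wrappers in this answer), the function below reads CF after an add instruction via "=@ccc". The name add_carries is hypothetical. On a syscall wrapper, the same "=@ccc" (err) operand would go on the asm statement that contains syscall. Where flag outputs aren't available, the sketch falls back to plain C.

```c
#include <stdint.h>

/* Hypothetical helper: returns 1 if a + b carries out of 64 bits, else 0. */
int add_carries(uint64_t a, uint64_t b)
{
#if defined(__x86_64__) && defined(__GCC_ASM_FLAG_OUTPUTS__)
    int carry;
    /* "=@ccc" asks the compiler to read CF once the asm statement is done.
       a is a read-write operand because add overwrites its destination. */
    __asm__("add %2, %1"
            : "=@ccc" (carry), "+r" (a)
            : "r" (b));
    return carry;
#else
    /* Portable fallback: unsigned wraparound means a carry happened. */
    return (uint64_t)(a + b) < a;
#endif
}
```

If the caller only does if (err), the compiler is free to branch on the flag directly instead of materializing it in a register first.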


FreeBSD's syscall handling trashes R8, R9, and R10, in addition to the bare minimum clobbering that Linux does: RAX (retval) and RCX / R11. (The syscall instruction itself uses RCX and R11 to save RIP / RFLAGS so the kernel can find its way back to user-space, so the kernel never even sees the original values.)

Possibly also RDX, we're not sure; the comments call it "return value 2" (i.e. as part of a RDX:RAX return value?). We also don't know what future-proof ABI guarantees FreeBSD intends to maintain in future kernels.

You can't assume R8-R10 are zero after syscall because they're actually preserved instead of zeroed when tracing / single-stepping. (Because then the kernel chooses not to return via sysret, for the same reason as Linux: hardware / design bugs make that unsafe if registers might have been modified by ptrace while inside the system call. e.g. attempting to sysret with a non-canonical RIP will #GP in ring 0 (kernel mode) on Intel CPUs! That's a disaster because RSP = user stack at that point.)


The relevant kernel code is the sysret path (well spotted by @NateEldredge; I found the syscall entry point by searching for swapgs, but hadn't gotten to looking at the return path).

The function-call-preserved registers don't need to be restored by that code because calling a C function didn't destroy them in the first place. The code does restore the function-call-clobbered "legacy" registers RDI, RSI, and RDX.

R8-R11 are the registers that are call-clobbered in the function-calling convention, and that are outside the original 8 x86 registers. So that's what makes them "special". (R11 doesn't get zeroed; syscall/sysret uses it for RFLAGS, so that's the value you'll find there after syscall)

Zeroing is slightly faster than loading them, and in the normal case (syscall instruction inside a libc wrapper function) you're about to return to a caller that's only assuming the function-calling convention, and thus will assume that R8-R11 are trashed (same for RDI, RSI, RDX, and RCX, although FreeBSD does bother to restore those for some reason.)


This zeroing only happens when not single-stepping or tracing (e.g. truss or GDB si). The syscall entry point into an amd64 kernel (Github) does save all the incoming registers, so they're available to be restored by other ways out of the kernel.
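
One way to check this on a given kernel is a small probe that loads non-zero sentinel values into R8-R10, makes a harmless system call, and reports which registers changed. This is only a sketch: syscall_clobber_mask is a hypothetical name, SYS_getpid is used because it takes no arguments, and the clobber list is deliberately pessimistic. Going by the discussion above, an untraced FreeBSD process would presumably report all three bits set, while Linux, or a run under truss / GDB single-stepping, would report none.

```c
#include <sys/syscall.h>   /* SYS_getpid */

/* Hypothetical probe: bit 0 is set if R8 changed across the syscall,
   bit 1 for R9, bit 2 for R10. Returns -1 on targets not covered here. */
int syscall_clobber_mask(void)
{
#if defined(__x86_64__) && (defined(__linux__) || defined(__FreeBSD__))
    /* Non-zero sentinels, so that zeroing by the kernel is detectable. */
    register unsigned long r8  __asm__("r8")  = 0x1111111111111111UL;
    register unsigned long r9  __asm__("r9")  = 0x2222222222222222UL;
    register unsigned long r10 __asm__("r10") = 0x3333333333333333UL;
    unsigned long rax = SYS_getpid;   /* harmless call that takes no args */

    __asm__ __volatile__("syscall"
        : "+a" (rax), "+r" (r8), "+r" (r9), "+r" (r10)
        :
        /* assume the worst about every other register the kernel may touch */
        : "rcx", "r11", "rdx", "rdi", "rsi", "memory", "cc");

    return (r8  != 0x1111111111111111UL ? 1 : 0)
         | (r9  != 0x2222222222222222UL ? 2 : 0)
         | (r10 != 0x3333333333333333UL ? 4 : 0);
#else
    return -1;
#endif
}
```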


Updated asm() wrapper

// Should be fixed for FreeBSD, plus other improvements
ssize_t sys_write(int fd, const void *data, size_t size){
    register ssize_t res __asm__("rax");
    register int arg0 __asm__("edi") = fd;
    register const void *arg1 __asm__("rsi") = data;  // you can use real types
    register size_t arg2 __asm__("rdx") = size;
    __asm__ __volatile__(
        "syscall"
                    // RDX *maybe* clobbered
        : "=a" (res), "+r" (arg2)
                           // RDI, RSI preserved
        : "a" (SYS_write), "r" (arg0), "r" (arg1)
          // An arg in R10, R8, or R9 definitely would be
        : "rcx", "r11", "memory", "r8", "r9", "r10"   ////// The fix: r8-r10
         // see below for a version that avoids the "memory" clobber with a dummy input operand
    );
    return res;
}

Use "+r" output/input operands with any args that need register long arg3 asm("r10") or similar for r8 or r9.

This is inside a wrapper function, so the modified values of the C variables get thrown away, forcing repeated calls to set up the args every time. That would be the "defensive" approach until another answer identifies more definitely-non-trashed registers.
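
For system calls that take more than three arguments, the "+r" advice could look like the sketch below. It is a hypothetical generic wrapper (raw_syscall6 is not an existing API) that assumes the x86-64 Linux register assignment also applies on FreeBSD, as discussed in the comments. It returns the raw RAX value and does not handle the carry-flag error convention mentioned earlier.

```c
#include <sys/syscall.h>   /* SYS_* numbers */

/* Hypothetical generic wrapper: returns whatever the kernel left in RAX,
   or -1 on targets this sketch does not cover. */
long raw_syscall6(long nr, long a0, long a1, long a2,
                  long a3, long a4, long a5)
{
#if defined(__x86_64__) && (defined(__linux__) || defined(__FreeBSD__))
    /* No letter constraints exist for R8-R10, so use register-asm locals. */
    register long r10 __asm__("r10") = a3;
    register long r8  __asm__("r8")  = a4;
    register long r9  __asm__("r9")  = a5;
    long ret;

    __asm__ __volatile__("syscall"
        /* "+" marks the args that FreeBSD may not hand back unchanged */
        : "=a" (ret), "+d" (a2), "+r" (r10), "+r" (r8), "+r" (r9)
        : "a" (nr), "D" (a0), "S" (a1)
        : "rcx", "r11", "memory", "cc");
    return ret;
#else
    (void)nr; (void)a0; (void)a1; (void)a2; (void)a3; (void)a4; (void)a5;
    return -1;
#endif
}
```

Because R8-R10 are read-write operands here, they don't go in the clobber list; the compiler already knows their values change.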


Quoting the OP's comment: "I did break *0x000000000020192b then info registers when break happened. r8 is zero. Program still gets stuck in this case."

I assume that r8 wasn't zero before you did that GDB continue across the syscall instruction. Yes, that test confirms that the FreeBSD kernel is trashing r8 when not single-stepping. (And behaving in a way that matches what we see in the source code.)


Note that you can tell the compiler that a write system call only reads memory (not writes) using a dummy "m" input operand instead of a "memory" clobber. That would let it hoist the store of a out of the loop. (How can I indicate that the memory *pointed* to by an inline ASM argument may be used?)

i.e. "m"(*(const char (*)[size]) data) as an input instead of a "memory" clobber.

If you're going to write specific wrappers for each syscall you use, instead of a generic wrapper you use for every 3-operand syscall that just casts all operands to unsigned long, this is the advantage you can get from doing that.

Speaking of which, there's absolutely no point in making your syscall args all be long; making user-space sign-extend int fd into a 64-bit register is just wasted instructions. The kernel ABI will (almost certainly) ignore the high bytes of registers for narrow args, like Linux does. (Again, unless you're making a generic syscall3 wrapper that you just use with different SYS_ numbers to define write, read, and other 3-operand system calls; then you would cast everything to register-width and just use a "memory" clobber).

I made these changes for my modified version below.

Also note that for RDI, RSI, and RDX, there are specific-register letter constraints which you can use instead of register-asm locals, just like you're doing for the return value in RAX ("=a"). BTW, you don't really need a matching constraint for the call number, just use an "a" input; it's easier to read because you don't need to look at another operand to check that you're matching the right output.

// assuming RDX *is* clobbered.
// could remove the + if it isn't.
ssize_t sys_write(int fd, const void *data, size_t size)
{
    // register long arg3 __asm__("r10") = ??;
    // register-asm is useful for R8 and up

    ssize_t res;
    __asm__ __volatile__("syscall"
                    // RDX
        : "=a" (res), "+d" (size)
         //  EAX/RAX       RDI       RSI
        : "a" (SYS_write), "D" (fd), "S" (data),
          "m" (*(const char (*)[size]) data) // tells compiler this mem is an input
        : "rcx", "r11"    //, "memory"
#ifndef __linux__
              , "r8", "r9", "r10"   // Linux always restores these
#endif
    );
    return res;
}

Some people prefer register ... asm("") for all the operands because you get to use the full register name, and don't have to remember the totally-non-obvious "D" for RDI/EDI/DI/DIL vs. "d" for RDX/EDX/DX/DL.

Peter Cordes
  • About `"=@ccc" (err)` part can't compiler optimize code and store err variable at register `cl`? – fsdfhdsjkhfjkds Apr 03 '21 at 14:46
  • 1
    @fsdfhdsjkhfjkds: It could emit a `setc cl` if it wanted to, yes. But if you do `if(err)`, it can just `jc` or `jnc` directly, without ever materializing the boolean variable in a register with a useless `setc cl` / `test cl,cl` sequence. i.e. that would be a pessimization for most use-cases. – Peter Cordes Apr 03 '21 at 17:25
1

Here's a test framework to work with. It is [loosely] modeled on a H/W logic analyzer and/or things like dtrace.

It will save registers before and after the syscall instruction in a large global buffer.

After the loop terminates it will dump out a trace of all the register values that were stored.


It is multiple files. To extract:

  1. save the code below to a file (e.g. /tmp/archive).
  2. Create a directory: (e.g.) /tmp/extract
  3. cd to /tmp/extract.
  4. Then do: perl /tmp/archive -go.
  5. It will create some subdirectories: /tmp/extract/syscall and /tmp/extract/snaplib and store a few files there.
  6. cd to the program target directory (e.g.) cd /tmp/extract/syscall
  7. build with: make
  8. Then, run with: ./syscall

Here is the file:

Edit: I've added a check for overflow of the snaplist buffer in the snapnow function. If the buffer is full, dumpall is called automatically. This is good in general but also necessary if the loop in main never terminates (i.e. without the check the post loop dump would never occur)

Edit: And, I've added optional "x86_64 red zone" support

#!/usr/bin/perl
# FILE: ovcbin/ovcext.pm 755
# ovcbin/ovcext.pm -- ovrcat archive extractor
#
# this is a self extracting archive
# after the __DATA__ line, files are separated by:
#   % filename

ovcext_cmd(@ARGV);
exit(0);

sub ovcext_cmd
{
    my(@argv) = @_;
    local($xfdata);
    local($xfdiv,$divcur,%ovcdiv_lookup);

    $pgmtail = "ovcext";
    ovcinit();
    ovcopt(\@argv,qw(opt_go opt_f opt_t));

    $xfdata = "ovrcat::DATA";
    $xfdata = \*$xfdata;

    ovceval($xfdata);

    ovcfifo($zipflg_all);

    ovcline($xfdata);

    $code = ovcwait();

    ovcclose(\$xfdata);

    ovcdiv();

    ovczipd_spl()
        if ($zipflg_spl);
}

sub ovceval
{
    my($xfdata) = @_;
    my($buf,$err);

    {
        $buf = <$xfdata>;
        chomp($buf);

        last unless ($buf =~ s/^%\s+([\@\$;])/$1/);

        eval($buf);

        $err = $@;
        unless ($err) {
            undef($buf);
            last;
        }

        chomp($err);
        $err = " (" . $err . ")"
    }

    sysfault("ovceval: bad options line -- '%s'%s\n",$buf,$err)
        if (defined($buf));
}

sub ovcline
{
    my($xfdata) = @_;
    my($buf);
    my($tail);

    while ($buf = <$xfdata>) {
        chomp($buf);

        if ($buf =~ /^%\s+(.+)$/) {
            $tail = $1;
            ovcdiv($tail);
            next;
        }

        print($xfdiv $buf,"\n")
            if (ref($xfdiv));
    }

}

sub ovcdiv
{
    my($ofile) = @_;
    my($mode);
    my($xfcur);
    my($err,$prt);

    ($ofile,$mode) = split(" ",$ofile);

    $mode = oct($mode);
    $mode &= 0777;

    {
        unless (defined($ofile)) {
            while ((undef,$divcur) = each(%ovcdiv_lookup)) {
                close($divcur->{div_xfdst});
            }
            last;
        }

        $ofile = ovctail($ofile);

        $divcur = $ovcdiv_lookup{$ofile};
        if (ref($divcur)) {
            $xfdiv = $divcur->{div_xfdst};
            last;
        }
        undef($xfdiv);

        if (-e $ofile) {
            msg("ovcdiv: file '%s' already exists -- ",$ofile);

            unless ($opt_f) {
                msg("rerun with -f to force\n");
                last;
            }

            msg("overwriting!\n");
        }

        unless (defined($err)) {
            ovcmkdir($1)
                if ($ofile =~ m,^(.+)/[^/]+$,);
        }

        msg("$pgmtail: %s %s",ovcnogo("extracting"),$ofile);
        msg(" chmod %3.3o",$mode)
            if ($mode);
        msg("\n");

        last unless ($opt_go);
        last if (defined($err));

        $xfcur = ovcopen(">$ofile");

        $divcur = {};
        $ovcdiv_lookup{$ofile} = $divcur;

        if ($mode) {
            chmod($mode,$xfcur);
            $divcur->{div_mode} = $mode;
        }

        $divcur->{div_xfdst} = $xfcur;
        $xfdiv = $xfcur;
    }
}

sub ovcinit
{

    {
        last if (defined($ztmp));
        $ztmp = "/tmp/ovrcat_zip";

        $PWD = $ENV{PWD};

        $quo_2 = '"';

        $ztmp_inp = $ztmp . "_0";
        $ztmp_out = $ztmp . "_1";
        $ztmp_perl = $ztmp . "_perl";

        ovcunlink();

        $ovcdbg = ($ENV{"ZPXHOWOVC"} != 0);
    }
}

sub ovcunlink
{

    _ovcunlink($ztmp_inp,1);
    _ovcunlink($ztmp_out,1);
    _ovcunlink($ztmp_perl,($pgmtail ne "ovcext") || $opt_go);
}

sub _ovcunlink
{
    my($file,$rmflg) = @_;
    my($found,$tag);

    {
        last unless (defined($file));

        $found = (-e $file);

        $tag //= "notfound"
            unless ($found);
        $tag //= $rmflg ? "cleaning" : "keeping";

        msg("ovcunlink: %s %s ...\n",$tag,$file)
            if (($found or $ovcdbg) and (! $ovcunlink_quiet));

        unlink($file)
            if ($rmflg and $found);
    }
}

sub ovcopt
{
    my($argv) = @_;
    my($opt);

    while (1) {
        $opt = $argv->[0];
        last unless ($opt =~ s/^-/opt_/);

        shift(@$argv);
        $$opt = 1;
    }
}

sub ovctail
{
    my($file,$sub) = @_;
    my(@file);

    $file =~ s,^/,,;
    @file = split("/",$file);

    $sub //= 2;

    @file = splice(@file,-$sub)
        if (@file >= $sub);

    $file = join("/",@file);

    $file;
}

sub ovcmkdir
{
    my($odir) = @_;
    my(@lhs,@rhs);

    @rhs = split("/",$odir);

    foreach $rhs (@rhs) {
        push(@lhs,$rhs);

        $odir = join("/",@lhs);

        if ($opt_go) {
            next if (-d $odir);
        }
        else {
            next if ($ovcmkdir{$odir});
            $ovcmkdir{$odir} = 1;
        }

        msg("$pgmtail: %s %s ...\n",ovcnogo("mkdir"),$odir);

        next unless ($opt_go);

        mkdir($odir) or
            sysfault("$pgmtail: unable to mkdir '%s' -- $!\n",$odir);
    }
}

sub ovcopen
{
    my($file,$who) = @_;
    my($xf);

    $who //= $pgmtail;
    $who //= "ovcopen";

    open($xf,$file) or
        sysfault("$who: unable to open '%s' -- $!\n",$file);

    $xf;
}

sub ovcclose
{
    my($xfp) = @_;
    my($ref);
    my($xf);

    {
        $ref = ref($xfp);
        last unless ($ref);

        if ($ref eq "GLOB") {
            close($xfp);
            last;
        }

        if ($ref eq "REF") {
            $xf = $$xfp;
            if (ref($xf) eq "GLOB") {
                close($xf);
                undef($$xfp);
            }
        }
    }

    undef($xf);

    $xf;
}

sub ovcnogo
{
    my($str) = @_;

    unless ($opt_go) {
        $str = "NOGO-$str";
        $nogo_msg = 1;
    }

    $str;
}

sub ovcdbg
{

    if ($ovcdbg) {
        printf(STDERR @_);
    }
}

sub msg
{

    printf(STDERR @_);
}

sub msgv
{

    $_ = join(" ",@_);
    print(STDERR $_,"\n");
}

sub sysfault
{

    printf(STDERR @_);
    exit(1);
}

sub ovcfifo
{
}

sub ovcwait
{
    my($code);

    if ($pid_fifo) {
        waitpid($pid_fifo,0);
        $code = $? >> 8;
    }

    $code;
}

sub prtstr
{
    my($val,$fmtpos,$fmtneg) = @_;

    {
        unless (defined($val)) {
            $val = "undef";
            last;
        }

        if (ref($val)) {
            $val = sprintf("(%s)",$val);
            last;
        }

        $fmtpos //= "'%s'";

        if (defined($fmtneg) && ($val <= 0)) {
            $val = sprintf($fmtneg,$val);
            last;
        }

        $val = sprintf($fmtpos,$val);
    }

    $val;
}

sub prtnum
{
    my($val) = @_;

    $val = prtstr($val,"%d");

    $val;
}

END {
    msg("$pgmtail: rerun with -go to actually do it\n")
        if ($nogo_msg);
    ovcunlink();
}

1;
package ovrcat;
__DATA__
% ;
% syscall/syscall.c
/* for SYS_* constants */
#include <sys/syscall.h>

/* for types like size_t */
#include <unistd.h>

#include <snaplib/snaplib.h>

ssize_t
my_write(int fd, const void *data, size_t size)
{
    register long res __asm__("rax");
    register long arg0 __asm__("rdi") = fd;
    register long arg1 __asm__("rsi") = (long)data;
    register long arg2 __asm__("rdx") = size;

    __asm__ __volatile__(
        SNAPNOW
        "\tsyscall\n"
        SNAPNOW
        : "=r" (res)
        : "0" (SYS_write), "r" (arg0), "r" (arg1), "r" (arg2)
        : "rcx", "r11", "memory"
    );

    return res;
}

int
main(void)
{

    for (int i = 0; i < 8000; i++) {
        char a = 0;
        int some_invalid_fd = -1;
        my_write(some_invalid_fd, &a, 1);
    }

    snapreg_dumpall();

    return 0;
}
% snaplib/snaplib.h
// snaplib/snaplib.h -- register save/dump

#ifndef _snaplib_snaplib_h_
#define _snaplib_snaplib_h_

#ifdef _SNAPLIB_GLO_
#define EXTRN_SNAPLIB       /**/
#else
#define EXTRN_SNAPLIB       extern
#endif

#ifdef RED_ZONE
#define SNAPNOW \
    "\tsubq\t$128,%%rsp\n" \
    "\tcall\tsnapreg\n" \
    "\taddq\t$128,%%rsp\n"
#else
#define SNAPNOW     "\tcall\tsnapreg\n"
#endif

typedef unsigned long reg_t;

#ifndef SNAPREG
#define SNAPREG     (1500 * 2)
#endif

typedef struct {
    reg_t snap_regs[16];
} __attribute__((packed)) snapreg_t;
typedef snapreg_t *snapreg_p;

EXTRN_SNAPLIB snapreg_t snaplist[SNAPREG];

#ifdef _SNAPLIB_GLO_
snapreg_p snapcur = &snaplist[0];
snapreg_p snapend = &snaplist[SNAPREG];
#else
extern snapreg_p snapcur;
extern snapreg_p snapend;
#endif

#include <snaplib/snaplib.proto>

#include <snaplib/snapgen.h>

#endif
% snaplib/snapall.c
// snaplib/snapall.c -- dump routines

#define _SNAPLIB_GLO_
#include <snaplib/snaplib.h>
#include <stdio.h>
#include <stdlib.h>

void
snapreg_dumpall(void)
{
    snapreg_p cur = snaplist;
    snapreg_p endp = (snapreg_p) snapcur;

    int idx = 0;
    for (;  cur < endp;  ++cur, ++idx) {
        printf("\n");
        printf("%d:\n",idx);
        snapreg_dumpgen(cur);
    }

    snapcur = snaplist;
}

// snapreg_crash -- invoke dump and abort
void
snapreg_crash(void)
{

    snapreg_dumpall();
    exit(9);
}

// snapreg_dumpone -- dump single element
void
snapreg_dumpone(snapreg_p cur,int regidx,const char *regname)
{
    reg_t regval = cur->snap_regs[regidx];

    printf("  %3s %16.16lX %ld\n",regname,regval,regval);
}
% snaplib/snapreg.s
    .text
    .globl  snapreg
snapreg:
    push    %r14
    push    %r15
    movq    snapcur(%rip),%r15
    movq    %rax,0(%r15)
    movq    %rbx,8(%r15)
    movq    %rcx,16(%r15)
    movq    %rdx,24(%r15)
    movq    %rsi,32(%r15)
    movq    %rdi,40(%r15)
    movq    %rbp,48(%r15)
    movq    %rsp,56(%r15)
    movq    %r8,64(%r15)
    movq    %r9,72(%r15)
    movq    %r10,80(%r15)
    movq    %r11,88(%r15)
    movq    %r12,96(%r15)
    movq    %r13,104(%r15)
    movq    %r14,112(%r15)
    movq    0(%rsp),%r14
    movq    %r14,120(%r15)
    addq    $128,%r15
    movq    %r15,snapcur(%rip)
    cmpq    snapend(%rip),%r15
    jae     snapreg_crash
    pop     %r15
    pop     %r14
    ret
% snaplib/snapgen.h
#ifndef _snapreg_snapgen_h_
#define _snapreg_snapgen_h_
static inline void
snapreg_dumpgen(snapreg_p cur)
{
    snapreg_dumpone(cur,0,"rax");
    snapreg_dumpone(cur,1,"rbx");
    snapreg_dumpone(cur,2,"rcx");
    snapreg_dumpone(cur,3,"rdx");
    snapreg_dumpone(cur,4,"rsi");
    snapreg_dumpone(cur,5,"rdi");
    snapreg_dumpone(cur,6,"rbp");
    snapreg_dumpone(cur,7,"rsp");
    snapreg_dumpone(cur,8,"r8");
    snapreg_dumpone(cur,9,"r9");
    snapreg_dumpone(cur,10,"r10");
    snapreg_dumpone(cur,11,"r11");
    snapreg_dumpone(cur,12,"r12");
    snapreg_dumpone(cur,13,"r13");
    snapreg_dumpone(cur,14,"r14");
    snapreg_dumpone(cur,15,"r15");
}
#endif
% snaplib/snaplib.proto
// /home/cae/OBJ/ovrgen/snaplib/snaplib.proto -- prototypes

// FILE: /home/cae/preserve/ovrbnc/snaplib/snapall.c
// snaplib/snapall.c -- dump routines

    void
    snapreg_dumpall(void);

    // snapreg_crash -- invoke dump and abort
    void
    snapreg_crash(void);

    // snapreg_dumpone -- dump single element
    void
    snapreg_dumpone(snapreg_p cur,int regidx,const char *regname);
% syscall/Makefile
# /home/cae/preserve/ovrbnc/syscall -- makefile
PGMTGT += syscall
LIBSRC += ../snaplib/snapreg.s
LIBSRC += ../snaplib/snapall.c
ifndef COPTS
    COPTS += -O2
endif
CFLAGS += $(COPTS)
CFLAGS += -mno-red-zone
CFLAGS += -g
CFLAGS += -Wall
CFLAGS += -Werror
CFLAGS += -I..
all: $(PGMTGT)
syscall: syscall.c $(CURSRC) $(LIBSRC)
	cc -o syscall $(CFLAGS) syscall.c $(CURSRC) $(LIBSRC)
clean:
	rm -f $(PGMTGT)
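
For reference, here is a minimal self-contained sketch of how a string macro shaped like `SNAPNOW` (with `RED_ZONE` defined) splices a `call` into an extended `asm` statement. `snap_stub`, `snap_hits`, `SNAP_STUB_NOW`, `sum_with_snapshots`, and `demo_sum` are hypothetical stand-ins, not part of snaplib, and the sketch assumes x86-64, an ELF target, and GCC or Clang:

```c
// Sketch only: snap_stub and snap_hits are hypothetical stand-ins for
// snapreg and its buffer. Assumes x86-64, ELF, and GCC or Clang.
#include <assert.h>

long snap_hits;                 // bumped once per call to snap_stub

// File-scope asm: a tiny routine that touches only memory and flags,
// so the caller needs no register clobbers for it.
__asm__(
    "\t.pushsection .text\n"
    "\t.globl\tsnap_stub\n"
    "snap_stub:\n"
    "\tincq\tsnap_hits(%rip)\n"
    "\tret\n"
    "\t.popsection\n"
);

// Same shape as SNAPNOW with RED_ZONE defined: step below the 128-byte
// red zone, make the call, then step back up.
#define SNAP_STUB_NOW \
    "\tsubq\t$128,%%rsp\n" \
    "\tcall\tsnap_stub\n" \
    "\taddq\t$128,%%rsp\n"

// The compiler cannot see the call inside the asm string, so it may
// treat this as a leaf function and keep locals in the red zone.
long
sum_with_snapshots(const long *vals, int count)
{
    long sum = 0;

    for (int idx = 0;  idx < count;  ++idx) {
        sum += vals[idx];
        __asm__ __volatile__(SNAP_STUB_NOW : : : "cc", "memory");
    }

    return sum;
}

// Small driver: the array may end up in the red zone after inlining.
long
demo_sum(void)
{
    long vals[4] = { 1, 2, 3, 4 };
    long before = snap_hits;
    long sum = sum_with_snapshots(vals, 4);

    assert(snap_hits == before + 4);    // one stub call per element
    return sum;
}
```

Without the `subq`/`addq` pair, the return address pushed by `call` would land in the first 8 bytes below `%rsp`, which is exactly where a leaf function's locals may be stored. Building with `-mno-red-zone`, as the Makefile does, avoids the problem without the extra instructions.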
Craig Estey
  • Note that `call` trashes the red-zone below RSP, and there's no way to tell the compiler about that. So you might want to put `add $-128, %rsp` / `sub $-128, %rsp` around the asm, although since `main` is non-leaf (even after function inlining) it won't actually have any vars in the red-zone even if you compiled without optimization. Also, you might consider setting all the regs to some known value to avoid coincidences of the kernel happening to zero one that was already zeroed, for example. But yeah, overall good plan to investigate exactly what FreeBSD does. – Peter Cordes Mar 31 '21 at 02:57
  • @PeterCordes I did this quickly and like I said, it's crude. But, I'm not sure the red zone applies here because of the nature of `my_write` [it is clear that it does _not_ use one]. There was a more pressing bug (see my `Edit:` above) because OP has an infinite loop. I'll keep red zone in mind for the "next rev" [and if I turn this into production grade code]. It's late here ... Time to sleep [or so my girlfriend says ...] – Craig Estey Mar 31 '21 at 03:44
  • Fortunately Nate spotted the key piece of code in the sysret return path. IDK what you mean by "my_write" not using the red-zone. It's a leaf function, so if you compiled it without optimization, the compiler *would* spill the incoming register args below RSP. Not that it needs them again after the `asm` statement. But anyway, that problem applies to anything `my_write` inlines into. Of course it's only instrumentation for an experiment, not something people in general would copy, so a comment is fine. – Peter Cordes Mar 31 '21 at 04:04
  • @PeterCordes Good eye [for Nate]. Who's gonna file the bug report? ;-) – Craig Estey Mar 31 '21 at 04:08
  • What bug? FreeBSD's syscall ABI is pretty clearly just intentionally different from Linux's, slightly optimized for calling through libc wrapper functions whose own callers will assume those registers are clobbered anyway. (xor-zero is cheaper and smaller than a load). The section of the x86-64 SysV ABI doc that describes Linux's syscall ABI is not normative, just an example of what one OS does. – Peter Cordes Mar 31 '21 at 04:16
  • @PeterCordes I've added red zone support. The default is to now just compile with `-mno-red-zone` but it also supports full red zone with `-DRED_ZONE`. Note that your fix for red zone was incorrect. The `add/sub` instructions _must_ wrap `call snapreg` in the _caller_ (e.g. `my_write` here). Otherwise, the `call` instruction itself mashes the red zone with the return address it pushes. Doing the `add/sub` in the _callee_ is too late. It would be interesting to know if an `asm volatile` block disables red zone usage for a given function, particularly with a memory clobber. – Craig Estey Mar 31 '21 at 21:23
  • Oh, I didn't realize that asm statement was itself called manually from inline asm, so the compiler wouldn't know about the call. (If that's the case, compiler-generated code could clobber other call-clobbered registers). No, `asm volatile("..." ::: "memory")` does *not* make the function non-leaf for gcc/clang's choice to use the red zone. [How do I tell gcc that my inline assembly clobbers part of the stack?](https://stackoverflow.com/q/39160450) – Peter Cordes Mar 31 '21 at 21:32
  • I think you got the offset backwards. You want `rsp -= 128` before the call to go down into fresh new stack space *below* the red zone. Thus `add $-128, %%rsp`. (`-128` fits in an imm8, `sub $+128, %%rsp` doesn't. Not a coincidence that the red zone is 128 bytes.) Code-size optimization is why I didn't follow the normal pattern of `sub` to reserve stack space, `add` to release it, but you can do that if you want. – Peter Cordes Mar 31 '21 at 21:35
  • @PeterCordes Yea, I was fixing it while you were writing your comment--updated – Craig Estey Mar 31 '21 at 21:37
  • I was [almost] going to ask you about your use of the `-128` with reversed instructions before I posted the update [the bug aside], but I figured it was too much [to ask]. I've restructured my code so it's reusable [for me], so I'll probably do a bit more cleanup. I can easily add the `-128` in an update. – Craig Estey Mar 31 '21 at 21:45
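
Putting the thread's conclusion into code, here is a sketch of the question's `sys_write` with a wider clobber list. The names `sys_write_safe` and `count_failed_writes` are made up for illustration. The extra `r8`, `r9`, and `r10` clobbers are an assumption drawn from the discussion above about FreeBSD's return path, so check them against the kernel source for your release before relying on them:

```c
// Sketch only: sys_write_safe and count_failed_writes are hypothetical
// names. Assumes x86-64 and GCC or Clang.
#include <assert.h>
#include <sys/syscall.h>    // SYS_write
#include <unistd.h>         // size_t

// Returns the raw value left in rax. FreeBSD signals failure through the
// carry flag with errno in rax; Linux returns a negative errno instead.
long
sys_write_safe(int fd, const void *data, size_t size)
{
    register long res __asm__("rax") = SYS_write;
    register long arg0 __asm__("rdi") = fd;
    register long arg1 __asm__("rsi") = (long) data;
    register long arg2 __asm__("rdx") = (long) size;

    __asm__ __volatile__(
        "syscall"
        : "+r" (res)
        : "r" (arg0), "r" (arg1), "r" (arg2)
        // rcx and r11 are overwritten by the syscall instruction itself.
        // r8, r9, r10 are listed on the assumption that the kernel does
        // not preserve them, so the compiler keeps no live values there.
        : "rcx", "r8", "r9", "r10", "r11", "cc", "memory"
    );

    return res;
}

// Same loop as in the question: every write fails, the loop must still end.
int
count_failed_writes(int rounds)
{
    int done = 0;

    assert(rounds >= 0);
    for (int i = 0;  i < rounds;  i++) {
        char a = 0;
        int some_invalid_fd = -1;

        sys_write_safe(some_invalid_fd, &a, 1);
        ++done;
    }

    return done;
}
```

With the wider clobber list the compiler can no longer keep the loop counter in `r8d` across the `syscall`, which is where the `-O3` disassembly in the question had placed it.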