seccomp --- how to EXIT_SUCCESS?

Question

How do I exit with EXIT_SUCCESS after strict-mode seccomp has been enabled? Is calling syscall(SYS_exit, EXIT_SUCCESS); at the end of main the correct practice?

#include <stdlib.h>
#include <unistd.h> 
#include <sys/prctl.h>     
#include <linux/seccomp.h> 
#include <sys/syscall.h>

int main(int argc, char **argv) {
  prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT);

  //return EXIT_SUCCESS; // does not work
  //_exit(EXIT_SUCCESS); // does not work
  // syscall(__NR_exit, EXIT_SUCCESS); // (EDIT) This works! Is this the ultimate answer and the right way to exit success from seccomp-ed programs?
  syscall(SYS_exit, EXIT_SUCCESS); // (EDIT) works; SYS_exit equals __NR_exit
}

// gcc seccomp.c -o seccomp && ./seccomp; echo "${?}" # I want 0

Can't you just return EXIT_SUCCESS? (Woops: never mind -- didn't look at your code closely enough.) — Steven, Oct 15 '15 at 20:20
It's very strange that `_exit(EXIT_SUCCESS)` doesn't work, as the manpage clearly states that, in strict seccomp mode, "The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2)." (where bracketed numbers are of course manual sections). — , Nov 06 '16 at 20:33
@user263688 I don't think the problem was what you posted as an answer (not the downvoter), I posted an answer, it would be nice if you would take a look! :) — gsamaras, Nov 07 '16 at 00:10

score 14 · Answer 1 · edited May 23 '17 at 12:26

As explained in eigenstate.org and in seccomp(2):

The only system calls that the calling thread is permitted to make are read(2), write(2), _exit(2) (but not exit_group(2)), and sigreturn(2). Other system calls result in the delivery of a SIGKILL signal.

One would therefore expect _exit() to work, but in glibc it is a wrapper function that invokes exit_group(2), which is not allowed in strict mode ([1], [2]), so the process gets killed.

It's even reported in exit(2) - Linux man page:

In glibc up to version 2.3, the _exit() wrapper function invoked the kernel system call of the same name. Since glibc 2.3, the wrapper function invokes exit_group(2), in order to terminate all of the threads in a process.

The same happens with the return statement: returning from main makes the C library terminate the process through exit_group(2) as well, so the process is killed in the same way as with _exit().

Running the process under strace confirms this. (For the exit call to show up in the trace, do not enable seccomp; just comment out the prctl() call.) I got similar output for both non-working cases:

linux12:/home/users/grad1459>gcc seccomp.c -o seccomp
linux12:/home/users/grad1459>strace ./seccomp
execve("./seccomp", ["./seccomp"], [/* 24 vars */]) = 0
brk(0)                                  = 0x8784000
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
mmap2(NULL, 8192, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb775f000
access("/etc/ld.so.preload", R_OK)      = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY|O_CLOEXEC) = 3
fstat64(3, {st_mode=S_IFREG|0644, st_size=97472, ...}) = 0
mmap2(NULL, 97472, PROT_READ, MAP_PRIVATE, 3, 0) = 0xb7747000
close(3)                                = 0
access("/etc/ld.so.nohwcap", F_OK)      = -1 ENOENT (No such file or directory)
open("/lib/i386-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3
read(3, "\177ELF\1\1\1\0\0\0\0\0\0\0\0\0\3\0\3\0\1\0\0\0\220\226\1\0004\0\0\0"..., 512) = 512
fstat64(3, {st_mode=S_IFREG|0755, st_size=1730024, ...}) = 0
mmap2(NULL, 1739484, PROT_READ|PROT_EXEC, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = 0xdd0000
mmap2(0xf73000, 12288, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1a3) = 0xf73000
mmap2(0xf76000, 10972, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0xf76000
close(3)                                = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7746000
set_thread_area({entry_number:-1 -> 6, base_addr:0xb7746900, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}) = 0
mprotect(0xf73000, 8192, PROT_READ)     = 0
mprotect(0x8049000, 4096, PROT_READ)    = 0
mprotect(0x16e000, 4096, PROT_READ)     = 0
munmap(0xb7747000, 97472)               = 0
exit_group(0)                           = ?
linux12:/home/users/grad1459>

As you can see, exit_group() is called, explaining everything!

Now as you correctly stated, "SYS_exit equals __NR_exit"; for example it's defined in mit.syscall.h:

#define SYS_exit __NR_exit

so the last two calls are equivalent, i.e. you can use the one you like, and the output should be this:

linux12:/home/users/grad1459>gcc seccomp.c -o seccomp && ./seccomp ; echo "${?}" 
0
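
To compare the two exit paths without reading strace output, here is a minimal sketch that forks a child, enables strict mode in it, and lets the parent report how the child ended. The names run_strict_child and STRICT_UNAVAILABLE are hypothetical and not from the question; the sketch also assumes strict mode can be enabled at all (inside a container that already runs under a seccomp filter, the prctl() call may be refused, which is reported as status 77).

```c
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <linux/seccomp.h>

/* Hypothetical status used when prctl() refuses strict mode. */
#define STRICT_UNAVAILABLE 77

/* Fork a child that enters strict mode and then terminates either through
   the raw exit syscall (use_raw_exit != 0) or through glibc's _exit().
   Returns the child's exit status, 128 + signal number if it was killed
   by a signal, or -1 if fork() or waitpid() failed. */
int run_strict_child(int use_raw_exit)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT) != 0)
            _exit(STRICT_UNAVAILABLE);      /* not sandboxed yet, wrapper is fine */
        if (use_raw_exit)
            syscall(SYS_exit, EXIT_SUCCESS); /* exit(2): allowed in strict mode */
        _exit(EXIT_SUCCESS);                 /* exit_group(2): not allowed */
    }

    int status;
    while (waitpid(pid, &status, 0) < 0) {
        if (errno != EINTR)
            return -1;
    }
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

Where strict mode is available, run_strict_child(1) should return 0 and run_strict_child(0) should return 128 + SIGKILL, matching the behaviour described above.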

PS

You could of course define a filter yourself and use:

prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, filter);

as explained in the eigenstate link, to allow _exit() (or, strictly speaking, exit_group(2)), but do that only if you really need to and know what you are doing.

Additionally, the reason why `return EXIT_SUCCESS;` also fails is the same: the GNU C library does an `exit_group()` in that case, too. I do have some freestanding C for x86-64 SYSV ABI that proves that the `exit` syscall works fine after the `prctl(PR_SET_SECCOMP, SECCOMP_MODE_STRICT)` call, if you're interested. — Nominal Animal, Nov 07 '16 at 00:51
Running the binary under `strace` (i.e. `strace ./example`) is enough to prove the C library uses the `exit_group` syscall instead of `exit`, though. — Nominal Animal, Nov 07 '16 at 00:57
Hey @NominalAnimal! :) Yeah, good idea, it was also mentioned in one of links, but wasn't displayed, updated! Do you like the answer now? — gsamaras, Nov 07 '16 at 01:29
Oh, I was happy enough with the answer as-is -- but, adding the `strace` does remove any possibility of doubt, allowing others to verify for themselves. I do like that. :) — Nominal Animal, Nov 07 '16 at 03:58
As to OP's stated question, I'd say GNU C library doing an `exit_group` syscall instead of an `exit` when there is only one thread in the process .. is a bug. I do not like the idea of bypassing library cleanup by calling the `exit` syscall directly. In other words, the OP *should* be able to just `return EXIT_SUCCESS;` or `exit(EXIT_SUCCESS);` without getting killed by a signal. Only a change to the C library internals will change that. Time to report a glibc bug, I'd say. — Nominal Animal, Nov 07 '16 at 04:04
I thought about it for a while, and decided that creating a custom filter (that allows `exit_group` syscall) makes more sense. For one, the code then works right now. I hope you don't mind me adding a separate answer, pursuing that tangent? — Nominal Animal, Nov 08 '16 at 22:41
Of course not @NominalAnimal, after all I alone mentioned the filter, glad you had time to expand on it, nice answer! — gsamaras, Nov 09 '16 at 12:11
Good! You see, I was thinking how a configurable seccomp filter might be useful for plugins running in a separate thread or process, and wanted to see how complicated it'd be. Not at all, it turns out. Efficient BPF generation is nontrivial, but definitely not difficult; simply a matter of experimentation and testing. The simplest filter generator would just build the tests in reverse order, least likely tested last, but be limited to 255 tests, and syscall number comparison only; the nontrivial part is avoiding those limits but staying efficient. — Nominal Animal, Nov 09 '16 at 17:44

score 9 · Answer 2 · answered Nov 08 '16 at 22:38

The problem occurs because, on Linux, the GNU C library implements the _exit() function with the exit_group syscall (when it is available) instead of exit (see sysdeps/unix/sysv/linux/_exit.c for verification), and, as documented in man 2 prctl, the exit_group syscall is not allowed by the strict seccomp filter.

Because the _exit() function call occurs inside the C library, we cannot interpose it with our own version (that would just do the exit syscall). (The normal process cleanup is done elsewhere; in Linux, the _exit() function only does the final syscall that terminates the process.)

We could ask the GNU C library developers to use the exit_group syscall in Linux only when there is more than one thread in the current process, but unfortunately that would not be an easy change, and even if it were added right now, it would take quite some time for the feature to become available on most Linux distributions.

Fortunately, we can ditch the default strict filter, and instead define our own. There is a small difference in behaviour: the apparent signal that kills the process will change from SIGKILL to SIGSYS. (The signal is not actually delivered, as the kernel does kill the process; only the apparent signal number that caused the process to die changes.)

Furthermore, this is not even that difficult. I did waste a bit of time looking into some GCC macro trickery that would make it trivial to manage the allowed syscalls' list, but I decided it would not be a good approach: the list of allowed syscalls should be carefully considered -- we only add exit_group() compared to the strict filter, here! -- so making it a bit difficult is okay.

The following code, say example.c, has been verified to work on a 4.4 kernel (should work on kernels 3.5 or later) on x86-64 (for both x86 and x86-64, i.e. 32-bit and 64-bit binaries). It should work on all Linux architectures, however, and it does not require or use the libseccomp library.

#define  _GNU_SOURCE
#include <stdlib.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <stdio.h>

static const struct sock_filter  strict_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),

    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),

    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

static const struct sock_fprog  strict = {
    .len = (unsigned short)( sizeof strict_filter / sizeof strict_filter[0] ),
    .filter = (struct sock_filter *)strict_filter
};

int main(void)
{
    /* To be able to set a custom filter, we need to set the "no new privs" flag.
       The Documentation/prctl/no_new_privs.txt file in the Linux kernel
       recommends this exact form: */
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
        fprintf(stderr, "Cannot set no_new_privs: %m.\n");
        return EXIT_FAILURE;
    }
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &strict)) {
        fprintf(stderr, "Cannot install seccomp filter: %m.\n");
        return EXIT_FAILURE;
    }

    /* The seccomp filter is now active.
       It differs from SECCOMP_MODE_STRICT in two ways:
         1. exit_group syscall is allowed; it just terminates the
            process
         2. Parent/reaper sees SIGSYS as the killing signal instead of
            SIGKILL, if the process tries to do a syscall not in the
            explicitly allowed list
    */

    return EXIT_SUCCESS;
}

Compile using e.g.

gcc -Wall -O2 example.c -o example

and run using

./example

or under strace to see the syscalls it makes:

strace ./example

The strict_filter BPF program is really trivial. The first opcode loads the syscall number into the accumulator. The next five opcodes compare it to an acceptable syscall number, and if found, jump to the final opcode that allows the syscall. Otherwise the second-to-last opcode kills the process.
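
One hardening step is worth sketching here. The filter above inspects only the syscall number, and the same number can name a different syscall under another ABI (a comment below raises this for 32-bit syscalls made from a 64-bit binary). A common remedy is to load the arch field of struct seccomp_data first and kill the process if it is not the architecture the binary was built for. The following is a sketch under that assumption, not a vetted policy; the names FILTER_ARCH, hardened_filter, hardened and install_hardened_filter are hypothetical, and only three architectures are covered.

```c
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/audit.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Hypothetical macro: the audit architecture this binary is compiled for. */
#if defined(__x86_64__)
#define FILTER_ARCH AUDIT_ARCH_X86_64
#elif defined(__i386__)
#define FILTER_ARCH AUDIT_ARCH_I386
#elif defined(__aarch64__)
#define FILTER_ARCH AUDIT_ARCH_AARCH64
#else
#error "Add the AUDIT_ARCH_* value for this architecture"
#endif

static const struct sock_filter hardened_filter[] = {
    /* Refuse syscalls made through any other ABI. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, arch))),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, FILTER_ARCH, 1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),

    /* Same allow-list as before, keyed on the syscall number. */
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, SYS_exit_group,   1, 0),

    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

static const struct sock_fprog hardened = {
    .len = (unsigned short)(sizeof hardened_filter / sizeof hardened_filter[0]),
    .filter = (struct sock_filter *)hardened_filter
};

/* Returns 0 on success, -1 (with errno set by prctl) on failure. */
int install_hardened_filter(void)
{
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0))
        return -1;
    if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &hardened))
        return -1;
    return 0;
}
```

The architecture test sits in front of the syscall-number load, so the skip counts of the five allow-list jumps stay exactly as they were.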

Note that although the documentation refers to sigreturn being the allowed syscall, the actual name of the syscall in Linux is rt_sigreturn. (sigreturn was deprecated in favour of rt_sigreturn ages ago.)

Furthermore, when the filter is installed, the opcodes are copied to kernel memory (see kernel/seccomp.c in the Linux kernel sources), so it does not affect the filter in any way if the data is modified later. Having the structures static const has zero security impact, in other words.

I used static since there is no need for the symbols to be visible outside this compilation unit (or in a stripped binary), and const to put the data into the read-only data section of the ELF binary.

The form of a BPF_JUMP(BPF_JMP | BPF_JEQ, nr, equals, differs) is simple: the accumulator (the syscall number) is compared to nr. If they are equal, then the next equals opcodes are skipped. Otherwise, the next differs opcodes are skipped.

Since the equals cases jump to the very final opcode, you can add new opcodes at the top (that is, just after the initial opcode), incrementing the equals skip count for each one.
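
Skip counts are easy to get wrong when entries are added, so it can help to replay the jump arithmetic in user space before installing a filter. The sketch below interprets only the three opcode forms used in this answer (absolute word load, jump-if-equal, return constant); it is not the kernel's BPF interpreter, and the names eval_filter, verdict_for and demo_filter are hypothetical. demo_filter is a copy of the filter shown earlier.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <linux/filter.h>
#include <linux/seccomp.h>

/* Copy of the answer's filter, used here only as test input. */
static const struct sock_filter demo_filter[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof(struct seccomp_data, nr))),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read,         4, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write,        3, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit,         2, 0),
    BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group,   1, 0),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
    BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};

/* Run a filter against one seccomp_data record and return its verdict.
   Anything outside the supported subset is treated as a kill. */
uint32_t eval_filter(const struct sock_filter *prog, size_t len,
                     const struct seccomp_data *data)
{
    uint32_t acc = 0;   /* the BPF accumulator */
    size_t pc = 0;      /* index of the opcode being executed */

    while (pc < len) {
        const struct sock_filter *op = &prog[pc];
        switch (op->code) {
        case BPF_LD | BPF_W | BPF_ABS:
            if (op->k > sizeof *data - sizeof acc)
                return SECCOMP_RET_KILL;
            memcpy(&acc, (const unsigned char *)data + op->k, sizeof acc);
            pc++;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:
            /* Skip jt opcodes on a match, jf opcodes otherwise. */
            pc += 1 + (acc == op->k ? op->jt : op->jf);
            break;
        case BPF_RET | BPF_K:
            return op->k;
        default:
            return SECCOMP_RET_KILL;
        }
    }
    return SECCOMP_RET_KILL;    /* ran off the end of the program */
}

/* Verdict of demo_filter for a given syscall number. */
uint32_t verdict_for(int nr)
{
    struct seccomp_data data;
    memset(&data, 0, sizeof data);
    data.nr = nr;
    return eval_filter(demo_filter,
                       sizeof demo_filter / sizeof demo_filter[0], &data);
}
```

After adding a new jump at the top and incrementing the other skip counts, every allowed syscall should still produce SECCOMP_RET_ALLOW, and anything else (for example SYS_getpid) should produce SECCOMP_RET_KILL.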

Note that printf() will not work after the seccomp filter is installed, because internally, the C library wants to do a fstat syscall (on standard output), and a brk syscall to allocate some memory for a buffer.
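
Since write is on the allowed list, output after the filter is installed can go through write(2) directly, as long as the text is already formatted in memory. A minimal sketch follows; the helper names write_all and write_str are hypothetical. It uses no syscall other than write, and strlen() and errno are plain user-space operations.

```c
#include <errno.h>
#include <string.h>
#include <unistd.h>

/* Write the whole buffer with write(2) only, retrying on short writes
   and on EINTR. Returns 0 on success, -1 on error. */
int write_all(int fd, const char *buf, size_t len)
{
    while (len > 0) {
        ssize_t n = write(fd, buf, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        buf += n;
        len -= (size_t)n;
    }
    return 0;
}

/* Convenience wrapper for NUL-terminated strings. */
int write_str(int fd, const char *s)
{
    return write_all(fd, s, strlen(s));
}
```

For example, write_str(STDOUT_FILENO, "done\n") could replace a printf("done\n") placed after the prctl() calls.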

This is one of those moments where I'd want to split the bounty. — , Nov 10 '16 at 21:47
@Rhymoid the bounty is like an atom! :) Don't worry though, Nominal is a cool guy! — gsamaras, Nov 12 '16 at 14:30
This is *NOT SECURE*; it is necessary to check the architecture number. If you compile this for x86-64, then in addition to x86-64 syscall 15 (rt_sigreturn) you have also allowed x86-32 syscall 15, which is chmod. Would you like your home directory world-writable and a setuid web browser cache? — Timothy Baldwin, Sep 10 '19 at 19:39
