The problem occurs, because the GNU C library uses the exit_group
syscall, if it is available, in Linux instead of exit
, for the _exit()
function (see sysdeps/unix/sysv/linux/_exit.c
for verification), and as documented in the man 2 prctl
, the exit_group
syscall is not allowed by the strict seccomp filter.
Because the _exit()
function call occurs inside the C library, we cannot interpose it with our own version (that would just do the exit
syscall). (The normal process cleanup is done elsewhere; in Linux, the _exit()
function only does the final syscall that terminates the process.)
We could ask the GNU C library developers to use the exit_group
syscall in Linux only when there are more than one thread in the current process, but unfortunately, it would not be easy, and even if added right now, would take quite some time for the feature to be available on most Linux distributions.
Fortunately, we can ditch the default strict filter, and instead define our own. There is a small difference in behaviour: the apparent signal that kills the process will change from SIGKILL
to SIGSYS
. (The signal is not actually delivered, as the kernel does kill the process; only the apparent signal number that caused the process to die changes.)
Furthermore, this is not even that difficult. I did waste a bit of time looking into some GCC macro trickery that would make it trivial to manage the allowed syscalls' list, but I decided it would not be a good approach: the list of allowed syscalls should be carefully considered -- we only add exit_group()
compared to the strict filter, here! -- so making it a bit difficult is okay.
The following code, say example.c
, has been verified to work on a 4.4 kernel (should work on kernels 3.5 or later) on x86-64 (for both x86 and x86-64, i.e. 32-bit and 64-bit binaries). It should work on all Linux architectures, however, and it does not require or use the libseccomp library.
#define _GNU_SOURCE
#include <stdlib.h>
#include <stddef.h>
#include <sys/prctl.h>
#include <sys/syscall.h>
#include <linux/seccomp.h>
#include <linux/filter.h>
#include <stdio.h>
static const struct sock_filter strict_filter[] = {
BPF_STMT(BPF_LD | BPF_W | BPF_ABS, (offsetof (struct seccomp_data, nr))),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_rt_sigreturn, 5, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_read, 4, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_write, 3, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit, 2, 0),
BPF_JUMP(BPF_JMP | BPF_JEQ, SYS_exit_group, 1, 0),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_KILL),
BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW)
};
static const struct sock_fprog strict = {
.len = (unsigned short)( sizeof strict_filter / sizeof strict_filter[0] ),
.filter = (struct sock_filter *)strict_filter
};
int main(void)
{
/* To be able to set a custom filter, we need to set the "no new privs" flag.
The Documentation/prctl/no_new_privs.txt file in the Linux kernel
recommends this exact form: */
if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) {
fprintf(stderr, "Cannot set no_new_privs: %m.\n");
return EXIT_FAILURE;
}
if (prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &strict)) {
fprintf(stderr, "Cannot install seccomp filter: %m.\n");
return EXIT_FAILURE;
}
/* The seccomp filter is now active.
It differs from SECCOMP_SET_MODE_STRICT in two ways:
1. exit_group syscall is allowed; it just terminates the
process
2. Parent/reaper sees SIGSYS as the killing signal instead of
SIGKILL, if the process tries to do a syscall not in the
explicitly allowed list
*/
return EXIT_SUCCESS;
}
Compile using e.g.
gcc -Wall -O2 example.c -o example
and run using
./example
or under strace
to see the syscalls and library calls done;
strace ./example
The strict_filter
BPF program is really trivial. The first opcode loads the syscall number into the accumulator. The next five opcodes compare it to an acceptable syscall number, and if found, jump to the final opcode that allows the syscall. Otherwise the second-to-last opcode kills the process.
Note that although the documentation refers to sigreturn
being the allowed syscall, the actual name of the syscall in Linux is rt_sigreturn
. (sigreturn
was deprecated in favour of rt_sigreturn
ages ago.)
Furthermore, when the filter is installed, the opcodes are copied to kernel memory (see kernel/seccomp.c
in the Linux kernel sources), so it does not affect the filter in any way if the data is modified later. Having the structures static const
has zero security impact, in other words.
I used static
since there is no need for the symbols to be visible outside this compilation unit (or in a stripped binary), and const
to put the data into the read-only data section of the ELF binary.
The form of a BPF_JUMP(BPF_JMP | BPF_JEQ, nr, equals, differs)
is simple: the accumulator (the syscall number) is compared to nr
. If they are equal, then the next equals
opcodes are skipped. Otherwise, the next differs
opcodes are skipped.
Since the equals cases jump to the very final opcode, you can add new opcodes at the top (that is, just after the initial opcode), incrementing the equals skip count for each one.
Note that printf()
will not work after the seccomp filter is installed, because internally, the C library wants to do a fstat
syscall (on standard output), and a brk
syscall to allocate some memory for a buffer.