
I'm writing an eBPF kprobe that checks task UIDs, verifying that the only UID changes between calls to execve are those permitted by setuid(), seteuid() and setreuid() calls.

Since the probe checks all tasks, it uses an unrolled loop that iterates starting from init_task, and it has to use at most 1024 or 8192 branches, depending on kernel version.

My question is, how to implement a check that returns nonzero if there is an illegal change, defined by:

(new_ruid != old_euid && new_ruid != old_ruid) ||
(new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid)

but without using branches (clang emits conditional jumps to short-circuit the check as soon as one of the && / || sub-expressions determines the result).
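
For reference, here is a minimal sketch of the check as I would write it naively (the types and the __always_inline attribute assume the usual BPF headers; in the real probe the values are read from the task's credentials):

// Naive version: clang lowers the short-circuiting && / || into
// conditional jumps, each of which counts against the verifier's branch limit.
static __always_inline int uid_change_illegal(__u32 new_ruid, __u32 new_euid,
                                              __u32 old_ruid, __u32 old_euid,
                                              __u32 old_suid)
{
    return (new_ruid != old_euid && new_ruid != old_ruid) ||
           (new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid);
}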

  • Why do you want to avoid branches? Are you hitting complexity issues? – pchaigno Feb 25 '22 at 14:41
  • There is a limit of 1024 total branches per bpf kprobe, and using more than zero branches per task limits the number of tasks I can check. – patraulea Feb 25 '22 at 14:44
  • Perhaps use `&` and `|` to avoid short-circuiting? – Ian Abbott Feb 25 '22 at 14:49
  • @IanAbbott I think on x86 using `&` and `|` would use LAHF or similar instructions to copy ZF into a register, but the bpf instruction set does not have eflags AFAICT: https://github.com/iovisor/bpf-docs/blob/master/eBPF.md – patraulea Feb 25 '22 at 14:54
  • @patraulea perhaps add a minimal reproducible example, just a simple file with headers and one BPF function with one `if` branch containing the logic, so that we can compile and test that. – Marco Bonelli Feb 25 '22 at 15:00
  • In principle, you could do `((new_ruid - old_euid)*(new_ruid - old_ruid)) | ((new_euid - old_euid)*(new_euid - old_ruid)*(new_euid - old_suid))`. Or replace `-` with `^`. The problem, though, is that the multiplication may overflow. – Nate Eldredge Feb 25 '22 at 15:33

2 Answers


You should be able to do this using bitwise OR, XOR, shifts and integer multiplication. I assume your variables are all __s32 or __u32; cast them to __u64 before proceeding to avoid problems (otherwise, cast every operand of the multiplications below to __u64).

Clearly a != b can become a ^ b. The && is a bit trickier, but can be translated into a multiplication (where if any operand is 0 the result is 0). The first part of your condition then becomes:

// (new_ruid != old_euid && new_ruid != old_ruid)
__u64 x = (new_ruid ^ old_euid) * (new_ruid ^ old_ruid);

However for the second part we have an overflow problem since there are 3 conditions. You can avoid it by "compressing" the result of the first two into the lower 32 bits, since you don't really care about the multiplication, just about its "truthiness":

// (new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid)

__u64 y = (new_euid ^ old_euid) * (new_euid ^ old_ruid);
y = (y >> 32) | (y & 0xffffffff);
y *= (new_euid ^ old_suid);

And finally just OR the two parts for the result. You can also "compress" again to the lower 32 bits if you want a __u32:

__u64 res = x | y;
// or
__u64 tmp = x | y;
__u32 res = (tmp >> 32) | (tmp & 0xffffffff);

All of the above combined compiles without any branch for me regardless of optimization level.
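
Put together, a sketch of the whole check could look like this (function and variable names are illustrative, and the inputs are assumed to be __u32 values already read from the old and new credentials):

// Returns nonzero iff the UID change is illegal, without any conditional branch.
static __always_inline __u64 uid_change_illegal(__u32 new_ruid, __u32 new_euid,
                                                __u32 old_ruid, __u32 old_euid,
                                                __u32 old_suid)
{
    // (new_ruid != old_euid && new_ruid != old_ruid)
    __u64 x = (__u64)(new_ruid ^ old_euid) * (__u64)(new_ruid ^ old_ruid);

    // (new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid)
    __u64 y = (__u64)(new_euid ^ old_euid) * (__u64)(new_euid ^ old_ruid);
    y = (y >> 32) | (y & 0xffffffff);  // compress so the next multiply cannot overflow
    y *= (__u64)(new_euid ^ old_suid);

    return x | y;
}

Note the casts of the XOR results to __u64: without them the multiplications are performed in 32 bits and can silently truncate (see the comments below).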

Marco Bonelli
  • Thanks, I'll go via this multiplication route. I think you have to cast the xor results to u64 before doing the multiplication though, that's where the `r7 <<= 32; r7 >>= 32` opcodes come from. Compiling this with clang -O3: `__u64 f() { unsigned a = 0x80001000, b = 0x00001000, c = 0x00001000; __u64 ret = (a ^ b) * (a ^ c); return ret; }` results in `return 0;` without the casts. – patraulea Feb 25 '22 at 18:08
  • @patraulea you are most certainly right, the asm looks wrong, I forgot to declare the vars as `__u64` instead of `__u32` in my code. I'll remove the disassembly snippet and make it clearer. – Marco Bonelli Feb 25 '22 at 18:45
  • @patraulea: Yup, in C if you want widening multiplication of 32-bit inputs, you need something like `res = a * (uint64_t)b`, otherwise `u64 res = a * b;` is just doing a 32-bit truncating multiply and then widening that result. I assume eBPF is the same. – Peter Cordes Feb 26 '22 at 00:28
  • @PeterCordes yes, the language is C so the C standard still applies, it's just a different target architecture. – Marco Bonelli Feb 26 '22 at 23:24

Following the other answer, there are better reduction functions than folding the top bits over the low bits.

Starting with the original problem, the generated code is actually not that bad.

bool func0(uint64_t new_ruid,uint64_t old_euid, uint64_t old_ruid,
           uint64_t new_euid, uint64_t old_suid) {
   return
      (new_ruid != old_euid && new_ruid != old_ruid) ||
      (new_euid != old_euid && new_euid != old_ruid && new_euid != old_suid);
}

_Z5func0mmmmm:                          # @_Z5func0mmmmm
    cmpq    %rsi, %rdi
    je      .LBB0_2
    movb    $1, %al
    cmpq    %rdx, %rdi
    je      .LBB0_2
    retq
.LBB0_2:
    cmpq    %rsi, %rcx
    setne   %al
    cmpq    %rdx, %rcx
    setne   %dl
    andb    %al, %dl
    cmpq    %r8, %rcx
    setne   %al
    andb    %dl, %al
    retq

Only the first portion contains conditional branches, while the latter part is fully converted to conditional-set (setne) instructions.

Thus we have about three candidates for reduction:

uint32_t reduct1(uint64_t a, uint64_t b) {
    a ^= b;
    return (a >> 32) | (a & 0xffffffff);
}

    movq    %rdi, %rax
    xorq    %rsi, %rax
    movq    %rax, %rcx
    shrq    $32, %rcx
    orl     %ecx, %eax


uint8_t reduct2(uint64_t a, uint64_t b) {
   return __builtin_popcountl(a ^ b);
}
    xorq    %rsi, %rdi
    popcntq %rdi, %rax


uint8_t reduct3(uint64_t a, uint64_t b) {
    return a != b ? 1u : 0u;
}
    cmpq    %rsi, %rdi
    setne   %al

The first variant is quite bloated compared to the other two, in addition to suffering from the multiplicative-overflow problem. The popcount variant returns values in the range 1-64 when the input variables differ, so the multiplicative overflow is solved.

However, with the third variant the multiplication is strength-reduced to a logical AND, which is much faster than a multiply, and it also allows a non-destructive comparison of the operands, which probably beats the popcount version (which, on the other hand, has more opportunities for instruction-level parallelism).

bool func( uint64_t new_ruid, uint64_t old_euid, uint64_t old_ruid,
uint64_t new_euid, uint64_t old_suid) {
    auto a = reduct3(new_ruid, old_euid);
    auto b = reduct3(new_ruid, old_ruid);
    auto c = reduct3(new_euid, old_euid);
    auto d = reduct3(new_euid, old_ruid);
    auto e = reduct3(new_euid, old_suid);
   return (a * b) || (c * d * e);
}
    cmpq    %rsi, %rdi
    setne   %r9b
    cmpq    %rdx, %rdi
    setne   %dil
    cmpq    %rsi, %rcx
    setne   %sil
    cmpq    %rdx, %rcx
    setne   %al
    cmpq    %r8, %rcx
    setne   %cl
    andb    %r9b, %dil
    andb    %sil, %al
    andb    %cl, %al
    orb     %dil, %al

The problem can also be solved in the SSE2 domain, with careful selection of parameters:

   XMM0 = new_ruid | new_euid
   XMM1 = old_euid | old_euid
   XMM2 = old_ruid | old_ruid
   XMM3 = old_ruid | old_suid <- we compare new_ruid twice to same value

   XMM1 = _mm_cmpeq_epi64(XMM1, XMM0);
   XMM2 = _mm_cmpeq_epi64(XMM2, XMM0);
   XMM3 = _mm_cmpeq_epi64(XMM3, XMM0);
   XMM1 = _mm_or_si128(XMM1, XMM2);
   XMM1 = _mm_or_si128(XMM1, XMM3);

On SSE2 there's only ==, not !=, so we need to apply De Morgan a few times, then combine the left and right parts, and movemask or movq the result out. (IDK if it's allowed to use SSE2 in the Linux kernel, so this part may not fit the requirement.)
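
For completeness, a user-space sketch of that idea with intrinsics could look as follows (names are illustrative; note that _mm_cmpeq_epi64 is technically SSE4.1 rather than plain SSE2, and none of this is usable from eBPF or kernel code):

#include <stdint.h>
#include <smmintrin.h>  // _mm_cmpeq_epi64 is SSE4.1; compile with -msse4.1

// Lane 0 holds the new_ruid comparisons, lane 1 the new_euid comparisons.
static int uid_change_illegal_sse(uint64_t new_ruid, uint64_t new_euid,
                                  uint64_t old_ruid, uint64_t old_euid,
                                  uint64_t old_suid)
{
    __m128i new_ids = _mm_set_epi64x(new_euid, new_ruid);
    __m128i eq = _mm_or_si128(
        _mm_or_si128(
            _mm_cmpeq_epi64(new_ids, _mm_set_epi64x(old_euid, old_euid)),
            _mm_cmpeq_epi64(new_ids, _mm_set_epi64x(old_ruid, old_ruid))),
        _mm_cmpeq_epi64(new_ids, _mm_set_epi64x(old_suid, old_ruid)));

    // By De Morgan, an all-zero lane means "this new id differs from every
    // allowed old id", i.e. an illegal change in that lane.
    return _mm_movemask_epi8(eq) != 0xffff;
}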

Aki Suihkonen
  • Pardon my c++/auto. That can be trivially replaced by `uint32_t`... – Aki Suihkonen Feb 26 '22 at 08:15
  • Pretty sure eBPF compilers don't support SIMD intrinsics; the code runs inside the kernel, I think. Interesting to answer the question for other problem domains, but just be aware of the use-case this question is actually asking for. – Peter Cordes Feb 26 '22 at 08:47
  • Yeah, as I understand it, the target architecture here isn't x86, but the [eBPF virtual machine](https://github.com/iovisor/bpf-docs/blob/master/eBPF.md). If it had conditional set instructions like x86 `setcc`, let alone any kind of SIMD, the problem would be easy to solve. But it doesn't, hence the need to abuse multiply and xor and so forth. – Nate Eldredge Feb 26 '22 at 16:45
  • This is a pretty good answer... however we are talking about eBPF here, which is not x86, but a different machine language :') (see [this example](https://godbolt.org/z/h5aMTvv48)). While eBPF bytecode can be JITed to x86 (and in fact Linux has a JIT for eBPF on x86) you cannot write an eBPF program as x86 directly or even compile it to x86. eBPF machine code is quite limited, so while my answer is indeed suboptimal for x86 or really any other real world CPU, you can't do much better than that for eBPF unfortunately. – Marco Bonelli Feb 26 '22 at 23:21