Inline asm fails to compile without optimization

Question

I need to make the futex syscall from a 32-bit Linux process, but I cannot use the syscall function (its header is unavailable). It can still be done with inline asm, as follows:

#include <time.h>

#define SYS_futex 0xf0

// We need -fomit-frame-pointer in order to set EBP
__attribute__((optimize("-fomit-frame-pointer")))
int futex(int* uaddr, int futex_op, int val, const struct timespec* timeout, int* uaddr2, int val3)
{
    register int ebp asm ("ebp") = val3;
    int result;
    asm volatile("int $0x80"
                 : "=a"(result)
                 : "a"(SYS_futex), "b"(uaddr), "c"(futex_op), "d"(val), "S"(timeout), "D"(uaddr2), "r"(ebp)
                // : "memory"  // would make this safe, but could cause some unnecessary spills.  THIS VERSION IS UNSAFE ON PURPOSE, DO NOT USE.
          );
        
    if (result < 0)
    {
        // Error handling
        return -1;
    }
    return result;
}

That compiles, as expected.

However, since the asm statement doesn't declare which memory locations the syscall may read or write, the compiler is free to reorder or drop accesses to them, which could cause some sneaky bugs. So instead, we can use dummy memory inputs and outputs (see "How can I indicate that the memory *pointed* to by an inline ASM argument may be used?"):

asm volatile("int $0x80"
             : "=a"(result), "+m"(*uaddr2)
             : "a"(SYS_futex), "b"(uaddr), "c"(futex_op), "d"(val), "S"(timeout), "D"(uaddr2), "r"(ebp), "m"(*uaddr), "m"(*timeout));

When compiled with gcc -m32, it fails with the error "'asm' operand has impossible constraints". When compiled with clang -fomit-frame-pointer -m32, it fails with "inline assembly requires more registers than available". I don't see why, though.

However, when compiled with -O1 -m32 (or any optimization level other than -O0), it compiles fine.

I see two obvious solutions:

Use the "memory" clobber instead, which may be too restrictive, stopping the compiler from keeping unrelated variables in registers
Use __attribute__((optimize("-O3"))), which I'd like to avoid

Is there any other solution?

@NateEldredge I forgot to say, the first version compiles fine. The problem appears when I make the change in the second code block. https://godbolt.org/z/4Ko1eY — LHLaurini, Jul 21 '20 at 21:17
Quick note: I think the `futex` call can write the futex word, so don't you also need `*uaddr` as an output operand? — Nate Eldredge, Jul 21 '20 at 21:49
@NateEldredge AFAIK only `FUTEX_WAKE_OP` can do that, but it writes to `*uaddr2`, which is already listed as an output. I'll give the man page another look. — LHLaurini, Jul 21 '20 at 22:19
@NateEldredge Seems only `FUTEX_WAKE_OP` and PI-futex operations modify `*uaddr2`. I don't intend to use these, but I may add it as an output anyway, just in case. — LHLaurini, Jul 21 '20 at 22:26
Typical use of `futex` requires a memory barrier so that accesses to variables protected by the futex don't move past the futex, therefore there is little to be gained in avoiding a `"memory"` clobber. — Timothy Baldwin, Jan 23 '23 at 11:09

Nate Eldredge · Accepted Answer · 2020-07-21T21:38:04.123

The compiler doesn't know that you don't actually use the *uaddr and *timeout operands, so it still has to decide what %9 and %10 should expand to if you were to use them. The addresses of those objects were passed as parameters, so it can't generate a direct memory reference; it has to be indirect, and this means registers need to be allocated to store those addresses; for instance, the compiler could try to load the pointer uaddr into ecx and then expand %9 to (%ecx). Unfortunately, you have already claimed all the machine's registers for your other operands, so there are no registers available for this purpose.

With optimization on, the compiler is smart enough to figure out that the pointer uaddr is already available in ebx, so it can expand %9 to (%ebx) and likewise %10 to (%esi). Then it doesn't need any additional registers and everything is fine.

You can see this happening if you actually mention %9 and %10 in the inline asm, as in this example. With optimization on, it does as I said. Without optimization, it fails to compile as you know, but if we drop a couple of the other operands to free up some registers (here ecx and edx), we see that it is now expanding %7, %8 (they got renumbered) to (%edx), (%ecx), and loading those registers accordingly ahead of time. It doesn't know that this is redundant because edx and ebx both contain the same value.
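To see the mechanism in isolation, here is a minimal sketch (not from the original post; the function names are made up for illustration). The asm template is empty, so it should build on any GCC or Clang target, yet the compiler still has to come up with an addressing mode for each "m" operand. At -O0 that generally means loading the pointer into a register of its own, which is exactly what runs out in the futex case.

```c
/* Hypothetical helpers: the asm emits no instructions, it only tells the
   compiler which object the asm is assumed to access.
   __asm__ is used instead of asm so this also builds in strict ISO modes. */

/* The asm is assumed to read and write *p (dummy in/out memory operand). */
static inline void asm_touches_int(int *p)
{
    __asm__ volatile("" : "+m"(*p));
}

/* The asm is assumed to read *p without modifying it (dummy memory input). */
static inline void asm_reads_int(const int *p)
{
    __asm__ volatile("" : : "m"(*p));
}

int demo(void)
{
    int x = 41;
    asm_touches_int(&x);  /* compiler must assume x may have changed here */
    x += 1;
    asm_reads_int(&x);    /* compiler must have stored the new x before this */
    return x;
}
```

Here demo() returns 42. Unlike a "memory" clobber, these operands only constrain what the compiler may assume about x, not about every object in memory.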

I don't think there's any good way to avoid this except the ideas you already have: enable optimization, or use the "memory" clobber. I doubt the "memory" clobber will actually affect the generated code in such a short function, and anyway, if you're compiling without optimizations, you've kind of abandoned any hope of efficient code already. Alternatively, just write the entire function in assembly.
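For reference, the "memory" clobber variant might look like the sketch below. It is the question's function with the clobber uncommented. The names futex_clobber, SKETCH_SYS_FUTEX, SKETCH_FUTEX_WAIT and SKETCH_FUTEX_WAKE are made up here, and the #else branch exists only so the sketch also builds on non-i386 Linux targets (it falls back to the libc syscall function, which the question cannot use).

```c
#ifndef _GNU_SOURCE
#define _GNU_SOURCE            /* only needed for syscall() in the fallback */
#endif
#include <stddef.h>
#include <time.h>

#define SKETCH_FUTEX_WAIT 0    /* FUTEX_WAIT operation code */
#define SKETCH_FUTEX_WAKE 1    /* FUTEX_WAKE operation code */

#if defined(__i386__)
#define SKETCH_SYS_FUTEX 0xf0  /* futex syscall number on 32-bit x86, as in the question */

/* GCC needs the frame pointer omitted before a variable can be bound to EBP.
   Clang ignores this attribute and needs -fomit-frame-pointer instead. */
__attribute__((optimize("-fomit-frame-pointer")))
int futex_clobber(int *uaddr, int futex_op, int val,
                  const struct timespec *timeout, int *uaddr2, int val3)
{
    register int ebp __asm__("ebp") = val3;
    int result;
    __asm__ volatile("int $0x80"
                     : "=a"(result)
                     : "a"(SKETCH_SYS_FUTEX), "b"(uaddr), "c"(futex_op), "d"(val),
                       "S"(timeout), "D"(uaddr2), "r"(ebp)
                     : "memory"); /* the asm may read or write any memory */
    return result < 0 ? -1 : result; /* the kernel returns -errno on failure */
}
#else
/* Assumption: this branch is NOT the technique under discussion. It only
   lets the sketch build and run on other Linux targets. */
#include <sys/syscall.h>
#include <unistd.h>

int futex_clobber(int *uaddr, int futex_op, int val,
                  const struct timespec *timeout, int *uaddr2, int val3)
{
    long result = syscall(SYS_futex, uaddr, futex_op, val, timeout, uaddr2, val3);
    return result < 0 ? -1 : (int)result;
}
#endif
```

With no waiters, a FUTEX_WAKE call should return 0, and a FUTEX_WAIT whose expected value does not match the futex word fails immediately (EAGAIN), which this wrapper reports as -1. Because the clobber adds no operands, no extra registers are needed and the i386 branch should compile at -O0 just like the first version in the question.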

It strikes me it'd be nice if gcc/clang had an "implicit" constraint, saying that the asm will read/write the referenced object, but will figure out for itself how to do that and the compiler doesn't need to worry about generating the reference. — Nate Eldredge, Jul 21 '20 at 22:10
