Is it possible to temporarily suppress Intel CET for a single ret instruction, or otherwise use retpolines with it?

Question

Intel CET (control-flow enforcement technology) consists of two pieces: SS (shadow stack) and IBT (indirect branch tracking). If you need to indirectly branch to somewhere that you can't put an endbr64 for some reason, you can suppress IBT for a single jmp or call instruction with notrack. Is there an equivalent way to suppress SS for a single ret instruction?

For context, I'm thinking about how this will interact with retpolines, whose key control flow goes more or less like push real_target; call retpoline; pop junk; ret. If there's no way to suppress SS for that ret, is there some other way for retpolines to work when CET is enabled? If not, what options will we have? Will we need to maintain two sets of binary packages for everything, one for old CPUs that need retpolines and one for new CPUs that support CET? And what if Intel turns out to be wrong, and we do end up still needing retpolines on their new CPUs? Will we have to abandon CET to use them?

score 0 · Accepted Answer · answered Jun 17 '20 at 21:53

After playing with the assembly for a bit, I discovered that you can use retpolines with CET, but it's less than ideal. Here's how. For reference, consider this C code:

extern void (*fp)(void);

int f(void) {
    fp();
    return 0;
}

Compiling it with gcc -mindirect-branch=thunk -mfunction-return=thunk -O3 yields this:

f:
        subq    $8, %rsp
        movq    fp(%rip), %rax
        call    __x86_indirect_thunk_rax
        xorl    %eax, %eax
        addq    $8, %rsp
        jmp     __x86_return_thunk
__x86_return_thunk:
        call    .LIND1
.LIND0:
        pause
        lfence
        jmp     .LIND0
.LIND1:
        lea     8(%rsp), %rsp
        ret
__x86_indirect_thunk_rax:
        call    .LIND3
.LIND2:
        pause
        lfence
        jmp     .LIND2
.LIND3:
        mov     %rax, (%rsp)
        ret

It turns out you can make this work just by modifying the thunks to look like this:

__x86_return_thunk:
        call    .LIND1
.LIND0:
        pause
        lfence
        jmp     .LIND0
.LIND1:
        push    %rdi
        movl    $1, %edi
        incsspq %rdi
        pop     %rdi
        lea     8(%rsp), %rsp
        ret

__x86_indirect_thunk_rax:
        call    .LIND3
.LIND2:
        pause
        lfence
        jmp     .LIND2
.LIND3:
        push    %rdi
        rdsspq  %rdi
        wrssq   %rax, (%rdi)
        pop     %rdi
        mov     %rax, (%rsp)
        ret

By using the incsspq, rdsspq, and wrssq instructions, you can modify the shadow stack to match your changes to the real stack. I tested those modified thunks with Intel SDE, and they indeed made the control flow errors go away.
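
To make the bookkeeping concrete, here is a toy model in C of what the two thunks do to the normal stack and the shadow stack. It is only a sketch: the `machine` struct, the helper names, and the addresses 0x1000/0x2000/0x3000 are made up for illustration, the push/pop of the scratch register is left out, and nothing here touches real CET state. It just shows why the final ret mismatches without the extra instructions and matches with them.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Toy model only: both stacks hold nothing but return addresses. */
enum { DEPTH = 16 };

struct machine {
    uint64_t stack[DEPTH];   /* normal stack */
    uint64_t shadow[DEPTH];  /* shadow stack */
    size_t sp, ssp;          /* number of live entries on each stack */
    bool fault;              /* set when a ret sees mismatching addresses */
};

/* call: the return address is pushed on both stacks. */
static void do_call(struct machine *m, uint64_t ret_addr) {
    assert(m->sp < DEPTH && m->ssp < DEPTH);
    m->stack[m->sp++] = ret_addr;
    m->shadow[m->ssp++] = ret_addr;
}

/* ret: pop both stacks and compare; a mismatch models the CET fault. */
static uint64_t do_ret(struct machine *m) {
    assert(m->sp > 0 && m->ssp > 0);
    uint64_t addr = m->stack[--m->sp];
    uint64_t shadow_addr = m->shadow[--m->ssp];
    if (addr != shadow_addr)
        m->fault = true;
    return addr;
}

/* lea 8(%rsp), %rsp: drop the top entry of the normal stack. */
static void do_drop(struct machine *m) { assert(m->sp > 0); m->sp--; }

/* mov %rax, (%rsp): overwrite the top entry of the normal stack. */
static void do_store_top(struct machine *m, uint64_t v) {
    assert(m->sp > 0);
    m->stack[m->sp - 1] = v;
}

/* incsspq with a count of 1: drop the top entry of the shadow stack. */
static void do_incssp(struct machine *m) { assert(m->ssp > 0); m->ssp--; }

/* rdsspq + wrssq: overwrite the top entry of the shadow stack. */
static void do_wrss_top(struct machine *m, uint64_t v) {
    assert(m->ssp > 0);
    m->shadow[m->ssp - 1] = v;
}

/* Models __x86_return_thunk; returns true if the final ret faults. */
static bool return_thunk_faults(bool fix_shadow) {
    struct machine m = {0};
    do_call(&m, 0x1000);        /* someone called f; f jumps to the thunk */
    do_call(&m, 0x2000);        /* call .LIND1 pushes the address of .LIND0 */
    if (fix_shadow)
        do_incssp(&m);          /* discard the .LIND0 shadow entry */
    do_drop(&m);                /* discard the .LIND0 normal entry */
    do_ret(&m);
    return m.fault;
}

/* Models __x86_indirect_thunk_rax; returns true if the final ret faults. */
static bool indirect_thunk_faults(bool fix_shadow, uint64_t target) {
    struct machine m = {0};
    do_call(&m, 0x1000);        /* f calls the thunk */
    do_call(&m, 0x2000);        /* call .LIND3 pushes the address of .LIND2 */
    if (fix_shadow)
        do_wrss_top(&m, target); /* rewrite the shadow entry to the target */
    do_store_top(&m, target);   /* rewrite the normal entry to the target */
    do_ret(&m);
    return m.fault;
}
```

In this model, `return_thunk_faults(false)` and `indirect_thunk_faults(false, 0x3000)` both report a fault, while passing `true` makes the two stacks agree again, which mirrors what the modified thunks did under SDE.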

That was the good news. Here's the bad news:

- Unlike endbr64 (and rdsspq, which is also encoded in the hint-NOP space), incsspq and wrssq aren't NOPs on CPUs that don't support CET (they result in SIGILL). This means you'd need two different sets of thunks, and you'd need to use CPU dispatch to pick the right ones depending on whether CET is available.
- Using retpolines at all means that you're no longer doing any indirect branches, so while you'll still get the benefit of SS, you've completely negated IBT. I suppose you could work around this by making __x86_indirect_thunk_rax check for the presence of the endbr64 instruction at the target, but that's really inelegant and would probably be really slow.
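
For what it's worth, here is a rough C sketch of the two checks mentioned in those points. The names `cet_features`, `cpu_cet_features`, `pick_thunks`, and `starts_with_endbr64` are hypothetical, and it assumes GCC or clang with `<cpuid.h>`. Note that CPUID only says the hardware supports CET; a real dispatcher would also have to find out whether the OS has actually enabled shadow stacks for the process, which this sketch doesn't attempt.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#if defined(__x86_64__) || defined(__i386__)
#include <cpuid.h>   /* GCC/clang helper header; assumed to be available */
#endif

/* Hypothetical summary of what the CPU advertises. */
struct cet_features {
    bool shadow_stack;  /* CPUID.(EAX=7,ECX=0):ECX bit 7  (CET_SS)  */
    bool ibt;           /* CPUID.(EAX=7,ECX=0):EDX bit 20 (CET_IBT) */
};

/* Hardware support only; says nothing about whether the OS enabled CET. */
static struct cet_features cpu_cet_features(void) {
    struct cet_features f = { false, false };
#if defined(__x86_64__) || defined(__i386__)
    unsigned int eax, ebx, ecx, edx;
    if (__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx)) {
        f.shadow_stack = ((ecx >> 7) & 1u) != 0;
        f.ibt = ((edx >> 20) & 1u) != 0;
    }
#endif
    return f;
}

/* Hypothetical dispatch: which set of thunks to install. */
typedef enum { THUNKS_PLAIN, THUNKS_CET } thunk_kind;

static thunk_kind pick_thunks(struct cet_features f) {
    return f.shadow_stack ? THUNKS_CET : THUNKS_PLAIN;
}

/* The software stand-in for IBT: does the target begin with endbr64? */
static bool starts_with_endbr64(const void *target) {
    static const unsigned char endbr64[4] = { 0xF3, 0x0F, 0x1E, 0xFA };
    assert(target != NULL);
    return memcmp(target, endbr64, sizeof endbr64) == 0;
}
```

A startup routine could call `pick_thunks(cpu_cet_features())` once and patch or select the thunks accordingly. The `starts_with_endbr64` check is the part that would have to run on every indirect call, which is why it would probably be slow.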

R11 is call-clobbered and not used for arg passing; you should be able to use it instead of RDI for `incsspq` even in a transparent wrapper like those thunks. (One reason that x86-64 SysV *has* a call-clobbered reg that's not used to pass/return anything is exactly so that wrappers / thunks have a scratch reg, e.g. for lazy dynamic linking.) — Peter Cordes, Jun 17 '20 at 22:29
@PeterCordes I thought about that, but GCC [uses the same thunk](https://godbolt.org/z/yvK5cJ) for function calls as it does for [labels as values (`goto *x;`)](https://gcc.gnu.org/onlinedocs/gcc/Labels-as-Values.html), and I'm not 100% sure that it won't ever assume r11 is preserved through them. — Joseph Sible-Reinstate Monica, Jun 17 '20 at 22:51
Ah I see. I guess that's why GCC is using `lea` instead of `add` in its thunks, so it doesn't even clobber FLAGS. It does already step on the red zone, so it can't just use it arbitrarily for a `switch` in a leaf function, though. If you were modifying GCC to emit this, hopefully you could teach it that it clobbers R11. — Peter Cordes, Jun 17 '20 at 23:38
@PeterCordes It looks like GCC can use whatever register it wants to hold the address to jump to, [including `r11`](https://github.com/torvalds/linux/blob/v5.7/arch/x86/lib/retpoline.S#L43). And clang is even worse: [it *always* uses `r11`](https://releases.llvm.org/6.0.0/tools/clang/docs/ReleaseNotes.html). — Joseph Sible-Reinstate Monica, Jun 18 '20 at 00:31
Doesn't the Linux kernel do its own inline asm, not relying on GCC? I didn't find a definition of `GENERATE_THUNK` on github, but [`arch/x86/include/asm/nospec-branch.h`](https://github.com/torvalds/linux/blob/08bf1a27c4c354b853fd81a79e953525bbcc8506/arch/x86/include/asm/nospec-branch.h) has a lot of manual inline asm. — Peter Cordes, Jun 18 '20 at 00:37
@PeterCordes The Linux kernel provides its own thunks and a few calls, but AFAIK the compiler still generates some of the calls to them, and the caller decides which register to use for the target address. (And `GENERATE_THUNK` is defined right above where it's used.) — Joseph Sible-Reinstate Monica, Jun 18 '20 at 00:38
Ah right. If they were purely using inline asm they'd just use a dummy `"=r"` output for the scratch reg, or a fixed choice, and wouldn't need to emit definitions for every possible choice. Since Linux is a freestanding kernel, it needs to define stuff that would normally be in libgcc that GCC can emit references to. — Peter Cordes, Jun 18 '20 at 00:41
