I am creating a fiber threading system in C, following https://graphitemaster.github.io/fibers/ . I have functions to set and restore a context, and what I am trying to accomplish is launching a function as a fiber with its own stack. Platform: Linux, x86_64 SysV ABI.

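#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Layout inferred from the offsets restore_context reads (8*0 .. 8*7). */
struct fiber_context
{
    void *rip, *rsp;
    void *rbx, *rbp, *r12, *r13, *r14, *r15;
};

extern void restore_context(struct fiber_context*);
extern void create_context(struct fiber_context*);
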
void foo_fiber()
{
    printf("Called as a fiber");
    exit(0);
}

int main()
{
    const uint32_t stack_size = 4096 * 16;
    const uint32_t red_zone_abi = 128;

    char* stack = aligned_alloc(16, stack_size);
    char* sp = stack + stack_size - red_zone_abi;

    struct fiber_context c = {0};
    c.rip = (void*)foo_fiber;
    c.rsp = (void*)sp;

    restore_context(&c);
}

The code of restore_context is as follows:

.type restore_context, @function
.global restore_context
restore_context:
  movq 8*0(%rdi), %r8

  # Load new stack pointer.
  movq 8*1(%rdi), %rsp

  # Load preserved registers.
  movq 8*2(%rdi), %rbx
  movq 8*3(%rdi), %rbp
  movq 8*4(%rdi), %r12
  movq 8*5(%rdi), %r13
  movq 8*6(%rdi), %r14
  movq 8*7(%rdi), %r15

  # Push RIP to stack for RET.
  pushq %r8

  xorl %eax, %eax
  ret

So basically I am creating a new stack on the heap, and since the stack grows downwards, I take the end address minus 128 bytes of red zone (which I understood to be required by the ABI). What restore_context does is simply switch %rsp to my new stack, push the address of foo_fiber onto it, and then ret to jump into foo_fiber. (It also loads some registers from the fiber_context structure, but that should not matter here.)

From what I am seeing in GDB, the program properly jumps to foo_fiber and into printf, and then it crashes in __vfprintf_internal on movaps %xmm1,0x10(%rsp).

   0x7ffff7e2f389 <__vfprintf_internal+153>   movdqu (%rax),%xmm1
   0x7ffff7e2f38d <__vfprintf_internal+157>   movups %xmm1,0x128(%rsp)
   0x7ffff7e2f395 <__vfprintf_internal+165>   mov    0x10(%rax),%rax
 > 0x7ffff7e2f399 <__vfprintf_internal+169>   movaps %xmm1,0x10(%rsp)

I find that extremely odd, since movups %xmm1,0x128(%rsp) succeeded at a much higher offset from the stack pointer. What is going on there?

If I change foo_fiber to do something else, for example allocate and randomly fill a char[100], it works.

I am at a loss about what is going on. At first I thought I might have alignment issues, since it is the vector (xmm) instructions that crash, so I changed malloc to aligned_alloc. The crash I am getting is a SIGSEGV, but 0x10

Peter Cordes
wild-ptr
  • 2
    You messed up stack alignment. `movups` is an unaligned store, so it doesn't care; that's why it worked. `movaps` requires alignment. You did not dump the registers, but I bet `rsp` at the crash is not a multiple of 16. – Jester Feb 25 '22 at 00:11
  • I was thinking about that, since yes, the instruction requires proper alignment. But my stack was created using aligned_alloc with 16-byte alignment, and its length is also a multiple of 16, so both ends of the stack are 16-aligned according to the C code that created it. Then the red zone is subtracted, which is 128 bytes, so it should still be 16-aligned. And in the end, 0x10 is also a 16-aligned offset. But yes, somehow RSP is only 8-aligned. Where did I go wrong? – wild-ptr Feb 25 '22 at 00:30
  • 1
    Is `printf` fiber-safe? Maybe try substituting the low-level syscall `write`, which is definitely thread-safe. – Erik Eidt Feb 25 '22 at 00:35
  • 1
    Shouldn't everything be fiber-safe? It is essentially a single thread running sequential code. – wild-ptr Feb 25 '22 at 00:37
  • Didn't that final pushq leave the stack with only 8-byte alignment? – pm100 Feb 25 '22 at 00:53
  • 1
    `gdb` _can_ do asm level debug. `stepi` followed by `x/i $rip` helps. I usually create a command macro that does both. And, you can dump registers as well (`info registers`). You can also do (e.g.) `p/x $rsp`. – Craig Estey Feb 25 '22 at 00:54
  • I wonder if there's some c runtime initialization that's being skipped because it doesn't think you're calling any stream functions. Does it help if you put `printf("called from main");` in main? – David Wohlferd Feb 25 '22 at 01:56
  • @DavidWohlferd: Interesting guess, but not plausible for GCC on Linux. `printf` works even from hand-written asm. The compiler doesn't have to specially include anything; libc init functions will get run unless you write your own `_start` *and* statically link. glibc will get itself initialized even via dynamic-linker hooks so you can even `call printf` from `_start` if you link libc dynamically. (Like GCC always does.) – Peter Cordes Feb 25 '22 at 02:07
  • Your symptoms of faulting on `movaps` but not `movups` relative to the same `(%rsp)` point very strongly to RSP misalignment being the problem. We know RSP points to valid memory, although the aligned store is at a lower address so it's barely possible that you overflowed your 16-page stack and this is the first access to lower addresses in the stack frame. Check with a debugger to see if `x $rsp` is accessible, or just `p /x $rsp`. This looks exactly like [glibc scanf Segmentation faults when called from a function that doesn't align RSP](https://stackoverflow.com/q/51070716) – Peter Cordes Feb 25 '22 at 02:14
  • 2
    `sp = stack + stack_size - red_zone_abi;` doesn't really make sense. The red zone is *below* the current RSP, so there's no reason to leave 128 bytes unused at the top of your allocation. That's not your problem, but it's unnecessary. You could leave room for the `struct fiber_context` at the top of the stack allocation if you want to stash it in dynamically allocated storage so it can live beyond the current function, but that's a separate concern. – Peter Cordes Feb 25 '22 at 02:18
  • @PeterCordes So I thought, but I decided to follow the article since I figured it wouldn't do any harm and perhaps I had misunderstood the red zone concept. – wild-ptr Feb 25 '22 at 12:15
  • Ah, I hadn't looked at the article you linked in the question. Yeah, it has that mistake, too. Apparently the author didn't understand the red zone or stack alignment as well as they thought. (Or maybe they understood the concepts, but had a brain fart re: the red zone when writing the code. And for stack alignment didn't test and didn't think through the call vs. jump issue. I don't want to crap on the article without reading it, but that's two separate bugs in its code; I'd want to check for others before using, and double-check any surprising claims in the text.) – Peter Cordes Feb 25 '22 at 12:55

1 Answer

Agree with comments: your stack alignment is incorrect.

It is true that the stack must be aligned to 16 bytes. However, the question is when? The normal rule is that the stack pointer must be a multiple of 16 at the site of a call instruction that calls an ABI-compliant function.

Well, you don't use a call instruction, but what that really means is that on entry to an ABI-compliant function, the stack pointer must be 8 less than a multiple of 16, or in other words an odd multiple of 8, since it assumes it was called with a call instruction that pushed an 8-byte return address. That is just the opposite of what your code does, and so the stack is misaligned for the rest of your program, which makes printf crash when it tries to use aligned move instructions.

You could subtract 8 from the sp computed in your C code.
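As a rough sketch of that first option, the stack pointer computation could be wrapped in a small helper. The name `fiber_initial_sp` is hypothetical (it is not from the question or the linked article), and the sketch assumes the fiber is still entered through the existing push + ret sequence in restore_context, which leaves RSP unchanged on entry. It also follows the comments above in not reserving 128 bytes at the top, since the red zone lies below RSP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: pick the initial RSP for a new fiber stack.
 *
 * Assumption: the fiber entry point is reached with RSP equal to the
 * value returned here (true for push + ret, where the push and the pop
 * cancel out). The SysV x86-64 ABI expects RSP % 16 == 8 on function
 * entry, as if a call had just pushed an 8-byte return address.
 */
static char *fiber_initial_sp(char *stack, size_t stack_size)
{
    assert(stack != NULL && stack_size >= 16);

    /* One past the highest usable byte of the allocation. */
    uintptr_t top = (uintptr_t)stack + stack_size;

    /* Round down to a 16-byte boundary, whatever the allocator gave us. */
    top &= ~(uintptr_t)15;

    /* Mimic the return address slot that a call would have pushed. */
    return (char *)(top - 8);
}
```

In the question's main this would replace the `sp` computation, for example `c.rsp = fiber_initial_sp(stack, stack_size);`. If restore_context is changed to use an indirect call instead, the call itself pushes the 8 bytes, so the final subtraction should then be dropped.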

Or, I'm not really sure why you go to the trouble of loading the destination address into a register, then pushing and ret, when an indirect jump or call would do. (Unless you are deliberately trying to fool the indirect branch predictor?) An indirect call will also kill the stack-alignment bird, by pushing the return address (even though it will never be used). So you could leave the rest of your code alone, and replace all the r8/ret stuff in restore_context with just

callq *(8*0)(%rdi)
Nate Eldredge
  • Ohh, thank you! Now I can see that call pushes another 8 bytes onto the stack (the return address), which is what the new function expects to find at its start. I thought ret would automatically decrement the stack pointer by 8 and "consume" the return address, making the stack 16-aligned again, but it seems that is not the case, and the 8-byte push stays there, messing with stack alignment. – wild-ptr Feb 25 '22 at 09:39
  • 1
    @wild-ptr: `ret` **does** increment the stack pointer by 8 (not decrement, the stack grows down), but that's exactly the problem. After `movq 8*1(%rdi), %rsp` the stack pointer is a multiple of 16, since that's how you constructed `sp`. `push %r8` decrements by 8, and `ret` increments by 8. So on entry to `foo_fiber`, rsp is again a multiple of 16, which is wrong. Had you replaced the `ret` by `jmp`, it would have worked. – Nate Eldredge Feb 25 '22 at 15:07
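The arithmetic in this last comment can be checked with a small model. This is only a sketch: it tracks how each control-transfer sequence moves RSP and nothing else, and the enum and function names are made up for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Ways restore_context could transfer control to the fiber entry point. */
enum transfer {
    PUSH_RET,  /* pushq %r8 ; ret        (the question's code)       */
    PUSH_JMP,  /* pushq %r8 ; jmp *%r8   (ret replaced by jmp)       */
    CALL       /* callq *(8*0)(%rdi)     (the answer's suggestion)   */
};

/* RSP as seen by the first instruction of the target function. */
static uint64_t rsp_on_entry(uint64_t loaded_rsp, enum transfer how)
{
    uint64_t rsp = loaded_rsp;
    switch (how) {
    case PUSH_RET:
        rsp -= 8;  /* push decrements RSP by 8 */
        rsp += 8;  /* ret pops the address, incrementing RSP by 8 */
        break;
    case PUSH_JMP:
        rsp -= 8;  /* the pushed value stays on the stack */
        break;
    case CALL:
        rsp -= 8;  /* call pushes the return address */
        break;
    }
    return rsp;
}

/* SysV x86-64 expectation at function entry: RSP is 8 mod 16. */
static int entry_alignment_ok(uint64_t rsp)
{
    return rsp % 16 == 8;
}
```

With a loaded stack pointer that is a multiple of 16, as in the question, only the PUSH_RET case yields an entry RSP that fails `entry_alignment_ok`; the other two end up 8 below a 16-byte boundary, which is what `printf` and other ABI-compliant functions assume.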