This answer only addresses the "is it fast" part of the question.
Return Address Prediction
First, a brief description of the behavior of a typical return-address predictor.
- Every time a `call` is made, the return address that is pushed on the actual stack is also stored inside a CPU structure called the return address buffer or something like that.
- When a `ret` (return) is made, the CPU assumes the destination will be the address currently at the top of the return address buffer, and that entry is "popped" from the return address buffer.
The effect is to perfectly1 predict `call`/`ret` pairs, as long as they occur in their usual properly nested pattern and each `ret` actually removes the unmodified return address pushed by the corresponding `call`. For more details you can start here.
Normal function calls in C or C++ (or pretty much any other language) will generally always follow this properly nested pattern2. So you don't need to do anything special to take advantage of the return prediction.
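For illustration, here is a minimal sketch of the normally nested case (the labels are just for illustration), with comments tracking the return address buffer:

caller:
call func        ; pushes "next" onto the real stack and onto the
                 ; return address buffer; buffer: [..., next]
next:
...              ; execution continues here after func returns

func:
...
ret              ; predicted target = top of buffer = next (correct);
                 ; "next" is popped, leaving buffer: [...]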
Failure Modes
In cases where `call`/`ret` aren't paired up normally, the predictions can fail in (at least) a couple of different ways:
- If the stack pointer or the return value on the stack is manipulated so that a `ret` doesn't return to the place that the corresponding `call` pushed, you'll get a branch target prediction failure for that `ret`, but subsequent normally nested `ret` instructions will continue to predict correctly as long as they are correctly nested. For example, if in the called function you add a few bytes to the value at `[rsp]` in order to skip over the instruction following the `call` in the calling function, the next `ret` will mispredict, but the `ret` that follows inside the calling function should be fine.
- On the other hand, if the `call` and `ret` instructions aren't properly nested, the whole return prediction buffer can become misaligned, causing future `ret` instructions, if any, that use the existing values to mispredict2.5. For example, if you `call` into a function, but then use `jmp` to return to the caller, there is a mismatched `call` without a `ret`. The `ret` inside the caller will mispredict, and so will the `ret` inside the caller of the caller, and so on, until all misaligned values are used up or overwritten3. A similar case would occur if you had a `ret` not matched with a corresponding `call` (and this case is important for the subsequent analysis).
Rather than applying the two rules above, you can also simply determine the behavior of the return predictor by tracing through the code and tracking what the return stack looks like at each point. Every time you have a `ret` instruction, see if it returns to the current top of the return stack - if not, you'll get a misprediction.
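As a concrete illustration of the second failure mode (and of the tracing approach), here is a sketch of a function that is entered with `call` but exits with a `jmp` (hypothetical labels), with comments tracking the return address buffer:

caller:
call func         ; buffer: [..., back]
back:
...
ret               ; actual target is caller's own return address, but the
                  ; buffer top is still the stale "back" -> mispredict;
                  ; returns further up the chain stay off-by-one until the
                  ; stale entries are consumed or overwritten

func:
...
pop r11           ; remove the return address from the real stack
jmp r11           ; return via jmp: "back" is never popped from the buffer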
Misprediction Cost
The actual cost of a misprediction depends on the surrounding code. A figure of ~20 cycles is commonly given and is often seen in practice, but the actual cost can be lower: e.g., as low as zero if the CPU is able to resolve the misprediction early and start fetching along the new path without interrupting the critical path, or higher: e.g., if the branch prediction failure takes a long time to resolve and reduces the effective parallelism of long-latency operations. Regardless, we can say that the penalty is usually significant when it occurs in an operation that otherwise takes only a handful of instructions.
Fast Coroutines
Existing Behavior for Coresume and Coyield
The existing `_yield` (context switch) function swaps the stack pointer `rsp` and then uses `ret` to return to a different location than what the caller actually pushed (in particular, it returns to a location that was pushed onto the caller's stack when the caller called yield earlier). This will generally cause a misprediction at the `ret` inside `_yield`.
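The `_yield` from the question isn't reproduced here, but based on that description a stack-switching function of this kind presumably looks something like the following sketch (the register saving/restoring and the exact context layout are assumptions):

; rcx - current context, rdx - context to switch to (assumed convention)
_yield:
; save non-volatile regs into the current context (not shown)
; load non-volatile regs from the destination context (not shown)
mov [rcx], rsp     ; save the current stack pointer
mov rsp, [rdx]     ; switch to the other coroutine's stack
ret                ; "returns" to whatever address is on top of the new
                   ; stack - generally not what the return buffer predicts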
For example, consider the case where some function `A0` makes a normal function call to `A1`, which in turn calls `coresume`4 to resume a coroutine `B1`, which later calls `coyield` to yield back to `A1`. Inside the call to `coresume`, the return stack looks like `A0, A1`, but then `coresume` swaps `rsp` to point to the stack for `B1`, and the top value of that stack is an address inside `B1` immediately following `coyield` in the code for `B1`. The `ret` inside `coresume` hence jumps to a point in `B1`, and not to a point in `A1` as the return stack expects. Hence you get a misprediction on that `ret`, and the return stack now looks like `A0`.
Now consider what happens when `B1` calls `coyield`, which is implemented in basically the same way as `coresume`: the call to `coyield` pushes `B1` on the return stack, which now looks like `A0, B1`, and then swaps the stack to point to the `A1` stack and then does the `ret`, which will return to `A1`. So the `ret` misprediction will happen in the same way, and the stack is left as `A0`.
So the bad news is that a tight series of calls to `coresume` and `coyield` (as is typical with a yield-based iterator, for example) will mispredict each time. The good news is that now inside `A1` at least the return stack is correct (not misaligned) - if `A1` returns to its caller `A0`, the return is correctly predicted (and so on when `A0` returns to its caller, etc). So you suffer a mispredict penalty each time, but at least you don't misalign the return stack in this scenario. The relative importance of this depends on how often you are calling `coresume`/`coyield` versus calling functions normally below the function that is calling `coresume`.
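Here is a compact trace of that example, showing the return address buffer after each step:

; A0 calls A1                        buffer: [A0]
; A1 calls coresume                  buffer: [A0, A1]
; ret inside coresume jumps into B1  mispredict, pops A1 -> [A0]
; B1 calls coyield                   buffer: [A0, B1]
; ret inside coyield jumps into A1   mispredict, pops B1 -> [A0]
; A1 eventually returns to A0        correctly predicted from [A0]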
Making It Fast
So can we fix the misprediction? Unfortunately, it's tricky in the combination of C and external ASM calls, because making the call to `coresume` or `coyield` implies a call inserted by the compiler, and it's hard to unwind this in the asm.
Still, let's try.
Use Indirect Calls
One approach is to get rid of using `ret` at all and just use indirect jumps.
That is, just replace the `ret` at the end of your `coresume` and `coyield` calls with:
pop r11
jmp r11
This is functionally equivalent to `ret`, but affects the return stack buffer differently (in particular, it doesn't affect it).
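Put together with the stack swap, a `_yield` variant built on this idea might look like the following sketch (same assumed context layout as the earlier sketch; register save/restore omitted):

_yield_jmp:
mov [rcx], rsp     ; save the current stack pointer
mov rsp, [rdx]     ; switch to the other coroutine's stack
pop r11            ; pop the resume address off the new stack
jmp r11            ; jump there; the return address buffer is untouched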
If we analyze the repeated sequence of `coresume` and `coyield` calls as above, we get the result that the return stack buffer just starts growing indefinitely like `A0, A1, B1, A1, B1, ...`. This occurs because in fact we aren't using the `ret` at all in this implementation. So we don't suffer return mispredictions, because we aren't using `ret`! Instead, we rely on the accuracy of the indirect branch predictor to predict `jmp r11`.
How that predictor works depends on how `coresume` and `coyield` are implemented. If they both call a shared `_yield` function that isn't inlined, there is only a single `jmp r11` location and this `jmp` will alternately go to a location in `A1` and `B1`. Most modern indirect predictors will repredict this simple repeating pattern fine, although older ones which only tracked a single location will not. If `_yield` got inlined into `coresume` and `coyield`, or you just copy-pasted the code into each function, there are two distinct `jmp r11` sites, each of which only sees a single target location, and these should be well-predicted by any CPU with an indirect branch predictor6.
So this should generally predict a series of tight `coyield` and `coresume` calls well7, but at the cost of obliterating the return buffer, so when `A1` decides to return to `A0`, this will be mispredicted, as will subsequent returns by `A0` and so on. The size of this penalty is bounded above by the size of the return stack buffer, so if you are making many tight `coresume`/`coyield` calls this may be a good tradeoff.
That's the best I can think of within the constraint of external calls to functions written in ASM, because you already have an implied `call` for your coroutines, and you have to make the jump to the other coroutine from inside there, and I can't see how to keep the stacks balanced and return to the correct location with those constraints.
Inlined Code at the Call Site
If you can inline code at the call-site of your coroutine methods (e.g., with compiler support or inline asm), then you can perhaps do better.
The call to `coresume` could be inlined as something like this (I've omitted the register saving and restoring code because that's straightforward):
; rcx - current context
; rdx - context for coroutine we are about to resume
; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [rsp - 8]
mov [rcx + 64], r11 ; save current stack pointer
mov r11, [rdx + 64] ; load dest stack pointer
call [r11]
Note that `coresume` doesn't actually do the stack swap - it just loads the destination stack pointer into `r11` and then does a `call` against `[r11]` to jump to the coroutine. This is necessary so that the `call` correctly pushes the location we should return to onto the stack of the caller.
Then, `coyield` would look something like this (inlined into the calling function):
; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [after_ret]
push r11 ; save the return point on the stack
mov rsp, [rdx + 64] ; load the destination stack
ret
after_ret:
mov rsp, r11
When a `coresume` call jumps to the coroutine, it ends up at `after_ret`, and before executing the user code the `mov rsp, r11` instruction swaps to the proper stack for the coroutine, which has been stashed in `r11` by `coresume`.
So essentially `coyield` has two parts: the top half, executed before the yield (which occurs at the `ret`), and the bottom half, which completes the work started by `coresume`. This allows you to use `call` as the mechanism to do the `coresume` jump and `ret` to do the `coyield` jump. The `call`/`ret` pair is balanced in this case.
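To see why, it may help to trace one resume/yield round trip through the snippets above (comments only; retA stands for the return point pushed in A):

; in A (inlined coresume):
;   call [r11]          ; pushes retA onto A's stack; buffer: [..., retA]
; in B (bottom half of the inlined coyield, at after_ret):
;   mov rsp, r11        ; B now runs its user code on its own stack
; later in B (top half of the inlined coyield):
;   push r11            ; push after_ret as B's next resume point
;   mov rsp, [rdx + 64] ; switch to the destination (A's) stack
;   ret                 ; pops retA from A's stack; the buffer top is also
;                       ; retA, so this return is correctly predicted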
I've glossed over some details of this approach: for example, since there is no function call involved, the ABI-specified non-volatile registers aren't really special: in the case of inline assembly you'll need to indicate to the compiler which registers you will clobber and save the rest, but you can choose whatever set is convenient for you. Choosing a larger set of clobbered registers makes the `coresume`/`coyield` code sequences themselves shorter, but potentially puts more register pressure on the surrounding code and may force the compiler to spill more in the surrounding code. Perhaps the ideal is just to declare everything clobbered and then the compiler will just spill what it needs.
1 Of course, there are limitations in practice: the size of the return stack buffer is likely limited to some small number (e.g., 16 or 24) so once the depth of the call stack exceeds that, some return addresses are lost and won't be correctly predicted. Also, various events like a context switch or interrupt are likely to mess up the return-stack predictor.
2 An interesting exception was a common pattern for reading the current instruction pointer in x86 (32-bit) code: there is no instruction to do this directly, so instead a `call next; next: pop eax` sequence can be used: a `call` to the next instruction which serves only to push the address on the stack, which is then popped off. There is no corresponding `ret`. Current CPUs actually recognize this pattern, however, and don't unbalance the return-address predictor in this special case.
2.5 How many mispredictions this implies depends on how many net returns the calling function does: if it immediately starts calling down another deep chain of calls, the misaligned return stack entries may never be used at all, for example.
3 Or, perhaps, until the return address stack is re-aligned by a `ret` without a corresponding `call`, a case of "two wrongs make a right".
4 You haven't actually shown how `coyield` and `coresume` actually call `_yield`, so for the rest of the question I'll assume that they are implemented essentially as `_yield` is, directly within `coyield` or `coresume` without calling `_yield`: i.e., copy and paste the `_yield` code into each function, possibly with some small edits to account for the difference. You can also make this work by calling `_yield`, but then you have an additional layer of calls and rets that complicates the analysis.
5 To the extent these terms even make sense in a symmetric coroutine implementation, since there is in fact no absolute notion of caller and callee in that case.
6 Of course, this analysis applies only to the simple case where you have a single `coresume` call calling into a coroutine with a single `coyield` call. More complex scenarios are possible, such as multiple `coyield` calls inside the callee, or multiple `coresume` calls inside the caller (possibly to different coroutines). However, the same pattern applies: the case with split `jmp r11` sites will present a simpler stream than the combined case (possibly at the cost of more iBTB resources).
7 One exception would be the first call or two: the `ret` predictor needs no "warmup" but the indirect branch predictor may, especially when another coroutine has been called in the interim.