This answer only addresses the "is it fast" part of the question.
Return Address Prediction
First, a brief description of the behavior of a typical return-address predictor.
- Every time a `call` is made, the return address that is pushed on the actual stack is also stored inside a CPU structure called the return address buffer or something like that.
- When a `ret` (return) is made, the CPU assumes the destination will be the address currently at the top of the return address buffer, and that entry is "popped" from the return address buffer.
The effect is to perfectly1 predict `call`/`ret` pairs, as long as they occur in their usual properly nested pattern and each `ret` actually removes the unmodified return address pushed by the corresponding `call`. For more details you can start here.
Normal function calls in C or C++ (or pretty much any other language) will generally always follow this properly nested pattern2. So you don't need to do anything special to take advantage of the return prediction.
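For illustration, here is a minimal sketch of the normally nested case (the labels are just for illustration), with comments tracking the return address buffer:

caller:
call func        ; pushes "next" onto the real stack and onto the
                 ; return address buffer; buffer: [..., next]
next:
...              ; execution continues here after func returns

func:
...
ret              ; predicted target = top of buffer = next (correct);
                 ; "next" is popped, leaving buffer: [...]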
Failure Modes
In cases where `call`/`ret` aren't paired up normally, the predictions can fail in (at least) a couple of different ways:
- If the stack pointer or the return value on the stack is manipulated so that a `ret` doesn't return to the place that the corresponding `call` pushed, you'll get a branch target prediction failure for that `ret`, but subsequent normally nested `ret` instructions will continue to predict correctly as long as they are correctly nested. For example, if in the called function you add a few bytes to the value at `[rsp]` in order to skip over the instruction following the `call` in the calling function, the next `ret` will mispredict, but the `ret` that follows inside the calling function should be fine.
- On the other hand, if the `call` and `ret` instructions aren't properly nested, the whole return prediction buffer can become misaligned, causing future `ret` instructions, if any, that use the existing values to mispredict2.5. For example, if you `call` into a function, but then use `jmp` to return to the caller, there is a mismatched `call` without a `ret`. The `ret` inside the caller will mispredict, and so will the `ret` inside the caller of the caller, and so on, until all misaligned values are used up or overwritten3. A similar case would occur if you had a `ret` not matched with a corresponding `call` (and this case is important for the subsequent analysis).
Rather than applying the two rules above, you can also simply determine the behavior of the return predictor by tracing through the code and tracking what the return stack looks like at each point. Every time you have a `ret` instruction, see if it returns to the current top of the return stack - if not, you'll get a misprediction.
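As a concrete illustration of the second failure mode (and of the tracing approach), here is a sketch of a function that is entered with `call` but exits with a `jmp` (hypothetical labels), with comments tracking the return address buffer:

caller:
call func         ; buffer: [..., back]
back:
...
ret               ; actual target is caller's own return address, but the
                  ; buffer top is still the stale "back" -> mispredict;
                  ; returns further up the chain stay off-by-one until the
                  ; stale entries are consumed or overwritten

func:
...
pop r11           ; remove the return address from the real stack
jmp r11           ; return via jmp: "back" is never popped from the buffer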
Misprediction Cost
The actual cost of a misprediction depends on the surrounding code. A figure of ~20 cycles is commonly given and is often seen in practice, but the actual cost can be lower: e.g., as low as zero if the CPU is able to resolve the misprediction early and start fetching along the new path without interrupting the critical path, or higher: e.g., if the branch prediction failure takes a long time to resolve and reduces the effective parallelism of long-latency operations. Regardless, we can say that the penalty is usually significant when it occurs in an operation that otherwise takes only a handful of instructions.
Fast Coroutines
Existing Behavior for Coresume and Coyield
The existing `_yield` (context switch) function swaps the stack pointer `rsp` and then uses `ret` to return to a different location than what the caller actually pushed (in particular, it returns to a location that was pushed onto the caller's stack when the caller called yield earlier). This will generally cause a misprediction at the `ret` inside `_yield`.
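The `_yield` from the question isn't reproduced here, but based on that description a stack-switching function of this kind presumably looks something like the following sketch (the register saving/restoring and the exact context layout are assumptions):

; rcx - current context, rdx - context to switch to (assumed convention)
_yield:
; save non-volatile regs into the current context (not shown)
; load non-volatile regs from the destination context (not shown)
mov [rcx], rsp     ; save the current stack pointer
mov rsp, [rdx]     ; switch to the other coroutine's stack
ret                ; "returns" to whatever address is on top of the new
                   ; stack - generally not what the return buffer predicts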
For example, consider the case where some function `A0` makes a normal function call to `A1`, which in turn calls `coresume`4 to resume a coroutine `B1`, which later calls `coyield` to yield back to `A1`. Inside the call to `coresume`, the return stack looks like `A0, A1`, but then `coresume` swaps `rsp` to point to the stack for `B1`, and the top value of that stack is an address inside `B1` immediately following `coyield` in the code for `B1`. The `ret` inside `coresume` hence jumps to a point in `B1`, and not to a point in `A1` as the return stack expects. Hence you get a misprediction on that `ret`, and the return stack now looks like `A0`.
Now consider what happens when `B1` calls `coyield`, which is implemented in basically the same way as `coresume`: the call to `coyield` pushes `B1` on the return stack, which now looks like `A0, B1`, and then swaps the stack to point to the `A1` stack and then does the `ret`, which will return to `A1`. So the `ret` misprediction will happen in the same way, and the stack is left as `A0`.
So the bad news is that a tight series of calls to `coresume` and `coyield` (as is typical with a yield-based iterator, for example) will mispredict each time. The good news is that now inside `A1` at least the return stack is correct (not misaligned) - if `A1` returns to its caller `A0`, the return is correctly predicted (and so on when `A0` returns to its caller, etc). So you suffer a mispredict penalty each time, but at least you don't misalign the return stack in this scenario. The relative importance of this depends on how often you are calling `coresume`/`coyield` versus calling functions normally below the function that is calling `coresume`.
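Here is a compact trace of that example, showing the return address buffer after each step:

; A0 calls A1                        buffer: [A0]
; A1 calls coresume                  buffer: [A0, A1]
; ret inside coresume jumps into B1  mispredict, pops A1 -> [A0]
; B1 calls coyield                   buffer: [A0, B1]
; ret inside coyield jumps into A1   mispredict, pops B1 -> [A0]
; A1 eventually returns to A0        correctly predicted from [A0]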
Making It Fast
So can we fix the misprediction? Unfortunately, it's tricky in the combination of C and external ASM calls, because making the call to `coresume` or `coyield` implies a call inserted by the compiler, and it's hard to unwind this in the asm.
Still, let's try.
Use Indirect Calls
One approach is to get rid of using `ret` at all and just use indirect jumps.
That is, just replace the `ret` at the end of your `coresume` and `coyield` calls with:
pop r11
jmp r11
This is functionally equivalent to `ret`, but affects the return stack buffer differently (in particular, it doesn't affect it).
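Put together with the stack swap, a `_yield` variant built on this idea might look like the following sketch (same assumed context layout as the earlier sketch; register save/restore omitted):

_yield_jmp:
mov [rcx], rsp     ; save the current stack pointer
mov rsp, [rdx]     ; switch to the other coroutine's stack
pop r11            ; pop the resume address off the new stack
jmp r11            ; jump there; the return address buffer is untouched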
If we analyze the repeated sequence of `coresume` and `coyield` calls as above, we get the result that the return stack buffer just starts growing indefinitely like `A0, A1, B1, A1, B1, ...`. This occurs because in fact we aren't using the `ret` at all in this implementation. So we don't suffer return mispredictions, because we aren't using `ret`! Instead, we rely on the accuracy of the indirect branch predictor to predict `jmp r11`.
How that predictor works depends on how `coresume` and `coyield` are implemented. If they both call a shared `_yield` function that isn't inlined, there is only a single `jmp r11` location and this `jmp` will alternately go to a location in `A1` and `B1`. Most modern indirect predictors will repredict this simple repeating pattern fine, although older ones which only tracked a single location will not. If `_yield` got inlined into `coresume` and `coyield`, or you just copy-pasted the code into each function, there are two distinct `jmp r11` sites, each of which only sees a single target location, and these should be well-predicted by any CPU with an indirect branch predictor6.
So this should generally predict a series of tight `coyield` and `coresume` calls well7, but at the cost of obliterating the return buffer, so when `A1` decides to return to `A0`, this will be mispredicted, as will subsequent returns by `A0` and so on. The size of this penalty is bounded above by the size of the return stack buffer, so if you are making many tight `coresume`/`coyield` calls this may be a good tradeoff.
That's the best I can think of within the constraint of external calls to functions written in ASM, because you already have an implied `call` for your coroutines, and you have to make the jump to the other coroutine from inside there, and I can't see how to keep the stacks balanced and return to the correct location with those constraints.
Inlined Code at the Call Site
If you can inline code at the call-site of your coroutine methods (e.g., with compiler support or inline asm), then you can perhaps do better.
The call to `coresume` could be inlined as something like this (I've omitted the register saving and restoring code because that's straightforward):
; rcx - current context
; rdx - context for coroutine we are about to resume
; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [rsp - 8]
mov [rcx + 64], r11 ; save current stack pointer
mov r11, [rdx + 64] ; load dest stack pointer
call [r11]
Note that `coresume` doesn't actually do the stack swap - it just loads the destination stack pointer into `r11` and then does a `call` against `[r11]` to jump to the coroutine. This is necessary so that the `call` correctly pushes the location we should return to onto the stack of the caller.
Then, `coyield` would look something like this (inlined into the calling function):
; save current non-volatile regs (not shown)
; load non-volatile regs for dest (not shown)
lea r11, [after_ret]
push r11 ; save the return point on the stack
mov rsp, [rdx + 64] ; load the destination stack
ret
after_ret:
mov rsp, r11
When a `coresume` call jumps to the coroutine, it ends up at `after_ret`, and before executing the user code the `mov rsp, r11` instruction swaps to the proper stack for the coroutine, which has been stashed in `r11` by `coresume`.
So essentially `coyield` has two parts: the top half, executed before the yield (which occurs at the `ret`), and the bottom half, which completes the work started by `coresume`. This allows you to use `call` as the mechanism to do the `coresume` jump and `ret` to do the `coyield` jump. The `call`/`ret` pair is balanced in this case.
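To see why, it may help to trace one resume/yield round trip through the snippets above (comments only; retA stands for the return point pushed in A):

; in A (inlined coresume):
;   call [r11]          ; pushes retA onto A's stack; buffer: [..., retA]
; in B (bottom half of the inlined coyield, at after_ret):
;   mov rsp, r11        ; B now runs its user code on its own stack
; later in B (top half of the inlined coyield):
;   push r11            ; push after_ret as B's next resume point
;   mov rsp, [rdx + 64] ; switch to the destination (A's) stack
;   ret                 ; pops retA from A's stack; the buffer top is also
;                       ; retA, so this return is correctly predicted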
I've glossed over some details of this approach: for example, since there is no function call involved, the ABI-specified non-volatile registers aren't really special: in the case of inline assembly you'll need to indicate to the compiler which registers you will clobber and save the rest, but you can choose whatever set is convenient for you. Choosing a larger set of clobbered registers makes the `coresume`/`coyield` code sequences themselves shorter, but potentially puts more register pressure on the surrounding code and may force the compiler to spill more in the surrounding code. Perhaps the ideal is just to declare everything clobbered and then the compiler will just spill what it needs.
1 Of course, there are limitations in practice: the size of the return stack buffer is likely limited to some small number (e.g., 16 or 24) so once the depth of the call stack exceeds that, some return addresses are lost and won't be correctly predicted. Also, various events like a context switch or interrupt are likely to mess up the return-stack predictor.
2 An interesting exception was a common pattern for reading the current instruction pointer in x86 (32-bit) code: there is no instruction to do this directly, so instead a `call next; next: pop eax` sequence can be used: a `call` to the next instruction which serves only to push the address on the stack, which is then popped off. There is no corresponding `ret`. Current CPUs actually recognize this pattern, however, and don't unbalance the return-address predictor in this special case.
2.5 How many mispredictions this implies depends on how many net returns the calling function does: if it immediately starts calling down another deep chain of calls, the misaligned return stack entries may never be used at all, for example.
3 Or, perhaps, until the return address stack is re-aligned by a `ret` without a corresponding `call`, a case of "two wrongs make a right".
4 You haven't actually shown how `coyield` and `coresume` actually call `_yield`, so for the rest of the question I'll assume that they are implemented essentially as `_yield` is, directly within `coyield` or `coresume` without calling `_yield`: i.e., copy and paste the `_yield` code into each function, possibly with some small edits to account for the difference. You can also make this work by calling `_yield`, but then you have an additional layer of calls and rets that complicates the analysis.
5 To the extent these terms even make sense in a symmetric coroutine implementation, since there is in fact no absolute notion of caller and callee in that case.
6 Of course, this analysis applies only to the simple case where you have a single `coresume` call calling into a coroutine with a single `coyield` call. More complex scenarios are possible, such as multiple `coyield` calls inside the callee, or multiple `coresume` calls inside the caller (possibly to different coroutines). However, the same pattern applies: the case with split `jmp r11` sites will present a simpler stream than the combined case (possibly at the cost of more iBTB resources).
7 One exception would be the first call or two: the `ret` predictor needs no "warmup" but the indirect branch predictor may, especially when another coroutine has been called in the interim.