Why stackless coroutines require dynamic allocation?

Question

This question is not about coroutines in C++20 but coroutines in general.

I'm learning C++20 coroutines these days. I've learnt about stackful and stackless coroutines from Coroutines Introduction. I've also SO-ed for more infomation.

Here's my understanding about stackless coroutines:

A stackless coroutine does has stack on the caller's stack when it's running.
When it suspends itself, as stackless coroutines can only suspend at the top-level function, its stack is predictable and useful data are stored in a certain area.
When it's not running, it doesn't have a stack. It's bound with a handle, by which the client can resume the coroutine.

The Coroutines TS specifies that the non-array operator new is called when allocating storage for coroutine frames. However, I think this is unnecessary, hence my question.

Some explanation/consideration:

Where to put the coroutine's status instead? In the handle, which originally stores the pointer.
Dynamic allocation doesn't mean storing on the heap. But my intent is to elide calls to operator new, no matter how it is implemented.
From cppreference:
The call to operator new can be optimized out (even if custom allocator is used) if
- The lifetime of the coroutine state is strictly nested within the lifetime of the caller, and
- the size of coroutine frame is known at the call site
For the first requirement, storing the state directly in the handle is still okay if the coroutine outlives the caller.

For the other, if the caller doesn't know the size, how can it compose the argument to call operator new? Actually, I can't even imagine in which situation the caller doesn't know the size.
Rust seems to have a different implementation, according to this question.

The coroutine store the stack of the outer function and also the state of CPU registers. — Oliv, Jul 18 '20 at 11:12
@Oliv The outer function? The whole stack or the stack pointer? The stack pointer is known when suspending from the top-level function. — jerry_fuyi, Jul 18 '20 at 12:17
"*Rust seems to have a different implementation, according to this question.*" Rust is also a different language. — Nicol Bolas, Jul 18 '20 at 13:38

score 7 · Answer 1 · answered Jul 18 '20 at 13:40

7

A stackless coroutine does has stack on the caller's stack when it's running.

That right there is the source of your misunderstanding.

Continuation-based coroutines (which is what a "stackless coroutine" is) is a coroutine mechanism that is designed for being able to provide a coroutine to some other code which will resume its execution after some asynchronous process completes. This resumption may take place in some other thread.

As such, the stack cannot be assumed to be "on the caller's stack", since the caller and the process that schedules the coroutine's resumption are not necessarily in the same thread. The coroutine needs to be able to outlive the caller, so the coroutine's stack cannot be on the caller's stack (in general. In certain co_yield-style cases, it can be).

The coroutine handle represents the coroutine's stack. So long as that handle exists, so too does the coroutine's stack.

When it's not running, it doesn't have a stack. It's bound with a handle, by which the client can resume the coroutine.

And how does this "handle" store all of the local variables for the coroutine? Obviously they are preserved (it'd be a bad coroutine mechanism if they weren't), so they have to be stored somewhere. The name given for where a function's local variables are is called the "stack".

Calling it a "handle" doesn't change what it is.

But my intent is to elide calls to operator new, no matter how it is implemented.

Well... you can't. If never calling new is a vital component of writing whatever software you're writing, then you can't use co_await-style coroutine continuations. There is no set of rules you can use that guarantees elision of new in coroutines. If you're using a specific compiler, you can do some tests to see what it elides and what it doesn't, but that's it.

The rules you cite are merely cases that make it possible to elide the call.

For the other, if the caller doesn't know the size, how can it compose the argument to call operator new?

Remember: co_await coroutines in C++ are effectively an implementation detail of a function. The caller has no idea if any function it calls is or is not a coroutine. All coroutines look like regular functions from the outside.

The code for creating a coroutine stack happens within the function call, not outside of it.

answered Jul 18 '20 at 13:40

Nicol Bolas

449,505
63
781
982

*"The code for creating a coroutine stack happens within the function call, not outside of it."* This seems to be the crux of the issue, right? The fundamental flaw seems to be that you can't identify a coroutine by merely looking at its declaration, which means the caller cannot know how much space to reserve for its frame. In principle, though, it should be possible for the caller to overprovision space, and for the callee to verify the space is large enough. The only question is... how do you make that happen if you can enforce single-threading. Seems like there isn't a way? – user541686 Dec 11 '20 at 00:14
@user541686: "*The fundamental flaw*" If by "flaw" you mean "core aspect of its design", then yes. Co_await coroutines are not meant to have their stacks created by the caller. They *can* in certain specific circumstances, but that's not the point of the feature. The primary usage pattern is for threaded scenarios where the receiver of the data is processing something else while the producer of it is working. In that scenario, the caller creating the call stack makes no sense. – Nicol Bolas Dec 11 '20 at 01:37
1

@user541686: `co_await` coroutines *could* be created on the caller's stack, but only under certain specific circumstances. This is only possible when the body of the coroutine is visible to the caller, and when the caller itself is the only one receiving data from the coroutine (ie: no threading). – Nicol Bolas Dec 11 '20 at 01:39
So I'm thinking you don't actually need the body of the coroutine to be visible to the caller, because the caller only needs to know 1 integer: how many bytes to reserve. In fact, that's not only unnecessary, but also potentially insufficient even in a single-threaded scenario, because the maximum number of coroutine allocations might still be unknown (they might not all be local variables, or they might be in VLAs). Would this approach I explained [below](https://stackoverflow.com/questions/62967789/why-stackless-coroutines-require-dynamic-allocation#comment115345529_63017635) work? – user541686 Dec 11 '20 at 01:54
1

@user541686: "*because the caller only needs to know 1 integer: how many bytes to reserve*" Which it can only know by compiling that function. Which requires its definition. And VLAs aren't legal in C++. Indeed, I don't think C++ has any mechanisms that requires a runtime stack size mechanism. – Nicol Bolas Dec 11 '20 at 02:46
@user541686: "*they might not all be local variables*" If it's not a local variable, then it's going to have to go on the heap anyway. – Nicol Bolas Dec 11 '20 at 02:47
(I meant it might be on some buffer somewhere but not literally a local variable of that type (think `aligned_storage`); it doesn't have to be on a heap. But anyway.) I think you didn't read my [comment below](https://stackoverflow.com/questions/62967789/why-stackless-coroutines-require-dynamic-allocation#comment115345529_63017635) that I referred to? Because I explained there how you can avoid needing to compile the definition a priori. – user541686 Dec 11 '20 at 03:27
@user541686: Yes, I read your comment. It would require *runtime* stack manipulation, which as I said is not a thing C++ does for any language feature. – Nicol Bolas Dec 11 '20 at 03:43
I mean it isn't done currently, but why couldn't it be? Mainstream compilers already support `alloca`; they just need to call it. More limited implementations that can't do that (for whatever reasons) can just use the heap like they already can anyway. I don't really see what the obstacle would be here—it enables this optimization without enforcing it on anyone, right? All it takes from a language standpoint is a keyword to mark a function as a coroutine at the declaration site, which is a pretty small price to pay. – user541686 Dec 11 '20 at 03:54
1

@user541686: "*but why couldn't it be?*" Because the basic design of `co_await` coroutines is *intended* for use in threaded circumstances. That they could be used for local generators is an optimization possibility, not the primary use case. As such, that optimization requires the compiler to know that the coroutine's implementation doesn't schedule its resumption elsewhere or any funny-business like that. And the standard doesn't even spell out what circumstances would trigger such optimizations; it's purely a quality-of-life thing. – Nicol Bolas Dec 11 '20 at 05:46
Nitpick: there's no requirement that the resumption isn't resumed on another thread for this elision to work, and in fact, there is a good use case for this (fork-and-join parallelism). In fact, a small experiment I ran shows that clang is able to elide the use of "operator new" even when I pass off the coroutine handle to an unknown external function at the suspend point. You can get this to happen by simply calling `destroy()` on the coroutine handle before you leave the function, as a hint to the compiler that you've performed the necessary synchronization. – rdb Dec 28 '22 at 02:40
@rdb: Why bother implementing it as a coroutine? If the calling function isn't getting back an object that they will use to interact with the coroutine's return value, if they're just saying "go do some async task"... why bother with having it be a coroutine? Just hand the function pointer off to the task manager and move on. – Nicol Bolas Dec 28 '22 at 02:43
@NicolBolas The calling function *does* want to wait for the coroutine to be done, that's what makes the lifetime of the coroutine bounded. You have a co_await at the join point that suspends the coroutine if the join isn't satisfied, then it's the calling function that can decide whether to ultimately turn that into a stall, or go off and do some other work, or also co_await because maybe it's also a coroutine. This usage does not require dynamic allocation, but does take advantage of the syntactic sugar that coroutines offer. – rdb Dec 28 '22 at 13:19

Yakk - Adam Nevraumont · Answer 2 · 2020-07-23T11:50:45.217

The fundamental difference between stackful and stackless coroutines is if the coroutine owns a full, theoretically unbounded stack (but practically bounded) like a thread does.

In a stackful coroutine, the local variables of the coroutine are stored on the stack it owns, like anything else, both during execution and when suspended.

In a stackless coroutine, the local variables to the coroutine can be in the stack while the coroutine is running or not. They are stored in a fixed sized buffer that the stackless coroutine owns.

In theory, a stackless coroutine can be stored on someone else's stack. There is, however, no way to guarantee within C++ code that this happens.

Elision of operator new in the creation of a coroutine is sort of about doing that. If your coroutine object is stored on someone's stack, and new was elided because there is enough room in the coroutine object itself for its state, then the stackless coroutine that lives completely on someone else's stack is possible.

There is no way to guarantee this in the current implementation of C++ coroutines. Attempts to get that in where met with resistance by compiler developers, because the exact minimal capture that a coroutine does happens "later" than the time they need to know how big the coroutine is in their compiler.

This leads to the difference in practice. A stackful coroutine acts more like a thread. You can call normal functions, and those normal functions can interact within their bodies with coroutine operations like suspend.

A stackless coroutine cannot call a function with then interacts with the coroutine machinery. Interacting with the coroutine machinery is only permitted within the stackless coroutine itself.

A stackful coroutine has all of the machinery of a thread without being scheduled on the OS. A stackless coroutine is an augmented function object that has goto labels in it that let it be resumed part way through its body.

There are theoretical implementations of stackless coroutines that don't have the "could call new" feature. The C++ standard doesn't require such a type of stackless coroutine.

Some people proposed them. Their proposals lost out to the current one, in part because the current one was far more polished and closer to being shipped than the alternative proposals where. Some of the syntax of the alternative proposals ended up in the successful proposal.

I believe there was a convincing argument that the "stricter" fixed size no-new coroutine implementations where not ruled out by the current proposal, and could be added on afterwards, and that helped kill the alternative proposals.

*"Because the exact minimal capture that a coroutine does happens "later" than the time they need to know how big the coroutine is in their compiler."* I'm not really convinced this is a blocker. If the compiler emits runtime metadata (like a vtable) for a coroutine indicating its frame size, then the caller would then be able to effectively look up the frame size and call `alloca()`, then pass the buffer to the coroutine. All the programmer needs is some mechanism for denoting a coroutine at the declaration site. This should allow allocation at the call site, right? — user541686, Dec 11 '20 at 00:24
(^Obviously this might not work in *every* scenario, like when the number of coroutine allocations itself is unknown, so you would still need the dynamic allocation fallback, but I think this would make it possible to avoid dynamic allocation in 'nice' cases?) — user541686, Dec 11 '20 at 00:27
@user541686 The general complaint about coroutines is that a programmer cannot statically guarantee "this coroutine fits in this memory right here, in this structure, booya". Most other C++ structures can do this, even if the size is unspecified by the standard and could vary between C++ implementations. It is true that compilers can optimize coroutines to not use dynamic allocation, but that really isn't quite the same. — Yakk - Adam Nevraumont, Dec 11 '20 at 04:00
But how does `alloca` vs. `operator new` change that story? Either way, the caller has no idea how big the structure is, but it's not like like anybody cares anyway, just like how nobody cares how big a vtable is. To be clear, I was thinking if you could say `[[co_routine]]` (or even `[[dynamic_frame]]`) before `generator iota(int n) { ... }`, then this would generate metadata (an integer) indicating the frame size, and it would permit (but not require) the caller to call `alloca` in the beginning with that size *in lieu of `operator new`*. I see upsides but... no downsides? — user541686, Dec 11 '20 at 04:14

Oliv · Answer 3 · 2020-07-18T13:28:06.133

3

Consider this hypothetical case:

void foo(int);

task coroutine() {
    int a[100] {};
    int * p = a;
    while (true) {
       co_await awaitable{};
       foo (*p);
       }
    }

p points to the first element of a, if between two resumptions, a's memory location changed, p would not hold the right address.

Memory for what would be the function stack must be allocated in such a way that it is conserved between a suspension and its following resumption. But this memory cannot be moved or copied if some objects refers to objects that are within this memory (or at least not without adding complexity). This is a reason why, sometime, compilers need to allocate this memory on the heap.

edited Jul 18 '20 at 13:28

answered Jul 18 '20 at 13:21

Oliv

17,610
1
29
72

You've proved that in this case the coroutine's stack can't be located on the caller's stack, but that makes me more confused: where is the coroutine's stack? (there must be because it may call other functions) If it's in the handle, it may overflow; if it's on the heap, what about stacklessness? And where else can it be? – jerry_fuyi Jul 18 '20 at 14:45
@jerry_fuyi: "*If it's in the handle, it may overflow*" The stacks for a specific function is not a dynamic property. The stack for the functions a coroutine calls are not the same stack as that of the coroutine itself. "*if it's on the heap, what about stacklessness*" The term "stackless" is merely to contrast with "stackful" coroutines, where the entire call stack from some point is preserved. A "stackless" coroutine only preserves the stack of the particular function that suspended. – Nicol Bolas Jul 18 '20 at 14:57
@jerry_fuyi The stack of the coroutine can grow on the resumer/caller stack. Only the part of the stack that must be preserved accross suspension -> resumption must be stored on the heap. I suppose it can be seen on compiler explorer, if you compile it, targeting x86-64, I will bet that local variable are not referenced relatively to rsp. – Oliv Jul 18 '20 at 15:46
@jerry_fuyi If you look at this [generated code](https://godbolt.org/z/8qsheY) you will see that the local array a, which must be preserved accross suspention/resumption is allocated on the heap (see line 13 and 25 of the generated assembly), the local array b, which does not need to be on the heap is also allocated on the heap (but that a missed optimization). On the other hand, you can see that `rsp` is not manipulated before the call to `foo`: the stack is the caller/resumer stack. – Oliv Jul 18 '20 at 17:44

score 0 · Answer 4 · answered Apr 08 '23 at 12:42

Don't confuse the stack of a coroutine with the state of a coroutine.

A stackfull coroutine hosts both, state and stack on a seperate frame allocated somewhere on the heap.

A stackless coroutine hosts its state in a frame on the heap but uses the stack of the resumer thread to push and pop values. If those values are significant for the state of the coroutine, the push and pop operations will directly influence the state fields in the frame, if not it will just use the stack for temporary processing. How does the compiler decide which operations influence the state of the coroutine? It does this in a process called coroutine transformation during compilation.

As you might guess, stackfull coroutines have one big drawback which is that you never really know in advance how much heap to allocate for the full stackframe (which is the case with every thread as well). But instead of bothering with this once per thread, you have to bother with this everytime you create a coroutine.

Why stackless coroutines require dynamic allocation?

4 Answers4