A Coroutine Model for C++
This is Draft Version 0.9. (May 05 2006)
This article will be updated with future design changes.


From Wikipedia's coroutine article:

"In computer science, coroutines are program components that generalize subroutines to allow multiple entry points and suspending and resuming of execution at certain locations"

While some high level languages have native support for coroutines, most languages in wide use today have neither language nor standard library support. C++ is one of them. This article presents a pure coroutine library design that needs no language support.

Unfortunately the kind of coroutines implementable as a library in C++ is not a mathematical superset of subroutines (that is, in general subroutines are not coroutines), as it is in other languages. That would require language and compiler support, which is probably undesirable because it would considerably slow down normal subroutines.

A theoretical description of coroutines is beyond the scope of this article (refer to the Wikipedia article for that).

Be advised that the definition of coroutines and continuations used here might not exactly match the definition given in other sources.

Stylistic Notes

Examples are shown as pseudocode or as actual C++ code (some examples use both styles). The pseudocode is written in a non-existent typeless language. Variables need not be declared before use and some operations are written in plain English. Pseudocode looks like this:

function some_function(first_argument, second_argument, ... nth_argument) {
    /* function body here */
    some_variable = some_value
    another_function(parm)   /* invokes function another_function with argument parm */
    while(true) {
        do something         /* plain English body */
    }
    ...                      /* some code has been omitted here */
    return value
}

C++ code is written almost as real code, except that functions and objects are never qualified, functions can take a variable number of arguments, and code is often omitted. Also note that the examples make heavy use of std::tr1::bind and std::tr1::tuple. C++ code looks like this:

result_type some_function(parm_type_1 parm1, parm_type_2 parm2, ..., parm_type_n parmn)
{
    ...
}

Basic Coroutine Operations

The three basic coroutine operations are creation, invocation and yielding.
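As a hedged illustration of all three, the following sketch uses the coroutine<> template proposed later in this article; the self type and its yield member are hypothetical names for the proposed interface, not an existing library.

typedef coroutine<int(int)> coroutine_type;

// Hypothetical interface: the body receives a self reference as its first
// argument and may suspend itself through it.
int body(coroutine_type::self& self, int value) {
    // First entry point: runs on the first invocation.
    value = self.yield(value * 2);  // suspend, handing a result to the caller
    // Second entry point: execution resumes here on the next invocation.
    return value * 4;               // returning terminates the coroutine
}

coroutine_type coro(body);          // creation: bind a body, allocate a stack
int a = coro(10);                   // invocation: a == 20, coro is suspended
int b = coro(a);                    // resumes at the yield point: b == 80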

Advanced Coroutine Operations
Beyond the three basic coroutine operations, there are some additional operations that are not strictly necessary, but can be useful in practice.
Other Issues
Other operations on coroutines are possible, but it is not yet clear whether there is really a need for them. There are also a number of open issues that are not resolved at this point. Their resolution requires further thought, possibly at implementation time.
Coroutine Applications
We will show here some general applications of coroutines and how each can be implemented in C++.
Coroutines Or Threads?

It is easy to see that coroutines can be used to simulate a cooperative threading environment. Each coroutine represents a thread. When a coroutine no longer has work to do (or is waiting for some operation to finish), it relinquishes control to the scheduler, which in turn yields to the next ready coroutine.

This model works so well that many threading libraries are in fact implemented this way. Unfortunately this has led many to see coroutines merely as a poor version of threads. This is a common misconception, and it is only aggravated by the fact that many existing coroutine libraries try to simulate a thread-like API, concentrating more on the thread aspect than on the coroutine aspect.

In fact coroutines and threads are orthogonal concepts. Real preemptive threads can preempt each other, run concurrently, and have fixed or dynamic priorities. On the other hand, threads require proper locking of shared data, which can lead to deadlock if done incorrectly; it is very hard to predict which thread will run next and when; context switches are relatively expensive; and scheduling is complex.

Coroutines instead have a strict sequential ordering dictated by the yield points, can never preempt each other (they are a form of cooperative multitasking), do not require locking of shared data, and make it easier to avoid deadlock (it can be demonstrated that a synchronous rendezvous implemented with yield can never lead to a deadlock); they provide extremely lightweight context switches and simple scheduling (if there is any scheduling at all). It should be clear by now that coroutines are strong where threads are weak and vice versa.

The following points illustrate common uses of preemptive threads:

- exploiting multiple processors to run computations in parallel;
- running long computations in the background while the application stays responsive;
- waiting for slow operations (such as I/O) to complete.

While the first two use cases are clearly the realm of threads, the third is where coroutines really shine, as they are well suited to implementing state machines. In this context they are used similarly to the way threads are used (i.e. to have a context that waits for an operation to complete), but without all the problems that derive from the use of threads. The next section will expand the coroutine model to deal better with asynchronous operations.

The Event Driven Model

We have said that coroutines can be used as a better substitute for threads to handle blocking operations. We will now show how this can be done by developing a complete coroutine-based design for event driven applications.

For every task that must be handled in an event driven application, there is an associated state machine. At any given time a task is in a specific state, waiting for one or more events. When an event arrives, the receiving task switches to another state, determined as a function of the current state and of the event received. Switching state might also produce an output. Usually this state machine is implemented by storing the current state as a set of task data plus a pointer to the function to be invoked to process the incoming event. The function may change the task data and set the function pointer to the function that will handle the next state. A sketch of this encoding follows.
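The following minimal sketch shows such a hand-written state machine; the event type and the handler names are illustrative, not part of any real API.

#include <cstddef>

struct event { std::size_t size; bool last; };  // illustrative event type

struct task {
    void (task::*state)(event const&);  // current state = handler to invoke next
    std::size_t bytes_seen;             // task data

    task() : state(&task::on_header), bytes_seen(0) {}

    // Each handler may change the task data and select the next handler.
    void on_header(event const&) { bytes_seen = 0; state = &task::on_body; }
    void on_body(event const& e) {
        bytes_seen += e.size;
        if (e.last) state = &task::on_header;
    }

    void handle(event const& e) { (this->*state)(e); }  // process one event
};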

Splitting the handling of a task (i.e. all its states) across multiple unrelated functions is undesirable because control is passed from function to function in a way that is hard to reason about. Badly implemented state machines easily become a maintenance nightmare. Coroutines allow us to keep a task's execution flow in one place and greatly simplify state machine implementation.

With coroutines, the task data is stored inside the associated coroutine stack. When an event arrives, the receiving task's coroutine is resumed; it processes the event, possibly changing the task data, and finally yields. The yield point itself holds the current state.

Coroutine Scheduler

The scheduler is the part of the design responsible for running the low level parts of the state machines: it receives events from the operating system, matches them to the coroutines that are waiting for them, adds these coroutines to a FIFO queue of ready tasks and finally calls the first coroutine in the queue (performing the invoke operation). The currently executing coroutine will eventually return to the scheduler by executing a return (which terminates the task) or with a yield. Note that there is no requirement that this be the same coroutine that was invoked by the scheduler, as it could have yielded to another one.

When the coroutine has returned, the scheduler will check whether it has terminated, is waiting for an event, or has simply yielded. In the first case it will simply delete the coroutine. In the second case it will add it to the list of coroutines waiting for events. In the third case the coroutine will simply be added to the back of the FIFO queue.

This solution favors active tasks over waiters: as long as the FIFO queue is not empty, the scheduler will be executing a coroutine and will not receive any events from the operating system; thus a coroutine that never blocks will starve all blocked coroutines. Another option is to receive events from the system before invoking a coroutine for the second time (this can be done cleanly by delegating event reception to another coroutine that is always ready and thus always in the FIFO queue). Which option is more appropriate depends on the system and should be a parameter of the scheduler object.
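The following is a hedged sketch of this loop, implementing the first policy (events are pumped only when the ready queue is empty); coroutine_ptr, wait_set, event, wait_for_os_event and destroy are hypothetical placeholders for implementation details.

// Sketch only: coroutine_ptr, wait_set, event and wait_for_os_event are
// hypothetical names standing in for the machinery described above.
void scheduler::run() {
    for (;;) {
        if (ready.empty()) {                   // ready is the FIFO queue
            event e = wait_for_os_event();     // blocks until an event arrives
            ready.push_back(waiting.match(e)); // move the waiter to the queue
        }
        coroutine_ptr c = ready.front();
        ready.pop_front();
        c->invoke(*this);                      // returns on yield or termination
        if (c->terminated())
            destroy(c);                        // first case: delete the coroutine
        else if (c->is_waiting())
            waiting.add(c);                    // second case: blocked on an event
        else
            ready.push_back(c);                // third case: plain yield, requeue
    }
}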

From the point of view of the scheduler all tasks are equal, and so are the coroutines that implement them. This means that all coroutines scheduled by the scheduler must have the same signature: coroutine<void(scheduler&)>.

Starting an Asynchronous Operation

A coroutine that wants to start an asynchronous operation needs to add itself to the wait queue of that operation, mark itself as waiting, and yield back to the scheduler. This requires interaction with the scheduler, and its semantics must be clearly defined.

First of all, we define an asynchronous operation as a function object that takes as a parameter another function object (the callback) and has the following semantics: upon invocation it starts an associated asynchronous operation. When the operation completes (successfully or not), it invokes the callback, passing the result of the operation to it. It is guaranteed that the callback is not called at the time of the async function call, but only during an explicit call to an (unspecified and operation specific) synchronization function. Note that this definition of async function is not specific to this model; it is very general and, in various forms, is already used by many asynchronous APIs (e.g. POSIX asynchronous I/O, Boost.Asio).
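The following self contained sketch illustrates the convention; event_loop, poll and async_read are hypothetical names used only to show the calling pattern (std::function and std::bind stand in for their tr1 equivalents).

#include <cstddef>
#include <functional>
#include <queue>

// Hypothetical event loop; poll() plays the role of the operation specific
// synchronization function mentioned above.
struct event_loop {
    std::queue<std::function<void()> > completions;
    void poll() {
        while (!completions.empty()) {
            completions.front()();  // callbacks run here, never inside async_read
            completions.pop();
        }
    }
};

// An async function in the sense defined above: it starts the operation,
// registers the callback and returns immediately.
void async_read(event_loop& loop,
                std::function<void(std::size_t, bool)> callback) {
    // ... start the real I/O here ...
    std::size_t bytes = 42;         // placeholder result
    bool error = false;
    loop.completions.push(std::bind(callback, bytes, error));
}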

The idea is that the supplied callback parameter can be used to add the invoking coroutine to the task ready queue. This special callback is called the awakener, and it is meant to be tightly bound to the scheduler, as it needs access to the FIFO queue. Calling the async function with the awakener fulfills the roles of starting the operation and adding the coroutine to the operation wait queue. The mark-as-waiting step must once again be done in strict cooperation with the scheduler. Once this has been done, the coroutine can yield back to the scheduler. We define the operation call_and_wait, which performs all the operations described so far. It takes as a parameter an async function object, and returns as its result the tuple of arguments that the async function would pass to its callback. call_and_wait calls the async function, passing to it the appropriate awakener, and then yields. When the operation is completed, the awakener is invoked, which in turn adds the coroutine to the ready queue. The scheduler will eventually resume the coroutine right inside call_and_wait, which will extract the result from the awakener and return it to the caller. This operation needs access to internal scheduler data structures, thus it is best implemented as a scheduler member function. The following code performs an async read:

scheduler& sched = ...;
stream_type stream = ...;
buffer_type buffer = ...;
...
size_t len;
bool error;
tie(len, error) = sched.call_and_wait<size_t, bool>(self, bind(async_read, buffer, stream, _1));
Note that the return type of call_and_wait (which equals the argument type list of the awakener) must be explicitly specified. Also note that, while internally implemented with a yield, call_and_wait is not a general yield point. We do not want to accidentally yield to a coroutine waiting for a not-yet-complete operation, thus yielding is disabled (that is, the coroutine is not considered stopped at a yield point) until the awakener callback has been called.
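Putting the pieces together, a hedged sketch of call_and_wait as a scheduler member might look as follows; awakener<>, self_type, mark_as_waiting and result() are hypothetical names, and the result is limited to two types for brevity.

// Sketch only: awakener<>, self_type and mark_as_waiting are hypothetical.
template<typename T1, typename T2, typename AsyncFunction>
tuple<T1, T2> scheduler::call_and_wait(self_type& self, AsyncFunction async_call) {
    awakener<T1, T2> wake(*this, self); // when invoked with the results, stores
                                        // them in scheduler-owned storage and
                                        // pushes the coroutine on the ready queue
    mark_as_waiting(self);              // disable yielding to this coroutine
                                        // until wake has fired
    async_call(wake);                   // start the operation
    self.yield();                       // back to the scheduler
    return wake.result();               // resumed here after wake() has run
}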
Multiple Pending Operations
call_and_wait makes it possible to easily transform an asynchronous call into a synchronous call. With this operation any function based state machine can be transformed into a coroutine based state machine. But call_and_wait actually implements two different operations: the async call and the waiting. We will now reverse the composition process and split call_and_wait into these two operations: the async call and the synchronization point. This is desirable because the same task often needs to perform more than one operation concurrently. Consider for example a task whose job is to forward data from one pipe to another and vice versa. Using call_and_wait (or threads), the job must be assigned to two different coroutines even though it logically belongs to a single task:
function forwarder(source, dest) {
    while(source not empty) {
        call_and_wait(bind(async_read, source, data))
        call_and_wait(bind(async_write, dest, data))
    }
}

stream A, B
scheduler add coroutine lambda () { forwarder(A, B) }
scheduler add coroutine lambda () { forwarder(B, A) }
run scheduler
Splitting the actual call from the wait makes it possible to use a single coroutine.
function bidirectional_forwarder(A, B) {
    future_a_read = false; future_b_write = true
    future_b_read = false; future_a_write = true
    while(B not empty or A not empty) {
        if(future_b_write == true) {
            future_b_write = false
            future_a_read = call(bind(async_read, A, data_ab))
        }
        if(future_a_read == true) {
            future_a_read = false
            future_b_write = call(bind(async_write, B, data_ab))
        }
        if(future_a_write == true) {
            future_a_write = false
            future_b_read = call(bind(async_read, B, data_ba))
        }
        if(future_b_read == true) {
            future_b_read = false
            future_a_write = call(bind(async_write, A, data_ba))
        }
        wait_any(future_a_read, future_b_read, future_a_write, future_b_write)
    }
}

stream A, B
scheduler add coroutine lambda () { bidirectional_forwarder(A, B) }
run scheduler
Future objects are used to hold the result of an async function. They will be false until the operation completes, at which point they will hold the result of the operation (or true if the operation returns nothing).

The code is considerably more complicated, but still manageable. In fact, what makes event driven code hard to read and understand is not the async calls themselves, but the subdivision of state handlers across a myriad of functions. When a small number of concurrent async functions are used inside a self contained piece of code, they can help a lot. This is why this design does not try to hide the async complexity. Note that the previous code used the operation wait_any which, as the name implies, waits for any one of the futures to become true. A wait_all operation is also possible, which waits until all futures become true. As a use case, consider the following pseudocode, which simply performs two simultaneous writes and waits for their completion (much simpler to understand than the previous snippet):

stream A, B
data_a = ...
data_b = ...
future_a = call(bind(async_write, A, data_a))
future_b = call(bind(async_write, B, data_b))
wait_all(future_a, future_b)

The C++ implementation of call, wait_any and wait_all is similar to that of call_and_wait (in fact the latter will probably be implemented with the help of these more basic operations). Awakeners can still be used, but care must be taken in the wait_all case to wake up the coroutine only when all pending operations have completed. The previous pseudocode would be written in C++ as:

stream_type A = ...;
stream_type B = ...;
buffer_type buffer_a = ...;
buffer_type buffer_b = ...;
...
future<size_t, bool> result_a = sched.call<size_t, bool>(bind(async_write, A, buffer_a));
future<size_t, bool> result_b = sched.call<size_t, bool>(bind(async_write, B, buffer_b));
sched.wait_all(result_a, result_b);
assert(result_a); // convertible to bool, will be true here
assert(result_b); // convertible to bool, will be true here
size_t len;
bool res;
tie(len, res) = *result_a; // boost::optional interface

In C++ the future holds the result arguments that the async call passes to the awakener. Note that it is illegal not to capture the result of sched.call in a future (because the actual async call is done inside the future, where the awakener is bound to the address of the result storage). A bound future object cannot be destroyed before it has been joined with a wait. Finally note that wait is a special yield point, in the same way that call_and_wait is.

Boost.Asio

The Asio asynchronous network library has recently been accepted into Boost. This library not only provides useful asynchronous I/O services, but also some very powerful and generic async patterns. Asio's async functions have exactly the interface expected by the call and call_and_wait operations. The asio io_service works as the lowest level event pump needed by the coroutine scheduler, whose implementation is thereby greatly simplified (little more than a coroutine queue is needed). Asio's thread safety guarantees will also come in handy when we mix coroutines and threads.
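As a small, hedged illustration of that role (step_scheduler is a hypothetical handler that would invoke the next ready coroutine; post and run are real io_service members):

boost::asio::io_service io;
io.post(step_scheduler); // queue one scheduling step as an ordinary handler
io.run();                // pump I/O completions and queued handlers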

Coroutines And Threads

In the "Coroutine or Threads" section we have discussed how threads and coroutines are orthogonal to each others, but we did not say anything about using coroutines in a threaded program. The argument will be discussed here and it will be shown how Boost.Asio makes it easy to use coroutines in a threaded program.

Conclusions

Coroutines are a useful addition to C++ that could open the possibility of new and interesting idioms not common in mainstream languages. While many cooperative userspace thread libraries are available for C++, most of them do not implement a real coroutine interface.

Your author intends to implement the interface shown in this article to produce a working coroutine library. Currently only an early prototype is available in the SourceForge CVS. It has a different (and more limited) interface than the one proposed here, but it demonstrates that coroutines can be implemented efficiently and can be made to work very well with the Boost.Asio library.

Appendix A: Coroutine Implementation

As the lifetimes of subroutines are nested (i.e. the last called subroutine is the first to return), the most natural choice is to implement them using a single call stack, as most languages do. Coroutines on the other hand have no fixed lifetime and can call each other in dynamic patterns. This implies that multiple per-coroutine stacks must be used, and a way is needed to switch the CPU context from one stack to another. In practice most operating systems already provide such a mechanism: Windows supports fibers, while UNIX variants provide the swapcontext family of functions. Where these APIs are not available, the setjmp/longjmp C routines can usually be hijacked to perform the required stack switch, as described in the Portable Multithreading paper from the GNU Pth developers.
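The following minimal, self contained example shows the swapcontext mechanism switching between the main stack and one coroutine stack:

#include <cstdio>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;

static void coro_body() {
    std::puts("coroutine: first entry");
    swapcontext(&coro_ctx, &main_ctx);   // yield back to main
    std::puts("coroutine: resumed");
}                                        // returning resumes uc_link (main_ctx)

int main() {
    static char stack[64 * 1024];        // the coroutine's private stack
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = sizeof stack;
    coro_ctx.uc_link = &main_ctx;        // context to resume when the body returns
    makecontext(&coro_ctx, coro_body, 0);

    swapcontext(&main_ctx, &coro_ctx);   // first invocation
    std::puts("main: coroutine yielded");
    swapcontext(&main_ctx, &coro_ctx);   // resume it until it returns
    std::puts("main: coroutine finished");
}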

In theory, the most problematic issue is exception handling. Almost all systems (Windows being the notable exception) do not specify their exception handling mechanism, and altering the stack and registers might adversely affect it. In practice, on most systems, the handling seems to be predictable and works well with context switching. The few systems that break can be handled with specific workarounds.

Appendix B: Coroutine Performance.

Context switching between coroutines usually requires saving all registers (instruction pointer included), switching the stack pointer, restoring the registers and finally resuming execution at the new instruction pointer. In contrast, subroutine calls are not required to save all registers, and the subroutine call instruction is usually heavily optimized on modern microprocessor architectures. The cost of a context switch is at least as much as a call through a function pointer, which on many architectures defeats branch prediction optimizations. Also, switching the stack might be more expensive than a simple (stack pointer) register swap, as stack related instructions can be fast-pathed and thus slowed down by stack switching. The new stack might also not be in the CPU cache, requiring a very expensive cache miss to load it.

Even considering these facts, coroutine performance seems very promising. It is too early to give hard estimates, but simple tests with the prototype implementation show a 300% penalty on a null scheduling test (that is, scheduling a set of empty coroutines through an asio demuxer) versus the same test done with empty function objects. When the coroutines actually do something, so that the context switching overhead is amortized (as in a token passing test where many coroutines pass a token around over TCP/IP channels), the penalty is less than 5% versus the same test implemented with functions. Performance can be changed more by playing with exotic compiler optimization options than by switching from coroutines to plain functions.

Other experiments have been done to eliminate system calls from the context switch path. For example, most swapcontext implementations make a syscall to save and restore the signal mask. This is not necessary for coroutines because we make no guarantees about the signal mask. Modifying the GNU glibc swapcontext source code not to perform the syscall increases raw context switching performance by about 30%. Portable syscall-free context switching can be done with _setjmp and _longjmp. Experiments have also been done with stack prefetching, using specialized assembler instructions found on modern systems. Unfortunately those tests have proven inconclusive so far (probably because they do not stress the cache subsystem hard enough for cache locality to make a difference).

Appendix D: Article TODO List.
Appendix E: Changelog.

Giovanni P. Deretta
gpderetta at gmail dot com