While some high-level languages have native support for coroutines, most languages in wide use today offer neither language nor standard library support for them. C++ is one of these. This article presents a pure library design for coroutines that requires no language support.
Unfortunately, the kind of coroutines implementable as a library in C++ is subject to some limitations, which are discussed later in this article.
A theoretical description of coroutines is beyond the scope of this article (refer to the Wikipedia article for that).
Be advised that the definition of coroutines and continuations used here might not exactly match the definition given in other sources.
Examples are shown as pseudocode or as actual C++ code (some examples use both styles). The pseudocode is written in a non-existent typeless language. Variables need not be declared before use, and sometimes operations are written in plain English. Pseudocode looks like this:
C++ code is written almost as real code, except that functions and objects are never qualified, functions can take a variable number of arguments, and often code is omitted. Also note that the examples make heavy use of std::tr1::bind and std::tr1::tuple. C++ code looks like this:
The three basic coroutine operations are creation, invocation and yielding:
The reason for the self parameter will be shown later; for now it will suffice to say that it is a non-const reference to the coroutine object and it is passed to the body by the coroutine itself.
It is important to note that while the coroutine constructor is templated on the function object type, the coroutine object itself is only templated on the parameter type and return type (the coroutine signature); thus two coroutines that implement different bodies but share the same signature can be freely exchanged. This parallels the std::tr1::function interface.
yield takes a single argument if the coroutine return type is not a tuple; otherwise it takes as many arguments as there are elements in the return tuple. The yield member function cannot be invoked outside the owning coroutine body.
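Putting the three operations together, usage of the interface described so far might look as follows. This is only a sketch against the hypothetical library of this article (coroutine, self, and yield are the names used in the text, not an existing API):

```
// creation: the constructor is templated on the body's type,
// the coroutine object only on its signature
coroutine<int(std::string)> coro(body);

// invocation: runs the body (or resumes it at the last yield point)
int first  = coro("hello");   // result produced by a yield or a return
int second = coro("again");

// yielding, legal only inside the owning coroutine body:
int body(coroutine<int(std::string)>& self, std::string arg) {
    self.yield(1);   // suspend; 1 becomes the result of the invocation
    return 2;        // final result: the coroutine has now exited
}
```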
In these languages, calling, from coroutine A, a subroutine B with the current continuation means that the subroutine B is invoked and the parameter passed to it is the logical state that A would have right after the call to B (captured inside a continuation function). The subroutine B never returns; instead it may resume A's state at any time by invoking the passed continuation argument. Optionally subroutine B may pass a continuation to subroutine A as a parameter.
From A's point of view, this parameter is logically returned from the call-with-current-continuation invocation. A's continuation is for all intents and purposes a subroutine itself and can in turn be called with the current continuation from B. In this case A and B effectively behave as a pair of coroutines, each able to resume the other.
A solution to this problem, which also allows applying call/cc to a non-coroutine, is the following transformation:
The transformation can be made implicit with a call/cc operator that does the automatic wrapping and yielding. As an optimization, if the target subroutine is already a coroutine, no temporary is created; call/cc simply yields to the target. In fact, on many systems call/cc could be further optimized to never require a temporary coroutine even in the subroutine case. This optimization would preclude the possibility of reentering the called subroutine, though, so its utility is questionable.
In fact call/cc can be seen as another way to create a coroutine.
While call/cc can be implemented using yield, it is not a yield point. The current coroutine has not logically yielded, and the only way to reenter it is through the continuation object. If the called subroutine returns or yields (in the case of coroutines), control is returned to the caller of the current coroutine (as with yield).
A C++ implementation of call/cc (called call_cc) is a template function that takes as its parameter a function object, which in turn takes a function object (the continuation) as its own parameter. call_cc must also be templated on the argument type of the continuation. This template parameter must be specified by the caller because in the general case it cannot be deduced. It is also the result type of the call_cc invocation.
Note that a C++ implementation will not use the exact transformation illustrated above, because we need to temporarily change the signature of the current coroutine before performing the call (or else we would be restricted in the types of subroutines that we can call). Also note that in C++ the called subroutine is not allowed to return or yield, because return type compatibility cannot be guaranteed. The only thing it can do is call the continuation.
The following C++ code:
A further option is to add a post_exit member function that will simply mark a coroutine as having a pending cancellation, but will not immediately resume the target. The next time the coroutine is resumed, the cancellation proceeds as with a normal exit.
If post_exit is called from inside the target coroutine itself, it is not yet clear whether cancellation should be performed when the coroutine next yields or at its next resumption.
This role is fulfilled by the any_coroutine wrapper. This wrapper has exactly the same interface as the standard coroutine, but invocation and yielding are not type checked until execution time. Passing an invalid parameter to any member function will result in an exception (bad_type) being thrown at run time. Note that there must be an exact match between the types of the original signature and the types passed at run time.
As a final operation, any_coroutine can be assigned back to a coroutine having the right signature. The following C++ code demonstrates its uses:
The final decision on which option to use will be taken at implementation time on a case-by-case basis.
We do not provide a syntax at this time for this functionality, because it is not yet clear which syntax would be the best, or if this functionality should be included at all.
Ideally a coroutine should have the same copy semantics as a std::tr1::function, that is, each copy is a distinct object. While theoretically possible, this is not feasible in practice, because copying would require duplicating not only the data associated with the function object implementing the coroutine, but also its internal state and stack. Considering that stack objects might (and are likely to) have non-trivial copy constructors, for this reason (and many others) stack copying is not feasible.
Other possible solutions are to make coroutines noncopyable (thus limiting their usage patterns), or to state that a copy of a coroutine is not equal to the original but is instead in the not-yet-started state. This last option is the worst because its semantics are surprising. As reference counting is required anyway for other functionality, the first option is probably the best. The same reasoning applies to assignment.
It is easy to see that coroutines can be used to simulate a cooperative threading environment. Each coroutine represents a thread. When a coroutine no longer has work to do (or is waiting for some operation to finish), it relinquishes control to the scheduler, which in turn yields to the next ready coroutine.
This model works so well that many threading libraries are in fact implemented this way. Unfortunately this has caused many to see coroutines only as a poor version of threads. This is a common misconception, and it is only aggravated by the fact that many existing coroutine libraries try to simulate a thread-like API, concentrating more on the thread aspect than on the coroutine aspect.
In fact coroutines and threads are orthogonal concepts. Real preemptive threads can preempt each other, run concurrently, and have fixed or dynamic priorities. On the other hand, threads require proper locking of shared data, which can lead to deadlock if done incorrectly; it is very hard to predict which thread will run next and when; context switches are relatively expensive; and scheduling is complex.
Coroutines instead have a strict sequential ordering dictated by the yield points, can never preempt each other (they are a form of cooperative multitasking), do not require locking of shared data, make it easier to avoid deadlock (it can be demonstrated that a synchronous rendezvous with yield can never lead to a deadlock), and provide extremely lightweight context switches and simple scheduling (if there is scheduling at all). It should be clear by now that coroutines are strong where threads are weak and vice versa.
The following points illustrate common uses of preemptive threads.
Consider a program that performs multiple tasks; each task does some data processing, then writes data to disk. After the program has completed the first batch of data processing, it will do the disk write and will be suspended until the operation is completed, even though it could start the processing of the second task. This causes all tasks to be serialized (every task must wait for the previous one to complete) and reduces concurrency. In this scenario the most obvious way to increase concurrency is to add another thread. This way, while the first thread is waiting for the write to finish, the other thread can go on and start another task. Unfortunately the second thread will eventually block on a system call too. If the first thread has finished its write, it may start processing the third task; otherwise all tasks are blocked and the CPU is wasted. A third thread is needed. Ideally a very large number of threads minimizes the chance that all of them will be blocked at the same time.
When using multiple threads, each thread can be reserved to perform a single task or, in more advanced configurations, they can be pooled to pick the first available task from a queue.
This solution is not free, because threads are a costly resource and most systems put hard limits on the number of threads that an application can spawn; also, as the number of threads increases, scheduling them becomes slower. The cost of thread switching is also a problem. Finally, preemption in this use case becomes a burden instead of an advantage, because it is not really necessary and forces locking around all shared data structures.
In fact there are better solutions to avoid blocking operations, in the form of non-blocking or asynchronous syscalls. A non-blocking syscall notifies the caller if an operation cannot be completed at the moment and returns immediately. The caller is responsible for retrying the operation at a later time; in this case the system usually provides a way to notify the caller when the operation can be completed. An asynchronous syscall initiates the operation but returns control to the caller immediately. The operation is executed asynchronously (how and when depends on the system and on the type of syscall: it might take advantage of hardware parallelism, another thread, or simply delay the execution to an idle period); the caller can be notified of the completion (or eventual failure) of the operation at a synchronization point.
Applications written around non-blocking and asynchronous syscalls are usually event driven, with a state machine for each task to be completed. This kind of application is usually hard to reason about and often bug-prone, as it requires a great amount of programmer discipline. For this reason many programmers, whenever performance is not critical, prefer to use threads instead of resorting to asynchronous programming.
The software parallelism problem can sometimes be seen as a hardware parallelism problem, because often a syscall blocks while another component of the system is doing some work independent of the CPU (for example, the DMA engine might be transferring data from a disk or network interface to memory). Still, asynchronous calls are a better solution to this problem than threads.
While the first two use cases are clearly the realm of threads, the third is where coroutines really shine, as they are well suited to implementing state machines. In this context they are used similarly to the way threads are used (i.e. to have a context that waits on operation completion), but without all the problems deriving from the use of threads. The next section will expand the coroutine model to deal better with asynchronous operations.
We have said that coroutines can be used as a better substitute for threads to handle blocking operations. We will now show how this can be done, by developing a complete coroutine-based design to deal with event driven applications.
For every task that must be handled in an event driven application, there is an associated state machine. At any given time a task is in a specific state, waiting for one or more events. When an event arrives, the receiving task switches to another state according to a function of the current state and of the event received. Switching state might also produce an output. Usually this state machine is implemented by storing the current state as a set of task data plus a pointer to the function to be invoked to process the incoming event. The function may change the task data and set the function pointer to the function that will handle the next state.
Splitting the handling of a task (i.e. all its states) across a set of unrelated multiple functions is undesirable because control is passed from function to function in a way that is hard to reason about. Badly implemented state machines easily become a maintenance nightmare. Coroutines allow us to keep a task execution flow in one specific place and greatly simplify state machine implementation.
With coroutines, the task data is stored inside the associated coroutine stack. When an event arrives, the receiving task coroutine is resumed, processes the event, possibly changing the task data, and finally yields. The yield point holds the current state.
The scheduler is the part of the design responsible for running the low level parts of the state machines: it receives events from the operating system, matches them to the coroutines that are waiting for them, adds these coroutines to a FIFO queue of ready tasks, and finally calls the first coroutine in the queue (performing the invoke operation). The currently executing coroutine will eventually return control to the scheduler by executing a return (which will terminate the task) or with a yield. Note that there is no requirement that this be the same coroutine that was invoked by the scheduler, as it could have yielded to another one. When the coroutine has returned, the scheduler will check whether it has terminated, is waiting for an event, or has simply yielded. In the first case it will simply delete the coroutine. In the second case it will add it to the list of coroutines waiting for events. In the third case the coroutine will simply be added to the back of the FIFO queue. This solution favors active tasks over waiters: as long as the FIFO queue is not empty, the scheduler will be executing a coroutine and will not receive any events from the operating system; thus a coroutine that never blocks will starve all blocked coroutines. Another option is to receive events from the system before invoking a coroutine for the second time (this can be done cleanly by delegating the event reception to another coroutine that is always ready and thus always in the FIFO queue). Which option is more appropriate depends on the system and should be a parameter of the scheduler object.
From the point of view of the scheduler all tasks are equal, and so are the coroutines that implement them. This means that all coroutines scheduled by the scheduler must have the same signature: coroutine<void(scheduler&)>.
A coroutine that wants to start an asynchronous operation needs to add itself to the wait queue of that operation, mark itself as waiting, and yield back to the scheduler. This operation requires interaction with the scheduler, and its semantics must be clearly defined.
First of all, we define an asynchronous operation as a function object that takes as its parameter another function object (the callback) and has the following semantics: upon invocation it starts the associated asynchronous operation. When the operation is completed (successfully or not), it invokes the callback, passing the result of the operation to it. It is guaranteed that the callback is not called at the time of the async function call, but only when an explicit call to an (unspecified and operation specific) synchronization function is made. Note that this definition of async function is not specific to this model; it is very general and, in varied forms, is already used in many asynchronous APIs (e.g. POSIX async I/O, Boost.Asio, etc.).
The idea is that the supplied callback parameter can be used to add the invoking coroutine to the task ready queue. This special callback is called the awakener, and it is meant to be tightly bound to the scheduler, as it needs to access the FIFO queue. Calling the async function with the awakener fulfills the roles of starting the operation and adding the coroutine to the operation wait queue. The mark-as-waiting step must once again be done in strict cooperation with the scheduler. Once this has been done, the coroutine can yield back to the scheduler. We define the operation call_and_wait, which performs all the operations described so far. It takes as a parameter an async function object, and returns as its result the tuple of the arguments that the async function would pass to its callback. call_and_wait calls the async function, passing to it the appropriate awakener. Then it yields. When the operation is completed, the awakener is invoked, which in turn adds the coroutine to the ready queue. The scheduler will eventually resume the coroutine right inside call_and_wait, which will extract the result from the awakener and return it to the caller. This operation needs access to internal scheduler data structures, so it is best implemented as a scheduler member function. The following code will perform an async read:
The code is much more complicated, but still manageable. In fact, what makes event driven code hard to read and understand is not the async calls themselves, but the subdivision of state handlers across a myriad of functions. When a small number of concurrent async functions is used inside a self contained piece of code, they may help a lot. This is why this design does not try to hide the async complexity. Note that the previous code used the operation wait_any that, as the name implies, waits for any one of the futures to become true. The wait_all operation is also possible, which obviously waits until all futures become true. As a use case, consider the following pseudocode that simply performs two simultaneous writes and waits for completion (much simpler to understand than the previous snippet):
The C++ implementation of call, wait_any and wait_all is similar to that of call_and_wait (in fact this last one will probably be implemented with the help of the other, more basic operations). Awakeners can still be used, but care must be taken in the wait_all case to wake up the coroutine only when all pending operations have completed. The previous pseudocode would be written in C++ as:
In C++ the future will hold the result arguments that the async call will pass to the awakener. Note that it is illegal not to capture the result of a call operation: the returned future is needed to retrieve the result of the async function.
The Asio asynchronous network library has recently been accepted into Boost. This library not only provides useful asynchronous I/O services, but also some very powerful and generic async patterns. Asio async functions have exactly the interface expected by the call and call_and_wait operations. The Asio io_service works as the lowest level event pump needed by the coroutine scheduler, whose implementation is greatly simplified (not much more than a coroutine queue is needed). Asio's thread safety guarantees will also be handy when we mix coroutines and threads.
In the "Coroutine or Threads" section we discussed how threads and coroutines are orthogonal to each other, but we did not say anything about using coroutines in a threaded program. That topic will be discussed here, and it will be shown how Boost.Asio makes it easy to use coroutines in a threaded program.
While this solution has been widely used in the past, the most recent versions of mainstream operating systems are moving to a purely kernel based threading model (BSD based systems seem to be the exception). The problem is trying to extend the exact semantics of kernel threads to user threads: preemption, priority, same behavior with respect to the rest of the system, correct handling of blocking calls, etc. Coroutines do not try to be threads and thus might be better suited for a simpler two level scheduler: a preemptive scheduler in the kernel and a cooperative scheduler in userspace, keeping the distinction between threads and coroutines.
Coroutines are a useful addition to C++ that could open up new, interesting idioms not common in mainstream languages. While many cooperative userspace thread libraries are available for C++, most of them do not implement a real coroutine interface. Your author intends to implement the interface shown in this article to produce a working coroutine library. Currently only an early prototype version is available in the SourceForge CVS. It has a different (and more limited) interface than the one proposed here, but it demonstrates that coroutines can be implemented efficiently and can be made to work very well with the Boost.Asio library.
As the lifetimes of subroutines are nested (i.e. the last called subroutine is the first to return), the most natural choice for implementing them is a single call stack, as most languages do. Coroutines on the other hand have no fixed lifetime and can call each other in dynamic patterns. This implies that multiple per-coroutine stacks must be used, and a way is needed to switch the CPU context from one stack to another. In practice most operating systems already provide such a mechanism. Windows supports fibers, while UNIX variants provide the swapcontext family of functions. Where these APIs are not available, the setjmp/longjmp C routines can usually be hijacked to perform the required stack switch, as described in the Portable Multithreading paper from the GNU Pth developers. The theoretically most problematic issue is exception handling. Almost all systems (Windows being the notable exception) do not define their exception handling mechanism, and altering the stack and registers might adversely influence it. In practice, on most systems, the handling seems to be predictable and works well with context switching. Those few systems that break can be dealt with using specific workarounds.
Context switching between coroutines usually requires saving all registers (instruction pointer included), switching the stack pointer, restoring the registers, and finally resuming execution at the new instruction pointer. In contrast, subroutine calls are not required to save all registers, and the subroutine call instruction is usually heavily optimized on modern microprocessor architectures. The cost of a context switch is at least as much as a call through a function pointer, which on many architectures defeats branch prediction optimizations. Also, switching the stack might be more expensive than a simple (stack pointer) register swap, as stack related instructions can be fast-pathed and so be slowed down by stack switching. The new stack might also not be in the CPU cache and thus require a very expensive cache miss to load it.
Even considering these facts, coroutine performance seems very promising: it is too early to give hard estimates, but simple tests with the prototype implementation give a 300% penalty on a simple null scheduling test (that is, scheduling a set of empty coroutines through an asio demuxer) versus the same test done with empty function objects; when the coroutines actually do something (and thus the context switching overhead is amortized, as in a token passing test where many coroutines pass a token around over TCP/IP channels), the penalty is less than 5% versus the same test implemented with functions. Performance can be improved more by playing with exotic compiler optimization options than by switching from coroutines to plain functions.
Other experiments have been done trying to eliminate system calls from the context switch path. For example, most swapcontext implementations make a syscall to save and restore the signal mask. This is not necessary for coroutines, because we do not give any guarantees about the signal mask. Modifying the source code of GNU glibc's swapcontext not to perform the syscall increases raw context switching performance by about 30%. Portable syscall-free context switching can be done with _setjmp and _longjmp. Experiments have also been done with stack prefetching using specialized assembler instructions found on modern systems. Unfortunately those tests have proven inconclusive so far (probably because the tests do not stress the cache subsystem hard enough for cache locality to make a difference).