A Coroutine Model for C++
This is Draft Version 0.9. (May 05 2006)
This article will be updated with future design changes.


From Wikipedia's coroutine article:

"In computer science, coroutines are program components that generalize subroutines to allow multiple entry points and suspending and resuming of execution at certain locations"

While some high level languages have native support for coroutines, most languages in wide use today have neither language nor standard library support. C++ is one of them. This article presents a pure coroutine library design that needs no language support.

Unfortunately the kind of coroutines implementable as a library in C++ is not a mathematical superset of subroutines (that is, in general subroutines are not coroutines), as it is in other languages. That would require language and compiler support, which is probably undesirable because it would considerably slow down normal subroutines.

A theoretical description of coroutines is beyond the scope of this article (refer to the Wikipedia article for that).

Be advised that the definition of coroutines and continuations used here might not exactly match the definition given in other sources.

Stylistic Notes

Examples are shown as pseudocode or as actual C++ code (some examples use both styles). The pseudocode is written in a non-existent typeless language. Variables need not be declared before use and some operations are written in plain English. Pseudocode looks like this:

function some_function(first_argument, second_argument, ... nth_argument) {
    /* function body here */
    some_variable = some_value
    another_function(parm)   /* invokes function another_function with argument parm */
    while(true) {
        do something         /* plain English body */
    }
    ...                      /* some code has been omitted here */
    return value
}

C++ code is written almost as real code, except that functions and objects are never qualified, functions can take a variable number of arguments, and code is often omitted. Also note that the examples make heavy use of std::tr1::bind and std::tr1::tuple. C++ code looks like this:

result_type some_function(parm_type_1 parm1, parm_type_2 parm2, ..., parm_type_n parmn)
{
    ...
}

Basic Coroutine Operations

The three basic coroutine operations are creation, invocation and yielding.
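As a hedged illustration of all three, the following sketch uses the coroutine<> template proposed later in this article; the self type and its yield member are hypothetical names for the proposed interface, not an existing library.

typedef coroutine<int(int)> coroutine_type;

// Hypothetical interface: the body receives a self reference as its first
// argument and may suspend itself through it.
int body(coroutine_type::self& self, int value) {
    // First entry point: runs on the first invocation.
    value = self.yield(value * 2);  // suspend, handing a result to the caller
    // Second entry point: execution resumes here on the next invocation.
    return value * 4;               // returning terminates the coroutine
}

coroutine_type coro(body);          // creation: bind a body, allocate a stack
int a = coro(10);                   // invocation: a == 20, coro is suspended
int b = coro(a);                    // resumes at the yield point: b == 80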

Advanced Coroutine Operations
Beyond the three basic coroutine operations, there are some additional operations that are not strictly necessary, but can be useful in practice.
Other Issues
Other operations on coroutines are possible, but it is not yet clear whether there is really a need for them. There are also a number of open issues that are not resolved at this point. Their resolution requires further thought, possibly at implementation time.
Coroutine Applications
We will show here some general applications of coroutines and how each can be implemented in C++.
Coroutines Or Threads?

It is easy to see that coroutines can be used to simulate a cooperative threading environment. Each coroutine represents a thread. When a coroutine no longer has work to do (or is waiting for some operation to finish), it relinquishes control to the scheduler, which in turn yields to the next ready coroutine.

This model works so well that many threading libraries are in fact implemented this way. Unfortunately this has led many to see coroutines merely as a poor version of threads. This is a common misconception, and it is only aggravated by the fact that many existing coroutine libraries try to simulate a thread-like API, concentrating more on the thread aspect than on the coroutine aspect.

In fact coroutines and threads are orthogonal concepts. Real preemptive threads can preempt each other, run concurrently, and have fixed or dynamic priorities. On the other hand, threads require proper locking of shared data, which can lead to deadlock if done incorrectly; it is very hard to predict which thread will run next and when; context switches are relatively expensive; and scheduling is complex.

Coroutines instead have a strict sequential ordering dictated by the yield points, can never preempt each other (they are a form of cooperative multitasking), do not require locking of shared data, and make it easier to avoid deadlock (it can be demonstrated that a synchronous rendezvous implemented with yield can never lead to a deadlock); they provide extremely lightweight context switches and simple scheduling (if there is any scheduling at all). It should be clear by now that coroutines are strong where threads are weak and vice versa.

The following points illustrate common uses of preemptive threads:

- exploiting multiple processors to run computations in parallel;
- running long computations in the background while the application stays responsive;
- waiting for slow operations (such as I/O) to complete.

While the first two use cases are clearly the realm of threads, the third is where coroutines really shine, as they are well suited to implementing state machines. In this context they are used similarly to the way threads are used (i.e. to have a context that waits for an operation to complete), but without all the problems that derive from the use of threads. The next section will expand the coroutine model to deal better with asynchronous operations.

The Event Driven Model

We have said that coroutines can be used as a better substitute for threads to handle blocking operations. We will now show how this can be done by developing a complete coroutine-based design for event driven applications.

For every task that must be handled in an event driven application, there is an associated state machine. At any given time a task is in a specific state, waiting for one or more events. When an event arrives, the receiving task switches to another state, determined as a function of the current state and of the event received. Switching state might also produce an output. Usually this state machine is implemented by storing the current state as a set of task data plus a pointer to the function to be invoked to process the incoming event. The function may change the task data and set the function pointer to the function that will handle the next state. A sketch of this encoding follows.
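The following minimal sketch shows such a hand-written state machine; the event type and the handler names are illustrative, not part of any real API.

#include <cstddef>

struct event { std::size_t size; bool last; };  // illustrative event type

struct task {
    void (task::*state)(event const&);  // current state = handler to invoke next
    std::size_t bytes_seen;             // task data

    task() : state(&task::on_header), bytes_seen(0) {}

    // Each handler may change the task data and select the next handler.
    void on_header(event const&) { bytes_seen = 0; state = &task::on_body; }
    void on_body(event const& e) {
        bytes_seen += e.size;
        if (e.last) state = &task::on_header;
    }

    void handle(event const& e) { (this->*state)(e); }  // process one event
};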

Splitting the handling of a task (i.e. all its states) across multiple unrelated functions is undesirable because control is passed from function to function in a way that is hard to reason about. Badly implemented state machines easily become a maintenance nightmare. Coroutines allow us to keep a task's execution flow in one place and greatly simplify state machine implementation.

With coroutines, the task data is stored inside the associated coroutine stack. When an event arrives, the receiving task's coroutine is resumed; it processes the event, possibly changing the task data, and finally yields. The yield point itself holds the current state.

Coroutine Scheduler

The scheduler is the part of the design responsible for running the low level parts of the state machines: it receives events from the operating system, matches them to the coroutines that are waiting for them, adds these coroutines to a FIFO queue of ready tasks and finally calls the first coroutine in the queue (performing the invoke operation). The currently executing coroutine will eventually return to the scheduler by executing a return (which terminates the task) or with a yield. Note that there is no requirement that this be the same coroutine that was invoked by the scheduler, as it could have yielded to another one.

When the coroutine has returned, the scheduler will check whether it has terminated, is waiting for an event, or has simply yielded. In the first case it will simply delete the coroutine. In the second case it will add it to the list of coroutines waiting for events. In the third case the coroutine will simply be added to the back of the FIFO queue.

This solution favors active tasks over waiters: as long as the FIFO queue is not empty, the scheduler will be executing a coroutine and will not receive any events from the operating system; thus a coroutine that never blocks will starve all blocked coroutines. Another option is to receive events from the system before invoking a coroutine for the second time (this can be done cleanly by delegating event reception to another coroutine that is always ready and thus always in the FIFO queue). Which option is more appropriate depends on the system and should be a parameter of the scheduler object.
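The following is a hedged sketch of this loop, implementing the first policy (events are pumped only when the ready queue is empty); coroutine_ptr, wait_set, event, wait_for_os_event and destroy are hypothetical placeholders for implementation details.

// Sketch only: coroutine_ptr, wait_set, event and wait_for_os_event are
// hypothetical names standing in for the machinery described above.
void scheduler::run() {
    for (;;) {
        if (ready.empty()) {                   // ready is the FIFO queue
            event e = wait_for_os_event();     // blocks until an event arrives
            ready.push_back(waiting.match(e)); // move the waiter to the queue
        }
        coroutine_ptr c = ready.front();
        ready.pop_front();
        c->invoke(*this);                      // returns on yield or termination
        if (c->terminated())
            destroy(c);                        // first case: delete the coroutine
        else if (c->is_waiting())
            waiting.add(c);                    // second case: blocked on an event
        else
            ready.push_back(c);                // third case: plain yield, requeue
    }
}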

From the point of view of the scheduler all tasks are equal, and so are the coroutines that implement them. This means that all coroutines scheduled by the scheduler must have the same signature: coroutine<void(scheduler&)>.

Starting an Asynchronous Operation

A coroutine that wants to start an asynchronous operation needs to add itself to the wait queue of that operation, mark itself as waiting, and yield back to the scheduler. This requires interaction with the scheduler, and its semantics must be clearly defined.

First of all, we define an asynchronous operation as a function object that takes as a parameter another function object (the callback) and has the following semantics: upon invocation it starts an associated asynchronous operation. When the operation completes (successfully or not), it invokes the callback, passing the result of the operation to it. It is guaranteed that the callback is not called at the time of the async function call, but only during an explicit call to an (unspecified and operation specific) synchronization function. Note that this definition of async function is not specific to this model; it is very general and, in various forms, is already used by many asynchronous APIs (e.g. POSIX asynchronous I/O, Boost.Asio).
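The following self contained sketch illustrates the convention; event_loop, poll and async_read are hypothetical names used only to show the calling pattern (std::function and std::bind stand in for their tr1 equivalents).

#include <cstddef>
#include <functional>
#include <queue>

// Hypothetical event loop; poll() plays the role of the operation specific
// synchronization function mentioned above.
struct event_loop {
    std::queue<std::function<void()> > completions;
    void poll() {
        while (!completions.empty()) {
            completions.front()();  // callbacks run here, never inside async_read
            completions.pop();
        }
    }
};

// An async function in the sense defined above: it starts the operation,
// registers the callback and returns immediately.
void async_read(event_loop& loop,
                std::function<void(std::size_t, bool)> callback) {
    // ... start the real I/O here ...
    std::size_t bytes = 42;         // placeholder result
    bool error = false;
    loop.completions.push(std::bind(callback, bytes, error));
}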

The idea is that the supplied callback parameter can be used to add the invoking coroutine to the task ready queue. This special callback is called the awakener, and it is meant to be tightly bound to the scheduler, as it needs access to the FIFO queue. Calling the async function with the awakener fulfills the roles of starting the operation and adding the coroutine to the operation wait queue. The mark-as-waiting step must once again be done in strict cooperation with the scheduler. Once this has been done, the coroutine can yield back to the scheduler. We define the operation call_and_wait, which performs all the operations described so far. It takes as a parameter an async function object, and returns as its result the tuple of arguments that the async function would pass to its callback. call_and_wait calls the async function, passing to it the appropriate awakener, and then yields. When the operation is completed, the awakener is invoked, which in turn adds the coroutine to the ready queue. The scheduler will eventually resume the coroutine right inside call_and_wait, which will extract the result from the awakener and return it to the caller. This operation needs access to internal scheduler data structures, thus it is best implemented as a scheduler member function. The following code performs an async read:

scheduler& sched = ...;
stream_type stream = ...;
buffer_type buffer = ...;
...
size_t len;
bool error;
tie(len, error) = sched.call_and_wait<size_t, bool>(self, bind(async_read, buffer, stream, _1));
Note that the return type of call_and_wait (which equals the argument type list of the awakener) must be explicitly specified. Also note that, while internally implemented with a yield, call_and_wait is not a general yield point. We do not want to accidentally yield to a coroutine waiting for a not-yet-complete operation, thus yielding is disabled (that is, the coroutine is not considered stopped at a yield point) until the awakener callback has been called.
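Putting the pieces together, a hedged sketch of call_and_wait as a scheduler member might look as follows; awakener<>, self_type, mark_as_waiting and result() are hypothetical names, and the result is limited to two types for brevity.

// Sketch only: awakener<>, self_type and mark_as_waiting are hypothetical.
template<typename T1, typename T2, typename AsyncFunction>
tuple<T1, T2> scheduler::call_and_wait(self_type& self, AsyncFunction async_call) {
    awakener<T1, T2> wake(*this, self); // when invoked with the results, stores
                                        // them in scheduler-owned storage and
                                        // pushes the coroutine on the ready queue
    mark_as_waiting(self);              // disable yielding to this coroutine
                                        // until wake has fired
    async_call(wake);                   // start the operation
    self.yield();                       // back to the scheduler
    return wake.result();               // resumed here after wake() has run
}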
Multiple Pending Operations
call_and_wait makes it possible to easily transform an asynchronous call into a synchronous call. With this operation any function based state machine can be transformed into a coroutine based state machine. But call_and_wait actually implements two different operations: the async call and the waiting. We will now reverse the composition process and split call_and_wait into these two operations: the async call and the synchronization point. This is desirable because the same task often needs to perform more than one operation concurrently. Consider for example a task whose job is to forward data from one pipe to another and vice versa. Using call_and_wait (or threads), the job must be assigned to two different coroutines even though it logically belongs to a single task:
function forwarder(source, dest) {
    while(source not empty) {
        call_and_wait(bind(async_read, source, data))
        call_and_wait(bind(async_write, dest, data))
    }
}

stream A, B
scheduler add coroutine lambda () { forwarder(A, B) }
scheduler add coroutine lambda () { forwarder(B, A) }
run scheduler
Splitting the actual call from the wait makes it possible to use a single coroutine.
function bidirectional_forwarder(A, B) {
    future_a_read = false; future_b_write = true
    future_b_read = false; future_a_write = true
    while(B not empty or A not empty) {
        if(future_b_write == true) {
            future_b_write = false
            future_a_read = call(bind(async_read, A, data_ab))
        }
        if(future_a_read == true) {
            future_a_read = false
            future_b_write = call(bind(async_write, B, data_ab))
        }
        if(future_a_write == true) {
            future_a_write = false
            future_b_read = call(bind(async_read, B, data_ba))
        }
        if(future_b_read == true) {
            future_b_read = false
            future_a_write = call(bind(async_write, A, data_ba))
        }
        wait_any(future_a_read, future_b_read, future_a_write, future_b_write)
    }
}

stream A, B
scheduler add coroutine lambda () { bidirectional_forwarder(A, B) }
run scheduler
Future objects are used to hold the result of an async function. They will be false until the operation completes, at which point they will hold the result of the operation (or true if the operation returns nothing).

The code is considerably more complicated, but still manageable. In fact, what makes event driven code hard to read and understand is not the async calls themselves, but the subdivision of state handlers across a myriad of functions. When a small number of concurrent async functions are used inside a self contained piece of code, they can help a lot. This is why this design does not try to hide the async complexity. Note that the previous code used the operation wait_any which, as the name implies, waits for any one of the futures to become true. A wait_all operation is also possible, which waits until all futures become true. As a use case, consider the following pseudocode, which simply performs two simultaneous writes and waits for their completion (much simpler to understand than the previous snippet):

stream A, B
data_a = ...
data_b = ...
future_a = call(bind(async_write, A, data_a))
future_b = call(bind(async_write, B, data_b))
wait_all(future_a, future_b)

The C++ implementation of call, wait_any and wait_all is similar to that of call_and_wait (in fact the latter will probably be implemented with the help of these more basic operations). Awakeners can still be used, but care must be taken in the wait_all case to wake up the coroutine only when all pending operations have completed. The previous pseudocode would be written in C++ as:

stream_type A = ...;
stream_type B = ...;
buffer_type buffer_a = ...;
buffer_type buffer_b = ...;
...
future<size_t, bool> result_a = sched.call<size_t, bool>(bind(async_write, A, buffer_a));
future<size_t, bool> result_b = sched.call<size_t, bool>(bind(async_write, B, buffer_b));
sched.wait_all(result_a, result_b);
assert(result_a); // convertible to bool, will be true here
assert(result_b); // convertible to bool, will be true here
size_t len;
bool res;
tie(len, res) = *result_a; // boost::optional interface

In C++ the future holds the result arguments that the async call passes to the awakener. Note that it is illegal not to capture the result of sched.call in a future (because the actual async call is done inside the future, where the awakener is bound to the address of the result storage). A bound future object cannot be destroyed before it has been joined with a wait. Finally note that wait is a special yield point, in the same way that call_and_wait is.

Boost.Asio

The Asio asynchronous network library has recently been accepted into Boost. This library not only provides useful asynchronous I/O services, but also some very powerful and generic async patterns. Asio's async functions have exactly the interface expected by the call and call_and_wait operations. The asio io_service works as the lowest level event pump needed by the coroutine scheduler, whose implementation is thereby greatly simplified (little more than a coroutine queue is needed). Asio's thread safety guarantees will also come in handy when we mix coroutines and threads.
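As a small, hedged illustration of that role (step_scheduler is a hypothetical handler that would invoke the next ready coroutine; post and run are real io_service members):

boost::asio::io_service io;
io.post(step_scheduler); // queue one scheduling step as an ordinary handler
io.run();                // pump I/O completions and queued handlers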

Coroutines And Threads

In the "Coroutine or Threads" section we have discussed how threads and coroutines are orthogonal to each others, but we did not say anything about using coroutines in a threaded program. The argument will be discussed here and it will be shown how Boost.Asio makes it easy to use coroutines in a threaded program.

Conclusions

Coroutines are a useful addition to C++ that could open the possibility of new and interesting idioms not common in mainstream languages. While many cooperative userspace thread libraries are available for C++, most of them do not implement a real coroutine interface.

Your author intends to implement the interface shown in this article to produce a working coroutine library. Currently only an early prototype is available in the SourceForge CVS. It has a different (and more limited) interface than the one proposed here, but it demonstrates that coroutines can be implemented efficiently and can be made to work very well with the Boost.Asio library.

Appendix A: Coroutine Implementation

As the lifetimes of subroutines are nested (i.e. the last called subroutine is the first to return), the most natural choice is to implement them using a single call stack, as most languages do. Coroutines on the other hand have no fixed lifetime and can call each other in dynamic patterns. This implies that multiple per-coroutine stacks must be used, and a way is needed to switch the CPU context from one stack to another. In practice most operating systems already provide such a mechanism: Windows supports fibers, while UNIX variants provide the swapcontext family of functions. Where these APIs are not available, the setjmp/longjmp C routines can usually be hijacked to perform the required stack switch, as described in the Portable Multithreading paper from the GNU Pth developers.
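The following minimal, self contained example shows the swapcontext mechanism switching between the main stack and one coroutine stack:

#include <cstdio>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;

static void coro_body() {
    std::puts("coroutine: first entry");
    swapcontext(&coro_ctx, &main_ctx);   // yield back to main
    std::puts("coroutine: resumed");
}                                        // returning resumes uc_link (main_ctx)

int main() {
    static char stack[64 * 1024];        // the coroutine's private stack
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = stack;
    coro_ctx.uc_stack.ss_size = sizeof stack;
    coro_ctx.uc_link = &main_ctx;        // context to resume when the body returns
    makecontext(&coro_ctx, coro_body, 0);

    swapcontext(&main_ctx, &coro_ctx);   // first invocation
    std::puts("main: coroutine yielded");
    swapcontext(&main_ctx, &coro_ctx);   // resume it until it returns
    std::puts("main: coroutine finished");
}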

In theory, the most problematic issue is exception handling. Almost all systems (Windows being the notable exception) do not specify their exception handling mechanism, and altering the stack and registers might adversely affect it. In practice, on most systems, the handling seems to be predictable and works well with context switching. The few systems that break can be handled with specific workarounds.

Appendix B: Coroutine Performance.

Context switching between coroutines usually requires saving all registers (instruction pointer included), switching the stack pointer, restoring the registers and finally resuming execution at the new instruction pointer. In contrast, subroutine calls are not required to save all registers, and the subroutine call instruction is usually heavily optimized on modern microprocessor architectures. The cost of a context switch is at least as much as a call through a function pointer, which on many architectures defeats branch prediction optimizations. Also, switching the stack might be more expensive than a simple (stack pointer) register swap, as stack related instructions can be fast-pathed and thus slowed down by stack switching. The new stack might also not be in the CPU cache, requiring a very expensive cache miss to load it.

Even considering these facts, coroutine performance seems very promising. It is too early to give hard estimates, but simple tests with the prototype implementation show a 300% penalty on a null scheduling test (that is, scheduling a set of empty coroutines through an asio demuxer) versus the same test done with empty function objects. When the coroutines actually do something, so that the context switching overhead is amortized (as in a token passing test where many coroutines pass a token around over TCP/IP channels), the penalty is less than 5% versus the same test implemented with functions. Performance can be changed more by playing with exotic compiler optimization options than by switching from coroutines to plain functions.

Other experiments have been done to eliminate system calls from the context switch path. For example, most swapcontext implementations make a syscall to save and restore the signal mask. This is not necessary for coroutines because we make no guarantees about the signal mask. Modifying the GNU glibc swapcontext source code not to perform the syscall increases raw context switching performance by about 30%. Portable syscall-free context switching can be done with _setjmp and _longjmp. Experiments have also been done with stack prefetching, using specialized assembler instructions found on modern systems. Unfortunately those tests have proven inconclusive so far (probably because they do not stress the cache subsystem hard enough for cache locality to make a difference).

Appendix D: Article TODO List.
Appendix E: Changelog.

Giovanni P. Deretta
gpderetta at gmail dot com