The device processes the request on its own
This is not so true in practice. Depending on the target device (as well as the OS and the driver implementation), the request may or may not be fully offloaded. In practice, a kernel thread is generally responsible for completing the request (in interaction with the IO scheduler). The actual code executed by this kernel thread and its actual behaviour are platform-dependent (a HDD, a fast NVMe SSD and a HPC NIC will clearly not behave the same way). For example, a DMA request with a polling strategy can be used to avoid hardware interrupts for low-latency devices (since interrupt handling is generally slow). Anyway, this operation is done on the OS side and users should not care much about it besides its impact on CPU usage and on latency/throughput. What matters is that requests are performed serially and the thread is de-scheduled during the IO operation.
I perform a system call, the system call asks the device to retrieve data. Now, instead of setting the thread state, the system call returns a special marker to indicate that the data is not yet ready. The thread continues its execution.
The state of asynchronous IO is complex in practice and its implementation is also platform-dependent. An API can provide asynchronous IO functions while the underlying OS does not support them. One common strategy is to spawn a progress thread that polls the state of the IO request. This is not an efficient solution, but it can be better than synchronous IO depending on the actual application (explained later). In fact, the OS can even provide a standard API for that while not fully supporting asynchronous IO in its own kernel, so an intermediate layer is responsible for hiding these discrepancies! On top of that, the behaviour also depends on the target device, not just on the target platform.
One (old) way to do asynchronous IO is to do non-blocking IO combined with polling functions like select or poll. In this case, an application can start multiple requests and then wait for them. It can even do some useful computation before waiting for the completion of the target requests (see the sketch after the list below). This is significantly better than doing one request at a time, especially for high-latency IO requests like waiting for a network message from Tokyo to Paris to be received (lasting for at least 32 ms due to the speed of light, but likely >100 ms in practice). That being said, there are several issues with this approach:
- it is hard to overlap the latency with computation well (because of many unknowns like the actual latency, the computational speed or the amount of computation);
- it scales poorly because every registered descriptor is scanned each time a request is ready (not to mention the number of descriptors is often limited and it uses a lot more OS resources than it should);
- it makes applications less maintainable due to polling loops. In many cases, these polling loops are put in a separate thread (or even a pool of threads) at the expense of a higher latency (due to additional context switches and cache misses). This strategy can actually be the one implemented by asynchronous IO libraries.
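To make this more concrete, here is a minimal sketch (in C, with error handling reduced to the bare minimum) of the non-blocking + poll approach described above. fd1 and fd2 are assumed to be two already-opened non-blocking descriptors and the function name is only for illustration:

```c
#include <poll.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: two requests have been started on fd1/fd2 (assumed to be
 * non-blocking descriptors); wait for both of them to complete. */
void wait_two_requests(int fd1, int fd2)
{
    struct pollfd fds[2] = {
        { .fd = fd1, .events = POLLIN },
        { .fd = fd2, .events = POLLIN },
    };
    char buf[4096];
    int pending = 2;

    while (pending > 0) {
        /* Some useful computation could be overlapped here before waiting. */
        if (poll(fds, 2, -1) < 0) {        /* blocks until at least one is ready */
            perror("poll");
            return;
        }
        for (int i = 0; i < 2; i++) {      /* note: every entry is re-scanned */
            if (fds[i].revents & POLLIN) {
                ssize_t n = read(fds[i].fd, buf, sizeof(buf));
                if (n <= 0) {              /* completed (EOF) or failed */
                    fds[i].fd = -1;        /* negative fd: poll ignores the entry */
                    pending--;
                }
            }
        }
    }
}
```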
In order to solve these issues, event-based asynchronous IO functions can be used instead. A good example is epoll (more specifically the edge-triggered interface). It is meant to avoid the useless scan of the many waiting requests and to only focus on the ones that are ready. As a result, it scales better (O(n) time for select/poll versus O(1) for epoll). There is no need for any active probing loop, but an event-based code doing similar things. This part can be hidden by user-side high-level software libraries. In fact, software libraries are also critical to write portable code since OSes have different asynchronous interfaces. For example, epoll is only for Linux, kqueue is for BSD and Windows also uses another method (see here for more information). Also, one thing to keep in mind is that epoll_wait is a blocking call, so while there can be more than one request pending, there is still a final synchronous wait operation. Putting it in a thread to make this operation look asynchronous from the user point-of-view can typically decrease performance (mainly latency).
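For reference, here is a minimal sketch of the edge-triggered epoll interface (Linux-specific). fd is assumed to be an already-opened non-blocking socket and error handling is kept minimal:

```c
#include <sys/epoll.h>
#include <errno.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch: register one non-blocking descriptor once, then only wake up
 * when it becomes ready (edge-triggered mode). */
void epoll_read_loop(int fd)
{
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN | EPOLLET, .data.fd = fd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev);

    struct epoll_event events[16];
    char buf[4096];
    for (;;) {
        /* Only ready descriptors are reported: no scan of all watched ones.
         * Note that this call is still a blocking (synchronous) wait. */
        int n = epoll_wait(epfd, events, 16, -1);
        for (int i = 0; i < n; i++) {
            if (events[i].events & EPOLLIN) {
                /* Edge-triggered: drain until EAGAIN, otherwise we may
                 * never be notified again for this descriptor. */
                ssize_t r;
                while ((r = read(events[i].data.fd, buf, sizeof(buf))) > 0)
                    ;
                if (r == 0 || (r < 0 && errno != EAGAIN)) {
                    close(epfd);           /* EOF or real error: stop */
                    return;
                }
            }
        }
    }
}
```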
On POSIX systems, there is the AIO API specifically designed for asynchronous operations (based on callbacks). That being said, the standard Linux implementation of AIO emulates asynchronous IO using threads internally because the kernel did not have any fully asynchronous interface to do that until recently. In the end, this is not much better than using threads yourself to process asynchronous IO requests. In fact, AIO can be slower because it performs more kernel calls. Fortunately, Linux recently introduced a new kernel interface for asynchronous IO: io_uring. This new interface is currently the most efficient one on Linux. It is not really meant to be used directly (as it is very low-level). For more information on the difference between AIO and io_uring, please read this. Note that io_uring is pretty new so AFAIK it is not used by many high-level libraries yet.
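As an illustration, here is a minimal sketch of an asynchronous read done through liburing, the thin user-space wrapper over the raw io_uring interface (it needs to be linked with -luring, error handling is omitted for brevity and /etc/hostname is just an arbitrary readable file):

```c
#include <liburing.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);

    int fd = open("/etc/hostname", O_RDONLY);
    char buf[256];

    /* Queue an asynchronous read: nothing blocks here. */
    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_read(sqe, fd, buf, sizeof(buf), 0);
    io_uring_submit(&ring);

    /* ... useful computation can be overlapped here ... */

    /* Reap the completion (this call blocks until it is available). */
    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);
    printf("read returned %d\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    close(fd);
    io_uring_queue_exit(&ring);
    return 0;
}
```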
In the end, an asynchronous call from a high-level library can result in several system calls or context switches. When used, completion threads can also have a strong impact on CPU usage, the latency of the operation, cache misses, etc. This is why asynchronous IO is not always so great in practice performance-wise, not to mention asynchronous IO often requires the target application to be structured pretty differently.
Does asynchronous IO have any performance / resource benefits?
This is dependent on the use case, but asynchronous IO can drastically improve the performance of a wide range of applications. Actually, all applications able to start multiple requests simultaneously can benefit from asynchronous IO, especially when the target requests last for a while (HDD requests, network ones, etc.). If you are working with high-latency devices, this is the key point and you can forget about all the other overheads, which are negligible (e.g. an HDD seek lasts about a dozen milliseconds while a context switch generally lasts a few microseconds, that is at least 2 orders of magnitude less). For low-latency devices, the story is more complex because the many overheads may not be negligible: the best is to try on your specific platform.
As for the provided points that might hurt performance, they are dependent on the underlying interface used and possibly on the device (and so on the platform).
For example, nothing forces the implementation to call callbacks on different threads. The point about cache misses caused by the callbacks is probably the least of your problems after doing a system call that is far more expensive, not to mention modern CPUs have pretty big caches nowadays. Thus, unless you have a very large set of callbacks to call or very large callback codes, you should not see a statistically significant performance impact due to this point.
With interfaces like io_uring, the number of system calls is not really a problem anymore. In fact, AFAIK, io_uring will likely perform better than all the other interfaces. For example, you can create a chain of IO operations, avoiding some callbacks and ping-pongs between the user application and the kernel. Besides, io_uring_enter can wait for an IO request and submit a new one at the same time.
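Here is a sketch of this (still through liburing; in_fd/out_fd and the function name are assumptions): a read is linked to a write so the write only starts once the read has completed, and both requests are submitted and waited for with a single io_uring_enter call.

```c
#include <liburing.h>

/* Sketch: chain a read and a write (in_fd/out_fd are assumed to be
 * already-opened descriptors). The linked write only starts once the
 * read has completed; a real program would adjust the write length to
 * the number of bytes actually read. */
void copy_chunk(struct io_uring *ring, int in_fd, int out_fd)
{
    char buf[4096];

    struct io_uring_sqe *sqe = io_uring_get_sqe(ring);
    io_uring_prep_read(sqe, in_fd, buf, sizeof(buf), 0);
    io_uring_sqe_set_flags(sqe, IOSQE_IO_LINK);   /* link to the next SQE */

    sqe = io_uring_get_sqe(ring);
    io_uring_prep_write(sqe, out_fd, buf, sizeof(buf), 0);

    /* One system call (io_uring_enter under the hood) both submits the
     * chain and waits for the two completions. */
    io_uring_submit_and_wait(ring, 2);

    struct io_uring_cqe *cqe;
    for (int i = 0; i < 2; i++) {
        io_uring_wait_cqe(ring, &cqe);            /* already completed here */
        io_uring_cqe_seen(ring, cqe);
    }
}
```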
Does reactive programming / coroutines, which takes this idea even further (all code runs as events on the worker threads) have any performance benefit?
With coroutines, nothing is run in a separate system thread. This is a common misunderstanding. A coroutine is a function that can be paused. The pause is based on a continuation mechanism: the registers, including the code pointer, are temporarily stored in memory (pause) so they can be restored back later (resume). Such an operation happens in the same thread. Coroutines typically also have their own stack. Coroutines are similar to fibers.
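A minimal way to see this pause/resume mechanism in plain C is the (old, POSIX/glibc) ucontext API, which saves/restores the registers and uses a user-provided stack, much like a coroutine or a fiber. Everything below runs on a single thread:

```c
#include <stdio.h>
#include <ucontext.h>

static ucontext_t main_ctx, coro_ctx;
static char coro_stack[64 * 1024];          /* the coroutine's own stack */

static void coro_body(void)
{
    printf("coroutine: step 1\n");
    swapcontext(&coro_ctx, &main_ctx);      /* pause: save registers, go back */
    printf("coroutine: step 2 (resumed)\n");
}                                           /* returning jumps to uc_link */

int main(void)
{
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof(coro_stack);
    coro_ctx.uc_link = &main_ctx;
    makecontext(&coro_ctx, coro_body, 0);

    swapcontext(&main_ctx, &coro_ctx);      /* run until the coroutine pauses */
    printf("main: coroutine paused, same thread keeps running\n");
    swapcontext(&main_ctx, &coro_ctx);      /* resume the coroutine */
    printf("main: coroutine finished\n");
    return 0;
}
```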
Mechanisms similar to coroutines (continuations) are used to implement asynchronous functions in programming languages (some may argue that they are actually coroutines). For example, async/await in C# (and many other languages) works that way. In this case, an asynchronous function can start an IO request and be paused while it is waiting on it, so another asynchronous function can start other IO requests until there is no asynchronous function left to run. The language runtime can then wait for IO requests to be completed so as to restart the target asynchronous functions that were awaiting them. Such a mechanism makes asynchronous programming much easier. It is not meant to make things fast (despite using asynchronous IO). In fact, coroutines/continuations have a slight overhead so they can be slower than using a low-level asynchronous API, but the overhead is generally much smaller than the one of the IO request latency and even generally smaller than the one of a context switch (or even a system call).
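To illustrate the general mechanism (not any particular language runtime), here is a self-contained sketch of the kind of state machine an async/await compiler conceptually generates; the IO request is only simulated with a countdown so the example stays runnable:

```c
#include <stdio.h>

/* Sketch: an "async function" rewritten by hand into a resumable state
 * machine, with the IO request simulated by a simple countdown. */
typedef struct {
    int state;             /* which await point to resume at */
    int io_ticks_left;     /* simulated pending IO request */
    int result;
} async_task;

/* One step of the async function: returns 1 when completed, 0 while it
 * is still awaiting the (simulated) IO request. */
static int async_task_step(async_task *t)
{
    switch (t->state) {
    case 0:                            /* start: issue the IO request */
        t->io_ticks_left = 3;
        t->state = 1;
        return 0;                      /* "await": give control back */
    case 1:                            /* resumed after the await point */
        if (t->io_ticks_left-- > 0)
            return 0;                  /* IO still pending, keep awaiting */
        t->result = 42;                /* the "read" completed */
        t->state = 2;
        return 1;
    default:
        return 1;
    }
}

int main(void)
{
    async_task a = {0}, b = {0};
    int done_a = 0, done_b = 0;
    /* A tiny single-threaded scheduler interleaving two async functions. */
    while (!done_a || !done_b) {
        if (!done_a) done_a = async_task_step(&a);
        if (!done_b) done_b = async_task_step(&b);
    }
    printf("results: %d %d\n", a.result, b.result);
    return 0;
}
```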
I am not very familiar with reactive programming but AFAIK it is meant to simplify the implementation of programs having a large set of dependent operations with incremental updates. This seems pretty orthogonal to asynchronous programming to me. A good implementation can benefit from asynchronous operations but this is not the main goal of this approach. The benefit of the approach is to only update the things that need to be updated, in a declarative way, no more. Incremental updates are critical for performance as recomputing the whole dataset can be much more expensive than recomputing a small part, depending on the target application. This is especially true in GUIs.
One thing to keep in mind is that asynchronous programming can improve performance thanks to concurrency, but this is only useful if the latency of the target asynchronous operations can be mitigated. Making a compute-bound code concurrent is useless performance-wise (and actually almost certainly detrimental due to the concurrency overhead) because there is no latency issue (assuming you are not operating at the granularity of CPU instructions).
Why do we actually do reactive programming?
As said above, it is a good solution to perform incremental updates of a complex dataflow. Not all applications benefit from this.
Programming models are like tools: developers need to pick the best one so as to address the specific problems of a given application. Otherwise, this is the recipe for a disaster. Unfortunately, it is not rare for people to use programming models that are not well suited for their needs. There are many reasons for this (historical, psychological, technical, etc.) but this is too broad a topic and this answer is already pretty long.
Note that using threads to do asynchronous operations is generally not a great idea. This is one way to implement asynchronous programming, and not an efficient one (especially without a thread pool). It often introduces more issues than it solves. For example, you may need to protect variables with locks (or any other synchronization mechanism) to avoid race conditions; care about (low-level) operations that cannot be executed on separate threads; and consider the overheads of the TLS, the ones due to cache misses, inter-core communications, possible context switches and NUMA effects (not to mention the target cores can be sleeping, operating at a lower frequency, etc.).
Related post: