
First of all, some assumptions on which the question is based:

Synchronous IO: When I need to make a read IO operation, I perform a read system call on the file descriptor. The CPU goes into privileged mode and kernel code executes, which (via the device driver) asks the device to retrieve my data and puts my thread into a BLOCKED state. Finally, the scheduler is run and another thread takes the CPU core my thread was running on.
The device processes the request on its own. Once it is done, it triggers an interrupt on the CPU. The interrupt handler executes kernel code which sets the status of my thread to READY (this is obviously a big simplification). My thread now has a chance to be scheduled when the kernel scheduler runs.
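
For concreteness, here is a minimal C sketch of this synchronous model (the file name is hypothetical); the calling thread simply blocks inside read() until the kernel wakes it up:

    /* Synchronous read: the thread is BLOCKED inside read() until the
     * kernel (woken by the device) has copied the data into buf. */
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        char buf[4096];
        int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
        if (fd < 0) { perror("open"); return 1; }

        ssize_t n = read(fd, buf, sizeof buf); /* may block here */
        if (n < 0) { perror("read"); return 1; }

        printf("read %zd bytes\n", n);
        close(fd);
        return 0;
    }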

Asynchronous IO: I perform a system call, the system call asks the device to retrieve data. Now, instead of setting the thread state, the system call returns a special marker to indicate that the data is not yet ready. The thread continues its execution.
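
A minimal sketch of that "special marker" behaviour, using a non-blocking pipe as a stand-in for a slow device; read() returns immediately with EAGAIN/EWOULDBLOCK instead of blocking:

    #include <errno.h>
    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        if (pipe(fds) < 0) { perror("pipe"); return 1; }

        /* Make the read end non-blocking. */
        fcntl(fds[0], F_SETFL, O_NONBLOCK);

        char buf[64];
        ssize_t n = read(fds[0], buf, sizeof buf); /* nothing written yet */
        if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
            printf("data not ready yet, the thread keeps running\n");

        close(fds[0]);
        close(fds[1]);
        return 0;
    }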

We usually do not use such a system call directly, but instead use some wrapper function (provided by a library) that takes a callback as a parameter. This library also spawns a thread that multiplexes over the file descriptors from all pending calls (epoll, kqueue, ...). Once some fds become ready, this thread schedules an appropriate callback on some kind of thread pool of worker threads (which run event loops / task loops).

If some of the above is not right I am more than happy to be corrected!

Now, onto the questions:

1. Does asynchronous IO have any performance / resource benefits?
From my knowledge, in contrast to a full process context switch, switching between threads of the same process is quite inexpensive (no address-space change). The CPU will still be fully utilized if there is enough work (another thread will be scheduled).

Here are the things I can think of:

  • memory utilization - fewer threads means less memory allocated for stacks and thread-related data structures in the kernel.
  • scheduling overhead - I guess scheduling of threads by the kernel might be quite costly.

But there are also some things that I think might hurt the performance with async IO:

  • We perform more syscalls in total (one for requesting an operation, another one for awaiting the result)
  • The callbacks need to be scheduled onto workers
  • Jumping to arbitrary locations when executing callbacks might mess with cache?

2. Do reactive programming / coroutines, which take this idea even further (all code runs as events on the worker threads), have any performance benefit?

3. Why do we actually do reactive programming?

It really just seems to me that Reactive Programming builds an additional layer of abstraction on top of something that is already supposed to be an abstraction for developers to work with (processes and threads), which brings a lot of additional complexity. Sometimes it might seem to make sense, for example if we want a separate UI thread. The problem is that, from my perspective, this pattern is basically an alternative approach to synchronization - we could accomplish the same thing just by firing up a thread that acquires a UI lock.

I just fail to see what it is in the traditional approach to concurrency that has led to the creation of reactive programming frameworks.

I will be really grateful for all explanations and sources which touch on this.

xsw2_2wsx
  • FWIW, my #1 problem with the threaded/synchronous approach is that when a thread is blocked waiting for an I/O operation to complete, there is no good way for that thread to respond to anything else. It's effectively paralyzed until the operation completes. For filesystem I/O operations that's not usually a problem since filesystem operations can usually be relied upon to complete (or error out) quickly in any case, but for network I/O, there's no guarantee that any given response will arrive soon, or at all; so it's easy to end up with "stuck" threads that you can't use or get rid of. – Jeremy Friesner Apr 10 '23 at 14:54

2 Answers


The device processes the request on its own

This is not so true in practice. Depending on the target device (as well as the OS and the driver implementation), the request may or may not be fully offloaded. In practice, a kernel thread is generally responsible for completing the request (in interaction with the IO scheduler). The actual code executed by this kernel thread and its actual behaviour are platform-dependent (an HDD, a fast NVMe SSD and an HPC NIC will clearly not behave the same way). For example, a DMA request with a polling strategy can be used to avoid hardware interrupts for low-latency devices (since interrupt handling is generally slow). Anyway, this operation is done on the OS side and users should not care much about it beyond the impact on CPU usage and on latency/throughput. What matters is that requests are performed serially and the thread is de-scheduled during the IO operation.

I perform a system call, the system call asks the device to retrieve data. Now, instead of setting the thread state, the system call returns a special marker to indicate that the data is not yet ready. The thread continues its execution.

The state of asynchronous IO is complex in practice and its implementation is also platform-dependent. An API can provide asynchronous IO functions while the underlying OS does not support them. One common strategy is to spawn a progress thread that polls the state of the IO request. This is not an efficient solution, but it can be better than synchronous IO depending on the actual application (explained later). In fact, the OS can even provide a standard API for that while not fully supporting asynchronous IO in its own kernel, so an intermediate layer is responsible for hiding these discrepancies! On top of that, the target device also matters, depending on the target platform.

One (old) way to do asynchronous IO is to do non-blocking IO combined with polling functions like select or poll (a minimal sketch follows the list below). In this case, an application can start multiple requests and then wait for them. It can even do some useful computation before waiting for the completion of the target requests. This is significantly better than doing one request at a time, especially for high-latency IO requests like waiting for a network message from Tokyo to Paris to be received (lasting at least 32 ms due to the speed of light, but likely >100 ms in practice). That being said, there are several issues with this approach:

  • it is hard to overlap the latency with computation well (because of many unknowns, like the latency itself, the computation speed and the amount of computation);
  • it scales poorly because every descriptor is scanned whenever any request is ready (not to mention the number of descriptors is often limited and it uses a lot more OS resources than it should);
  • it makes applications less maintainable due to the polling loops. In many cases, the polling loop is put in a separate thread (or even a pool of threads) at the expense of a higher latency (due to additional context switches and cache misses). This strategy can actually be the one implemented by asynchronous IO libraries.
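
Here is the minimal sketch of that approach in C, using two pipes as stand-ins for slow devices; select() blocks until at least one descriptor becomes readable, and the whole set is scanned on every call:

    #include <stdio.h>
    #include <sys/select.h>
    #include <unistd.h>

    int main(void) {
        int a[2], b[2];
        pipe(a);
        pipe(b);
        write(b[1], "x", 1);             /* pretend one request completed */

        fd_set readable;
        FD_ZERO(&readable);
        FD_SET(a[0], &readable);
        FD_SET(b[0], &readable);
        int nfds = (a[0] > b[0] ? a[0] : b[0]) + 1;

        /* Blocks until at least one descriptor is ready; every registered
         * descriptor is scanned, hence the cost mentioned above. */
        int ready = select(nfds, &readable, NULL, NULL, NULL);
        printf("%d descriptor(s) ready\n", ready);

        if (FD_ISSET(b[0], &readable)) {
            char c;
            read(b[0], &c, 1);
            printf("request on b completed\n");
        }
        return 0;
    }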

In order to solve these issues, event-based asynchronous IO functions can be used instead. A good example is epoll (more specifically the edge-triggered interface). It is meant to avoid the useless scan of many waiting requests and only report the ones that are ready. As a result, it scales better (O(n) time for select/poll vs. O(1) for epoll). There is no need for any active probing loop, just event-based code doing similar things. This part can be hidden by user-side high-level software libraries. In fact, software libraries are also critical to write portable code since OSes have different asynchronous interfaces. For example, epoll is only for Linux, kqueue is for BSD and Windows uses yet another method (see here for more information). Also, one thing to keep in mind is that epoll_wait is a blocking call, so while there can be more than one request pending, there is still a final synchronous wait operation. Putting it in a thread to hide this operation from the user's point of view can typically decrease performance (mainly latency).
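
For comparison, a minimal epoll sketch (Linux only): descriptors are registered once and epoll_wait only returns the ones that are ready, regardless of how many are registered:

    #include <stdio.h>
    #include <sys/epoll.h>
    #include <unistd.h>

    int main(void) {
        int fds[2];
        pipe(fds);
        write(fds[1], "hello", 5);

        int ep = epoll_create1(0);
        struct epoll_event ev = {0};
        ev.events = EPOLLIN | EPOLLET;     /* edge-triggered */
        ev.data.fd = fds[0];
        epoll_ctl(ep, EPOLL_CTL_ADD, fds[0], &ev);

        /* Blocking call: only descriptors with pending events are returned. */
        struct epoll_event events[8];
        int n = epoll_wait(ep, events, 8, -1);
        for (int i = 0; i < n; i++) {
            char buf[16];
            ssize_t r = read(events[i].data.fd, buf, sizeof buf);
            printf("fd %d ready, read %zd bytes\n", events[i].data.fd, r);
        }
        close(ep);
        return 0;
    }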

On POSIX systems, there is the AIO API specifically designed for asynchronous operation (based on callbacks). That being said, the standard Linux implementation of AIO emulates asynchronous IO using threads internally because the kernel did not have any fully asynchronous interface to do that until recently. In the end, this is not much better than using threads yourself to process asynchronous IO requests. In fact, AIO can be slower because it performs more kernel calls. Fortunately, Linux recently introduced a new kernel interface for asynchronous IO: io_uring. This new interface is currently the best option on Linux. It is not meant to be used directly (as it is very low-level; wrappers such as liburing exist). For more information on the difference between AIO and io_uring, please read this. Note that io_uring is pretty new, so AFAIK it is not used by many high-level libraries yet.
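
As an illustration, here is a minimal single-read sketch using the liburing wrapper around io_uring (the file name is hypothetical; compile with -luring on a recent Linux kernel):

    #include <fcntl.h>
    #include <liburing.h>
    #include <stdio.h>

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(8, &ring, 0);

        int fd = open("data.bin", O_RDONLY);   /* hypothetical file */
        char buf[4096];

        /* Queue the read request and submit it to the kernel... */
        struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
        io_uring_prep_read(sqe, fd, buf, sizeof buf, 0);
        io_uring_submit(&ring);

        /* ...do other useful work here, then collect the completion. */
        struct io_uring_cqe *cqe;
        io_uring_wait_cqe(&ring, &cqe);
        printf("read returned %d\n", cqe->res);
        io_uring_cqe_seen(&ring, cqe);

        io_uring_queue_exit(&ring);
        return 0;
    }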

In the end, an asynchronous call from a high-level library can result in several system calls or context switches. When used, completion threads can also have a strong impact on CPU usage, the latency of the operation, cache misses, etc. This is why asynchronous IO is not always so great in practice performance-wise, not to mention that asynchronous IO often requires the target application to be structured quite differently.

Does asynchronous IO have any performance / resource benefits?

This depends on the use case, but asynchronous IO can drastically improve the performance of a wide range of applications. Actually, all applications able to start multiple requests simultaneously can benefit from asynchronous IO, especially when the target requests last for a while (HDD requests, network ones, etc.). If you are working with high-latency devices, this is the key point and you can forget about all the other overheads, which are negligible (e.g. an HDD seek lasts about a dozen milliseconds while a context switch generally lasts a few microseconds, that is, several orders of magnitude less). For low-latency devices, the story is more complex because the many overheads may not be negligible: the best is to try it on your specific platform.

As for the provided points that might hurt performance, they depend on the underlying interface used and possibly the device (and so the platform). For example, nothing forces an implementation to call callbacks on different threads. The cache misses caused by the callbacks are probably the least of your problems after doing a system call, which is far more expensive, not to mention that modern CPUs have pretty big caches nowadays. Thus, unless you have a very large set of callbacks to call or very large callback codes, you should not see a statistically significant performance impact from this point. With interfaces like io_uring, the number of system calls is not really a problem anymore. In fact, AFAIK, io_uring will likely perform better than all other interfaces. For example, you can create a chain of IO operations, avoiding some callbacks and ping-pong between the user application and the kernel. Besides, io_uring_enter can wait for an IO request and submit a new one at the same time.

Do reactive programming / coroutines, which take this idea even further (all code runs as events on the worker threads), have any performance benefit?

With coroutines, nothing is run in a separate system thread. This is a common misunderstanding. A coroutine is a function that can be paused. The pause is based on a continuation mechanism: registers, including the code pointer, are temporarily stored in memory (pause) so they can be restored back later (resume). Such an operation happens in the same thread. Coroutines typically also have their own stack. Coroutines are similar to fibers.
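
To illustrate the continuation mechanism (not a real coroutine library, just the underlying idea), here is a small sketch using the POSIX ucontext API: a function with its own stack is paused and resumed later, all on the same thread:

    #include <stdio.h>
    #include <ucontext.h>

    static ucontext_t main_ctx, coro_ctx;
    static char coro_stack[64 * 1024];      /* the "coroutine" has its own stack */

    static void coroutine(void) {
        printf("coroutine: started, now pausing\n");
        swapcontext(&coro_ctx, &main_ctx);  /* store registers, jump back to main */
        printf("coroutine: resumed, now finishing\n");
    }

    int main(void) {
        getcontext(&coro_ctx);
        coro_ctx.uc_stack.ss_sp = coro_stack;
        coro_ctx.uc_stack.ss_size = sizeof coro_stack;
        coro_ctx.uc_link = &main_ctx;       /* where to go when the function returns */
        makecontext(&coro_ctx, coroutine, 0);

        swapcontext(&main_ctx, &coro_ctx);  /* start the coroutine */
        printf("main: coroutine paused itself, resuming it\n");
        swapcontext(&main_ctx, &coro_ctx);  /* resume it until completion */
        printf("main: done\n");
        return 0;
    }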

Mechanisms similar to coroutines (continuations) are used to implement asynchronous functions in programming languages (some may argue that they actually are coroutines). For example, async/await in C# (and many other languages) does that. In this case, an asynchronous function can start an IO request and be paused while it is waiting on it, so another asynchronous function can start other IO requests until there is no asynchronous function left to run. The language runtime can then wait for IO requests to be completed so as to resume the asynchronous functions that were awaiting them. Such a mechanism makes asynchronous programming much easier. It is not meant to make things fast (despite using asynchronous IO). In fact, coroutines/continuations have a slight overhead, so they can be slower than using a low-level asynchronous API, but the overhead is generally much smaller than the latency of the IO request and even generally smaller than that of a context switch (or even a system call).

I am not very familiar with reactive programming, but AFAIK it is meant to simplify the implementation of programs having a large set of dependent operations with incremental updates. This seems pretty orthogonal to asynchronous programming to me. A good implementation can benefit from asynchronous operations, but this is not the main goal of the approach. The benefit of the approach is to only update the things that need to be updated, in a declarative way, and no more. Incremental updates are critical for performance, as recomputing the whole dataset can be much more expensive than recomputing a small part, depending on the target application. This is especially true in GUIs.

One thing to keep in mind is that asynchronous programming can improve performance thanks to concurrency, but this is only useful if there is some latency in the target asynchronous operation to mitigate (i.e. to overlap with other work). Making compute-bound code concurrent is useless performance-wise (and often even detrimental due to the concurrency overhead) because there is no latency issue (assuming you are not operating at the granularity of CPU instructions).

Why do we actually do reactive programming?

As said above, it is a good solution to perform incremental updates of a complex dataflow. Not all applications benefit from this.

Programming models are like tools: developers need to pick the best one to address the specific problems of a given application. Otherwise, it is a recipe for disaster. Unfortunately, it is not rare for people to use programming models not well suited to their needs. There are many reasons for this (historical, psychological, technical, etc.), but this is too broad a topic and this answer is already pretty long.

Note that using threads to do asynchronous operations is generally not a great idea. It is one way to implement asynchronous programming, but not an efficient one (especially without a thread pool). It often introduces more issues than it solves. For example, you may need to protect variables with locks (or any synchronization mechanism) to avoid race conditions; care about (low-level) operations that cannot be executed on a separate thread; and consider the overheads of TLS, cache misses, inter-core communications, possible context switches and NUMA effects (not to mention the target cores can be sleeping, operating at a lower frequency, etc.).



Jérôme Richard

Does asynchronous IO have any performance / resource benefits?

In a way. As you noted, asynchronous code can often be slower than synchronous code. There's more overhead in terms of setting up the callback structures and whatnot.

However, asynchronous code is more scalable, precisely because it doesn't block threads unnecessarily. Experiments on web servers running real-world-ish code showed a significant increase in scalability when switching from synchronous to asynchronous code.

In summary, asynchronous code isn't about performance, but about scalability.

Why do we actually do reactive programming?

Reactive programming is quite different. Asynchronous code is still pull-based; i.e., your app requests some I/O operation, and some time later that operation completes. Reactive code is push-based; a more natural example would be something like a listening socket or a WebSocket connection that can push commands at any time.

With reactive code, the code defines how it reacts to incoming events. The structure of the code is more declarative rather than imperative. Reactive frameworks have a way to declare how to react to events, "subscribe" to those events, and then "unsubscribe" from the events when done.
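
As a language-agnostic illustration of that subscribe/publish/unsubscribe shape (all names here are made up), a bare-bones sketch with plain C function pointers:

    #include <stdio.h>

    typedef void (*handler_t)(int event_value);

    static handler_t subscribers[8];
    static int subscriber_count = 0;

    static void subscribe(handler_t h) { subscribers[subscriber_count++] = h; }
    static void unsubscribe_all(void)  { subscriber_count = 0; }

    /* The event source pushes values to whoever is currently subscribed. */
    static void publish(int value) {
        for (int i = 0; i < subscriber_count; i++)
            subscribers[i](value);
    }

    static void on_message(int value) { printf("received event: %d\n", value); }

    int main(void) {
        subscribe(on_message);   /* declare how to react to incoming events */
        publish(42);             /* the source pushes events at any time */
        publish(7);
        unsubscribe_all();       /* stop reacting when done */
        publish(99);             /* no longer delivered */
        return 0;
    }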

It's possible to structure asynchronous code as reactive (the I/O request is the "subscription", there is only one event, which is the completion of that request, followed by an unsubscription). But this is not normally done for all asynchronous code; it's only typical when there's already a significant amount of code using the declarative reactive style and that code wants to perform asynchronous operations while keeping the same style.

Any asynchronous code can be written in a reactive style, but that's not normally done for complexity/maintainability reasons. Reactive code tends to be more difficult to understand and maintain.

Stephen Cleary