Is there really no asynchronous block I/O on Linux?

Question

Consider an application that is CPU bound, but also has high-performance I/O requirements.

I'm comparing Linux file I/O to Windows, and I can't see how epoll will help a Linux program at all. The kernel will tell me that the file descriptor is "ready for reading," but I still have to call blocking read() to get my data, and if I want to read megabytes, it's pretty clear that that will block.

On Windows, I can create a file handle with OVERLAPPED set, and then use non-blocking I/O, and get notified when the I/O completes, and use the data from that completion function. I need to spend no application-level wall-clock time waiting for data, which means I can precisely tune my number of threads to my number of cores, and get 100% efficient CPU utilization.

If I have to emulate asynchronous I/O on Linux, then I have to allocate some number of threads to do this, and those threads will spend a little bit of time doing CPU things, and a lot of time blocking for I/O, plus there will be overhead in the messaging to/from those threads. Thus, I will either over-subscribe or under-utilize my CPU cores.

I looked at mmap() + madvise() (WILLNEED) as a "poor man's async I/O" but it still doesn't get all the way there, because I can't get a notification when it's done -- I have to "guess" and if I guess "wrong" I will end up blocking on memory access, waiting for data to come from disk.

Linux seems to have the starts of async I/O in io_submit, and it seems to also have a user-space POSIX aio implementation, but it's been that way for a while, and I know of nobody who would vouch for these systems for critical, high-performance applications.

The Windows model works roughly like this:

Issue an asynchronous operation.
Tie the asynchronous operation to a particular I/O completion port.
Wait on operations to complete on that port
When the I/O is complete, the thread waiting on the port unblocks, and returns a reference to the pending I/O operation.

Steps 1/2 are typically done as a single thing. Steps 3/4 are typically done with a pool of worker threads, not (necessarily) the same thread as issues the I/O. This model is somewhat similar to the model provided by boost::asio, except boost::asio doesn't actually give you asynchronous block-based (disk) I/O.

The difference to epoll in Linux is that in step 4, no I/O has yet happened -- it hoists step 1 to come after step 4, which is "backwards" if you know exactly what you need already.

Having programmed a large number of embedded, desktop, and server operating systems, I can say that this model of asynchronous I/O is very natural for certain kinds of programs. It is also very high-throughput and low-overhead. I think this is one of the remaining real shortcomings of the Linux I/O model, at the API level.

I don't know the MS Windows model at all, so I can't compare, but I would just point out that if you are using any form of `select`/`poll`/`epoll`/`kqueue` then it would be VERY unusual to follow up with blocking `read`/`write` when you get a notification that a file descriptor is ready. You almost certainly want to do a non-blocking `read` or `write` there. — Celada, Nov 17 '12 at 04:49
select() was invented for sockets, together with the recv() and send() system calls, that guarantee to not block if select() returns them as ready -- with the draw-back that the amount of I/O is not guaranteed. You may get only a few bytes. The problem is that you can't be reactive with this model. Non-blocking I/O is not efficient, because it requires the kernel to pre-fetch "some amount" of data, and the kernel has no idea how much you will need. If you need just a page, and the kernel fetches a megabyte, you lose. If you need a megabyte, and the kernel fetches a single page, you also lose. — Jon Watte, Nov 19 '12 at 01:15
possible duplicate of [Linux Disk File AIO](http://stackoverflow.com/questions/8513663/linux-disk-file-aio) — J-16 SDiZ, Nov 20 '12 at 05:57
@JonWatte I believe that you would issue a modified read and ask for a certain amount of bytes, and then get notified when the buffer is filled. — LtWorf, Aug 16 '15 at 09:37
@LtWorf: That is the asynchronous "read()" model. I'm looking for the asynchronous (not non-blocking) "recv()" model. — Jon Watte, Aug 17 '15 at 15:38
With an adaptive-sized threadpool you won't have a lot of blocking thread, only a small number of them doing real work. — peterh, Sep 05 '15 at 01:50

Anon · Answer 1 · 2022-08-31T20:27:25.410

(2020) If you're using a 5.1 or above Linux kernel you can use the io_uring interface for file-like I/O and obtain excellent asynchronous operation.

Compared to the existing libaio/KAIO interface, io_uring has the following advantages:

Retains asynchronous behaviour when doing buffered I/O (and not just when doing direct I/O)
Easier to use (especially when using the liburing helper library)
Can optionally work in a polled manner (but you'll need higher privileges to enable this mode)
Less bookkeeping space overhead per I/O
Lower CPU overhead due to fewer userspace/kernel syscall mode switches (a big deal these days due to the impact of spectre/meltdown mitigations)
File descriptors and buffers can be pre-registered to save mapping/unmapping time
Faster (can achieve higher aggregate throughput, I/Os have a lower latency)
"Linked mode" can express dependencies between I/Os (>=5.3 kernel)
Can work with socket based I/O (recvmsg()/sendmsg() are supported from >=5.3, see messages mentioning the word support in io_uring.c's git history)
Supports attempted cancellation of queued I/O (>=5.5)
Can request that I/O always be performed from asynchronous context rather than the default of only falling back to punting I/O to an asynchronous context when the inline submission path triggers blocking (>=5.6 kernel)
Growing support for performing asynchronous operations beyond read/write (e.g. fsync (>=5.1), fallocate (>=5.6), splice (>=5.7) and more)
Higher development momentum
Doesn't become blocking each time the stars aren't perfectly aligned

Compared to glibc's POSIX AIO, io_uring has the following advantages:

Much faster and more efficient (the lower overhead benefits from above apply even more here)
Interface is kernel backed and DOESN'T use a userspace thread pool
Less copies of the data are made when doing buffered I/O
No wrestling with signals
Glibc's POSIX AIO can't have more than one I/O in flight on a single file descriptor whereas io_uring most certainly can!

The Efficient IO with io_uring document goes into far more detail as to io_uring's benefits and usage. The What's new with io_uring document describes new features added to io_uring between the 5.2 - 5.5 kernels, while The rapid growth of io_uring LWN article describes which features were available in each of the 5.1 - 5.5 kernels with a forward glance to what was going to be in 5.6 (also see LWN's list of io_uring articles). There's also a Faster IO through io_uring Kernel Recipes videoed presentation (slides) from late 2019 and What’s new with io_uring Kernel Recipes videoed presentation (slides) from mid 2022 by io_uring author Jens Axboe. Finally, the Lord of the io_uring tutorial gives an introduction to io_uring usage.

The io_uring community can be reached via the io_uring mailing list and the io_uring mailing list archives show daily traffic at the start of 2021.

Re "support partial I/O in the sense of recv() vs read()": a patch went into the 5.3 kernel that will automatically retry io_uring short reads and a further commit went into the 5.4 kernel that tweaks the behaviour to only automatically take care of short reads when working with "regular" files on requests that haven't set the REQ_F_NOWAIT flag (it looks like you can request REQ_F_NOWAIT via IOCB_NOWAIT or by opening the file with O_NONBLOCK). Thus you can get recv() style- "short" I/O behaviour from io_uring too.

Software/projects using `io_uring`

Though the interface is young (its first incarnation arrived in May 2019), some open-source software is using io_uring "in the wild":

fio (which is also authored by Jens Axboe) has an io_uring ioengine backend (in fact it was introduced back in fio-3.13 from February 2019!). The "Improved Storage Performance Using the New Linux Kernel I/O Interface SNIA presentation" (slides) by two Intel engineers states they were able to get double the IOPS on one workload and less than half the average latency at a queue depth of 1 on another workload when comparing the io_uring ioengine to the libaio ioengine on an Optane device.
The SPDK project added support for using io_uring (!) for block device access in its v19.04 release (but obviously this isn't the backend you'd typically use SPDK for other than benchmarking). More recently, they also seem to have added support for using it with sockets in v20.04...
Ceph committed an io_uring backend in Dec 2019 which was part of its 15.1.0 release. The commit author posted a github comment showing some io_uring backend has some wins and losses versus the libaio backend (in terms of IOPS, bandwidth and latency) depending on workload.
RocksDB committed an io_uring backend for MultiRead in Dec 2019 and was part of its 6.7.3 release. Jens states io_uring helped to dramatically cut latency.
libev released 4.31 with an initial io_uring backend in Dec 2019. While some of the author's original points were addressed in newer kernels, at the time of writing (mid 2021) libev's author has some choice words about io_uring's maturity and is taking a wait-and-see approach before implementing further improvements.
QEMU committed an io_uring backend in Jan 2020 and was part of the QEMU 5.0 release. In the "io_uring in QEMU: high-performance disk IO for Linux" PDF presentation Julia Suvorova shows the io_uring backend outperforming the threads and aio backends on one workload of random 16K blocks.
Samba merged an io_uring VFS backend in Feb 2020 and was part of the Samba 4.12 release. In the "Linux io_uring VFS backend." Samba mailing list thread, Stefan Metzmacher (the commit author) says the io_uring module was able to push roughly 19% more throughput (compared to some unspecified backend) in a synthetic test. You can also read the "Async VFS Future" PDF presentation by Stefan for some of the motivation behind the changes.
Facebook's experimental C++ libunifex uses it (but you will also need a 5.6+ kernel)
The rust folk have been writing wrappers to make io_uring more accessible to pure rust. rio is one library talked about a bit and the author says they achieved higher throughput compared to using sync calls wrapped in threads. The author gave a presentation about his database and library at FOSDEM 2020 which included a section extolling the virtues of io_uring.
The rust library Glommio exclusively uses io_uring. The author (Glauber Costa) published a document called Modern storage is plenty fast. It is the APIs that are bad showing that with careful tuning Glommio could get over 2.5 times the performance over regular (non-io_uring) syscalls when performing sequential I/O on an Optane device.
Gluster merged an io_uring posix xlator in Oct 2020 and was part of the Gluster 9.0 release. The commit author mentions performance was "not any worse than regular pwrite/pread syscalls".

Software investigating using `io_uring`

PostgreSQL developer Andres Freund has been one of the driving forces behind io_uring improvements (e.g. the workaround to reduce for filesystem inode contention). There is a presentation "Asynchronous IO for PostgreSQL" (be aware the video is broken until the 5 minute mark) (PDF) motivating the need for PostgreSQL changes and demonstrating some experimental results. He has expressed hope of getting his optional io_uring support into PostgreSQL 14 and seems acutely aware of what does and doesn't work even down to the kernel level. In December 2020, Andres further discusses his PostgreSQL io_uring work in the "Blocking I/O, async I/O and io_uring" pgsql-hackers mailing list thread and mentions the work in progress can be seen over in https://github.com/anarazel/postgres/tree/aio .
The Netty project has an incubator repo working on io_uring support which needs a 5.9 kernel
libuv has a pull request against it adding io_uring support but its progress into the project has been slow
SwiftNIO added io_uring support for eventing (but not syscalls) in April 2020 and the Linux: full io_uring I/O issue outlines plans to integrate it further
The Tokio Rust project has developed a proof of concept tokio-uring

Linux distribution support for `io_uring`

(Late 2020) Ubuntu 18.04's latest HWE enablement kernel is 5.4 so io_uring syscalls can be used. This distro doesn't pre-package the liburing helper library but you can build it for yourself.
Ubuntu 20.04's initial kernel is 5.4 so io_uring syscalls can be used. As above, the distro doesn't pre-package liburing.
Fedora 32's initial kernel is 5.6 and it has a packaged liburing so io_uring is usable.
SLES 15 SP2 has a 5.3 kernel so io_uring syscalls can be used. This distro doesn't pre-package the liburing helper library but you can build it for yourself.
(Mid 2021) RHEL 8's default kernel does not support io_uring (a previous version of this answer mistakenly said it did). There is an Add io_uring support Red Hat knowledge base article (contents is behind a subscriber paywall) that is "in progress".
(Mid 2022) RHEL 9's default kernel does not support io_uring. The kernel is new enough (5.14) but support for io_uring is explicitly disabled.

Hopefully io_uring will usher in a better asynchronous file-like I/O story for Linux.

(To add a thin veneer of credibility to this answer, at some point in the past Jens Axboe (Linux kernel block layer maintainer and inventor of io_uring) thought this answer might be worth upvoting :-)

That looks quite awesome, now one just needs the mainstream "usable" distros to use the 5.1 kernel too, which I guess will, depending on which distro, probably be somewhere between 2020 and 2022 (Ubuntu and Fedora being at 5.0 presently, Debian at 4.19, SuSE at 4.12). Looking forward, this will be a great addition. — Damon, Aug 12 '19 at 09:16
Both Fedora (https://fedoraproject.org/wiki/Kernel_Vanilla_Repositories ) and Ubuntu (https://wiki.ubuntu.com/Kernel/MainlineBuilds ) offer extension repos that contain **vanilla** kernels up 5.3 (but obviously you enable any extra repos at your own risk etc). SUSE also seems to have something (https://software.opensuse.org/package/kernel-vanilla ) but it's not clear what's inside. I'd guess Ubuntu 20.10 would likely have a suitable a kernel and Fedora rebases kernels semi-frequently so Fedora 30 is highly likely to get a post 5.1 kernel (which means Fedora 31 will certainly have something). — Anon, Aug 12 '19 at 14:36

score 18 · Accepted Answer · answered Feb 20 '14 at 17:39

18

The real answer, which was indirectly pointed to by Peter Teoh, is based on io_setup() and io_submit(). Specifically, the "aio_" functions indicated by Peter are part of the glibc user-level emulation based on threads, which is not an efficient implementation. The real answer is in:

io_submit(2)
io_setup(2)
io_cancel(2)
io_destroy(2)
io_getevents(2)

Note that the man page, dated 2012-08, says that this implementation has not yet matured to the point where it can replace the glibc user-space emulation:

http://man7.org/linux/man-pages/man7/aio.7.html

this implementation hasn't yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.

So, according to the latest kernel documentation I can find, Linux does not yet have a mature, kernel-based asynchronous I/O model. And, if I assume that the documented model is actually mature, it still doesn't support partial I/O in the sense of recv() vs read().

answered Feb 20 '14 at 17:39

Jon Watte

6,579
4
53
63

In linux kernel, the philosophy is that nothing is "stable" and mature, as the community is always ready to change its API when necessary. But anything published in userspace like http://man7.org man pages, the kernel have to follow and never break userspace. That statement is from Linus himself. But if you say now that kernel (read here: http://lxr.free-electrons.com/source/include/linux/syscalls.h line 474) has the syscall API, then we can always use syscall() to call the kernel directly, irregardless whether io_submit() (and its family) API has been implemented in glibc. – Peter Teoh Feb 22 '14 at 08:11
2

The problem I have with your answer is that you recommend the "aio_xxx()" functions API as "the async I/O API." However, that API is not actually backed by the Linux syscalls, but instead implemented using user-space threads in glibc. At least, according to the man page. Thus, the "aio_xxx()" functions are not an answer to the question -- they are the cause for the question. The right answer is the functions starting with "io_setup()". – Jon Watte Feb 23 '14 at 18:15
By aio_xxx() all along I am referring to kernel API. So of course there is no "Linux syscall" standards for it, neither is there any manpages for it (meaning the official Linux API), which is strictly ONLY for userspace. In general, all kernel API does not go into any "standard API" list, as it is always said (in mailing list, eg, https://lkml.org/lkml/2006/6/14/164) that kernel API are quite fluid, but neither is changes so easily approved either. – Peter Teoh Feb 24 '14 at 01:46

score 3 · Answer 3 · edited May 23 '17 at 10:30

As explained in:

http://code.google.com/p/kernel/wiki/AIOUserGuide

and here:

http://www.ibm.com/developerworks/library/l-async/

Linux does provide async block I/O at the kernel level, APIs as follows:

aio_read    Request an asynchronous read operation
aio_error   Check the status of an asynchronous request
aio_return  Get the return status of a completed asynchronous request
aio_write   Request an asynchronous operation
aio_suspend Suspend the calling process until one or more asynchronous requests have completed (or failed)
aio_cancel  Cancel an asynchronous I/O request
lio_listio  Initiate a list of I/O operations

And if you asked who are the users of these API, it is the kernel itself - just a small subset is shown here:

./drivers/net/tun.c (for network tunnelling):
static ssize_t tun_chr_aio_read(struct kiocb *iocb, const struct iovec *iv,

./drivers/usb/gadget/inode.c:
ep_aio_read(struct kiocb *iocb, const struct iovec *iov,

./net/socket.c (general socket programming):
static ssize_t sock_aio_read(struct kiocb *iocb, const struct iovec *iov,

./mm/filemap.c (mmap of files):
generic_file_aio_read(struct kiocb *iocb, const struct iovec *iov,

./mm/shmem.c:
static ssize_t shmem_file_aio_read(struct kiocb *iocb,

etc.

At the userspace level, there is also the io_submit() etc API (from glibc), but the following article offer an alternative to using glibc:

http://www.fsl.cs.sunysb.edu/~vass/linux-aio.txt

It directly implement the API for functions like io_setup() as direct syscall (bypassing glibc dependencies), a kernel mapping via the same "__NR_io_setup" signature should exist. Upon searching the kernel source at:

http://lxr.free-electrons.com/source/include/linux/syscalls.h#L474 (URL is applicable for the latest version 3.13) you are greeted with the direct implementation of these io_*() API in the kernel:

474 asmlinkage long sys_io_setup(unsigned nr_reqs, aio_context_t __user *ctx);
475 asmlinkage long sys_io_destroy(aio_context_t ctx);
476 asmlinkage long sys_io_getevents(aio_context_t ctx_id,
481 asmlinkage long sys_io_submit(aio_context_t, long,
483 asmlinkage long sys_io_cancel(aio_context_t ctx_id, struct iocb __user *iocb,

The later version of glibc should make these usage of "syscall()" to call sys_io_setup() unnecessary, but without the latest version of glibc, you can always make these call yourself if you are using the later kernel with these capabilities of "sys_io_setup()".

Of course, there are other userspace option for asynchronous I/O (eg, using signals?):

http://personal.denison.edu/~bressoud/cs375-s13/supplements/linux_altIO.pdf

or perhap:

What is the status of POSIX asynchronous I/O (AIO)?

"io_submit" and friends are still not available in glibc (see io_submit manpages), which I have verified in my Ubuntu 14.04, but this API is linux-specific.

Others like libuv, libev, and libevent are also asynchronous API:

http://nikhilm.github.io/uvbook/filesystem.html#reading-writing-files

http://software.schmorp.de/pkg/libev.html

http://libevent.org/

All these API aimed to be portable across BSD, Linux, MacOSX, and even Windows.

In terms of performance I have not seen any numbers, but suspect libuv may be the fastest, due to its lightweightedness?

https://ghc.haskell.org/trac/ghc/ticket/8400

Thanks for the answer. It seems aio_suspend() cannot mix with event or select style I/O, though. For example: Is there no aio version of recv()? — Jon Watte, Feb 18 '14 at 17:31
not really. If you google "asynchronous recv" (without the quotes) you will get many answer, eg this one: http://www.wangafu.net/~nickm/libevent-book/01_intro.html. Within it did provide an example of non-blocking recv(), through the use of "fnctl()". but then it highlight that now the program spin very fast into a loop, using up all the CPU cycles. select() and libevent API are the alternatives proposed within to solve the above problem. — Peter Teoh, Feb 20 '14 at 01:22
I am very familiar with non-blocking socket I/O. Non-blocking is not the same as asynchronous. And read is not the same as recv. What I'm looking for is a way to push all I/O through the same "queue a request, then later get notification about what queued requests have completed" mechanism for all I/O, like I/O completion ports on Windows. Just like I stated in the question ;-) — Jon Watte, Feb 20 '14 at 17:22
A function that was missing in the above list is io_getevents(). With that addition, the set of primitives starts looking more like what I expect. Now, if there was a clear code example from some robust user-level application using this, that'd make my day! — Jon Watte, Feb 20 '14 at 17:29
Actually, reading more, this is not a true answer: "The current Linux POSIX AIO implementation is provided in user space by glibc." The right answer is based on io_setup() and io_submit(). — Jon Watte, Feb 20 '14 at 17:34
the userspace implementation of aio scales terribly but is generally good enough for simple uses. For better results you can increase the number of threads with aio_init (glibc specific), and to get a completion notification in a poll/ppoll/epoll thread you can either use signal-based notifications and either signalfd or deal with the interruption to the call through an atomic variable in the signal handler. despite being a userspace implementation, it *is* asynchronous — nfries88, Jan 20 '23 at 07:16

J-16 SDiZ · Answer 4 · 2012-11-19T01:48:40.733

1

For network socket i/o, when it is "ready", it don't block. That's what the O_NONBLOCK and "ready" means.

For disk i/o, we have posix aio, linux aio, sendfile and friends.

edited Nov 19 '12 at 01:48

answered Nov 19 '12 at 01:41

J-16 SDiZ

26,473
4
65
84

1

posix aio is implemented in userland in glibc, using threads, so no, it's not true AIO. Linux AIO (io_submit) is something I want to hear more about, but I've seen nobody actually use it for anything, which to me means there be dragons there. That's part of what this question is trying to suss out. sendfile() has nothing to do with asynchronous disk-based I/O. I'd be happy to accept your answer if it actually contributes to a solution -- but notice that I already mentioned io_submit in my question. – Jon Watte Nov 20 '12 at 02:52
7

Linux AIO is not unused... For example, InnoDB (MySQL) use `io_submit`. – J-16 SDiZ Nov 20 '12 at 05:57
Also: when it comes to network sockets, you should look at the difference between read() and recv(). io_submit() seems to support the read() semantic, not the recv() semantic. O_NONBLOCK is never needed for sockets if you use recv() and proper select() or event notification. – Jon Watte May 18 '14 at 16:19

Is there really no asynchronous block I/O on Linux?

4 Answers4

Software/projects using io_uring

Software investigating using io_uring

Linux distribution support for io_uring

Linked

Software/projects using `io_uring`

Software investigating using `io_uring`

Linux distribution support for `io_uring`