
I am looking for the most efficient way to do asynchronous file I/O on Linux.

The POSIX glibc implementation uses threads in userland.

The native AIO kernel API only works with unbuffered operations. Kernel patches that add support for buffered operations exist, but they are more than three years old and no one seems interested in integrating them into the mainline.

I found plenty of other ideas, concepts, and patches that would allow asynchronous I/O, though most of them appear in articles that are also more than three years old. How much of all this is actually available in today's kernel? I've read about syslets, acalls, approaches using kernel threads, and more things I don't even remember right now.

What is the most efficient way to do buffered asynchronous file input/output in today's kernel?

skaffman
Marenz
  • (2020) If your kernel is new enough (5.1+) you can [use `io_uring` and get good **buffered** asynchronous file I/O on Linux](https://stackoverflow.com/a/57451551/2732969). – Anon Sep 13 '20 at 19:07

4 Answers


Unless you want to write your own IO thread pool, the glibc implementation is an acceptable solution. It actually works surprisingly well for something that runs entirely in userland.
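
For reference, a minimal sketch of what using that glibc implementation looks like. The file name and buffer size are illustrative, error handling is mostly omitted, and older glibc needs `-lrt` at link time:

```c
#include <aio.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY);
    char buf[4096];

    struct aiocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_fildes = fd;
    cb.aio_buf    = buf;
    cb.aio_nbytes = sizeof buf;
    cb.aio_offset = 0;

    aio_read(&cb);                       /* glibc hands this to a worker thread */

    /* ... do other work here while the read is in flight ... */

    const struct aiocb *list[1] = { &cb };
    aio_suspend(list, 1, NULL);          /* wait for completion */

    ssize_t n = aio_return(&cb);         /* bytes read, or -1 */
    printf("read %zd bytes\n", n);
    close(fd);
    return 0;
}
```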

The kernel implementation does not work with buffered IO at all in my experience (though I've seen other people say the opposite!). That is fine if you want to read huge amounts of data via DMA, but of course it sucks big time if you plan to take advantage of the buffer cache.
Also note that the kernel AIO calls may actually block. There is a command buffer of limited size, and large reads are broken up into several smaller ones. Once the queue is full, asynchronous commands run synchronously. Surprise. I ran into this problem a year or two ago and could not find an explanation. Asking around got me the "yeah, of course, that's how it works" answer.
From what I understand, the "official" interest in supporting buffered AIO is not terribly great either, even though several working solutions seem to have been available for years. Some of the arguments I've read were along the lines of "you don't want to use the buffers anyway", "nobody needs that", and "most people don't even use epoll yet". So, well... meh.
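
To make the above concrete, here is a sketch of the kernel AIO interface, using raw syscall(2) since glibc provides no wrappers for it. The file name, queue depth, and alignment are illustrative; note the O_DIRECT requirement:

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <linux/aio_abi.h>
#include <stdlib.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

int main(void)
{
    aio_context_t ctx = 0;
    syscall(SYS_io_setup, 128, &ctx);      /* at most 128 requests in flight */

    /* Unbuffered only: O_DIRECT, with suitably aligned buffer/offset/length. */
    int fd = open("data.bin", O_RDONLY | O_DIRECT);
    void *buf;
    posix_memalign(&buf, 4096, 4096);

    struct iocb cb;
    memset(&cb, 0, sizeof cb);
    cb.aio_lio_opcode = IOCB_CMD_PREAD;
    cb.aio_fildes     = fd;
    cb.aio_buf        = (unsigned long)buf;
    cb.aio_nbytes     = 4096;
    cb.aio_offset     = 0;

    struct iocb *cbs[1] = { &cb };
    /* If the 128-slot queue were already full, this "asynchronous" call
     * would block -- the surprise described above. */
    syscall(SYS_io_submit, ctx, 1, cbs);

    struct io_event ev;
    syscall(SYS_io_getevents, ctx, 1, 1, &ev, NULL);  /* reap the result */
    syscall(SYS_io_destroy, ctx);
    return 0;
}
```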

Being able to get epoll signalled by a completed async operation was another issue until recently, but in the meantime this works really fine via eventfd.
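
A sketch of that eventfd mechanism, here using libaio's wrappers (link with -laio) around the same kernel AIO syscalls; file name and sizes are again illustrative, error handling omitted:

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <libaio.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

int main(void)
{
    io_context_t ctx = 0;
    io_queue_init(8, &ctx);

    int efd  = eventfd(0, EFD_NONBLOCK);
    int epfd = epoll_create1(0);
    struct epoll_event ev = { .events = EPOLLIN, .data.fd = efd };
    epoll_ctl(epfd, EPOLL_CTL_ADD, efd, &ev);

    int fd = open("data.bin", O_RDONLY | O_DIRECT);  /* kernel AIO: unbuffered */
    void *buf;
    posix_memalign(&buf, 4096, 4096);

    struct iocb cb, *cbs[1] = { &cb };
    io_prep_pread(&cb, fd, buf, 4096, 0);
    io_set_eventfd(&cb, efd);              /* completion bumps the eventfd counter */
    io_submit(ctx, 1, cbs);

    struct epoll_event out;
    epoll_wait(epfd, &out, 1, -1);         /* wakes when the read completes */

    uint64_t ndone;
    read(efd, &ndone, sizeof ndone);       /* number of completed requests */

    struct io_event events[1];
    io_getevents(ctx, 1, 1, events, NULL); /* fetch the actual result */

    io_destroy(ctx);
    return 0;
}
```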

Note that the glibc implementation will actually spawn threads on demand inside __aio_enqueue_request. That is probably no big deal, since spawning threads is no longer terribly expensive, but one should be aware of it. If your mental model of starting an asynchronous operation is "returns immediately", that assumption may not hold, because the call may be spawning some threads first.

EDIT:
As a side note, Windows has a situation very similar to the one in the glibc AIO implementation, where the "returns immediately" assumption about queuing an asynchronous operation is not true.
If all the data you wanted to read is in the buffer cache, Windows will decide to run the request synchronously instead, because it will finish immediately anyway. This is well documented and admittedly sounds great, too. Except that if there are a few megabytes to copy, or if another thread takes page faults or performs IO concurrently (thus competing for the lock), "immediately" can be a surprisingly long time -- I've seen "immediate" times of 2-5 milliseconds. Which is no problem in most situations, but under the constraint of a 16.66ms frame time, for example, you probably don't want to risk blocking for 5ms at random times. The naive assumption of "I can do async IO from my render thread, no problem, because async doesn't block" is therefore flawed.

Damon
  • @Damon -- A beautiful answer, thanks! When you say, "the 'official' interest in supporting buffered aio is not terribly great" that seems a good strong tell that aio doesn't add very much, that no matter how badly the user program wants to do aio, the kernel largely ignores him and just does what it wants anyway. – Pete Wilson Apr 15 '11 at 09:23
  • The obvious problem with userland implementations is that the kernel has to do unqualified guesswork (it has no information about the intent), so a proper kernel implementation could certainly be very beneficial. On the other hand, a proper bottom-up implementation without compromises is certainly not trivial to do, and the reluctance probably comes from that fact. The amount of work/trouble probably doesn't seem to warrant it. Though in my opinion, even the existing solutions (built on top, and making little compromises) could add a lot to many applications. – Damon Apr 15 '11 at 09:52
  • The cleanest solution, in my opinion, would be to remove blocking IO altogether and make all IO asynchronous. The "usual" blocking IO would then need to be re-implemented in the library. – Damon Apr 15 '11 at 09:54
  • Which is how it should have been done -- perhaps was planned to be done -- in the first place, since, AFAICS, the only possible reason for synchronous input is to ensure that the user program knows the read index *on the physical medium*. – Pete Wilson Apr 15 '11 at 10:45
  • The kernel's AIO API did work with buffered input; it just blocked at io_submit. Admittedly, I didn't check the resulting buffer and possibly some of the return values. I guess the way to go for me will be to use glibc's POSIX API with the thread/callback notification and use eventfd to translate the notification into an epoll event. Thanks for the elaborate answer. – Marenz Apr 15 '11 at 11:02
  • Though I do wonder: if I do it myself with threads, I can set how big the stack allocated for each thread should be, and a thread doing just this one thing hardly needs any stack. I wonder if the glibc implementation also uses an optimized stack size for its threads. – Marenz Apr 15 '11 at 11:05
  • `__aio_enqueue_request` does not call `pthread_attr_setstacksize`, no. I would not worry about the stack size so much as about the actual spawning, though. I would prefer a small number of workers (maybe 3 or 4) which live for the entire duration of the program, pull requests from a queue one by one, and perform them using the standard blocking API (see the sketch after these comments). The tricky part is to have enough workers to keep the disk busy, but not so many as to cause excessive seeking. However, still... since the glibc implementation _works reasonably well_, I'd go with that unless there's a really urgent need. – Damon Apr 15 '11 at 14:13
  • @Damon -- "16.66ms frame time" meaning the minimum time a thread or process is going to run? Or do you mean something else? – Pete Wilson Apr 16 '11 at 15:14
16.66ms, or 1/60 second, refers to the frame time you usually assume between two buffer swaps when doing continuous animated graphics (though obviously 85Hz and 100Hz devices and others exist). It is an important figure because missing the frame time by as much as a nanosecond means your application blocks for a full frame if vsync is enabled, effectively turning 60 fps into 30 fps. This is an exemplary situation where "possibly 5ms" really matters: if you only have a fixed budget, you would not want to lose a third of it unexpectedly. – Damon Apr 16 '11 at 15:34
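
A sketch of the fixed worker-pool design mentioned in the comments above: a handful of long-lived threads pull requests from a bounded queue and serve them with plain blocking pread(). All names, sizes, and the callback shape are illustrative, not what glibc does internally:

```c
#include <pthread.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_WORKERS 4     /* "a small number of workers (maybe 3 or 4)" */
#define QUEUE_SLOTS 64

/* One queued IO request; the completion callback runs on a worker thread. */
struct req {
    int     fd;
    void   *buf;
    size_t  len;
    off_t   off;
    void  (*done)(struct req *r, ssize_t result);
};

static struct req *queue[QUEUE_SLOTS];
static int head, tail, count;
static pthread_mutex_t lock     = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  nonempty = PTHREAD_COND_INITIALIZER;
static pthread_cond_t  nonfull  = PTHREAD_COND_INITIALIZER;

/* Called by the application; blocks only when the queue is full. */
void submit(struct req *r)
{
    pthread_mutex_lock(&lock);
    while (count == QUEUE_SLOTS)
        pthread_cond_wait(&nonfull, &lock);
    queue[tail] = r;
    tail = (tail + 1) % QUEUE_SLOTS;
    count++;
    pthread_cond_signal(&nonempty);
    pthread_mutex_unlock(&lock);
}

/* Workers live for the entire program and serve requests one by one. */
static void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        pthread_mutex_lock(&lock);
        while (count == 0)
            pthread_cond_wait(&nonempty, &lock);
        struct req *r = queue[head];
        head = (head + 1) % QUEUE_SLOTS;
        count--;
        pthread_cond_signal(&nonfull);
        pthread_mutex_unlock(&lock);

        /* Plain blocking IO; the asynchrony comes from running it here. */
        ssize_t n = pread(r->fd, r->buf, r->len, r->off);
        r->done(r, n);
    }
    return NULL;
}

void start_pool(void)
{
    for (int i = 0; i < NUM_WORKERS; i++) {
        pthread_t t;
        pthread_create(&t, NULL, worker, NULL);
        pthread_detach(t);
    }
}
```

Bounding the queue gives explicit back-pressure in submit(), rather than the silent blocking the kernel AIO command buffer exhibits.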

The material seems old -- well, it is old -- because it has been around for a long time and, while by no means trivial, is well understood. A solution you can lift is published in W. Richard Stevens's superb and unparalleled book (read "bible"). The book is that rare treasure that is clear, concise, and complete: every page gives real and immediate value:

    Advanced Programming in the UNIX Environment

Two others, also by Stevens, are the first two volumes of his Unix Network Programming collection:

   Volume 1: The Sockets Networking API (with Fenner and Rudoff) and
   Volume 2: Interprocess Communications

I can't imagine being without these three fundamental books; I'm dumbstruck when I find someone who hasn't heard of them.

Still more of Stevens's books, just as precious:

   TCP/IP Illustrated, Vol. 1: The Protocols

Pete Wilson
  • We have Unix Network Programming by W. Richard Stevens, Bill Fenner, and Andrew M. Rudoff lying around here (and yes, I actually consulted it for this matter ;). Sounds like a good idea to get the ones you mentioned, too. – Marenz Apr 15 '11 at 10:57
  • @Marenz - Yes, that's the volume 1 I mentioned above. The biggie is the first one, Advanced Programming in the Unix Env. With that book in hand, your new programming life of ease, comfort, and satisfaction will surprise and delight you :-) – Pete Wilson Apr 15 '11 at 14:55

(2021) If your Linux kernel is new enough (at least 5.1, but newer kernels bring improvements), then io_uring will be "the most efficient way to do asynchronous file input/output"*. That applies to both buffered and direct I/O!

In the Kernel Recipes 2019 video "Faster IO through io_uring", io_uring author Jens Axboe demonstrates buffered I/O via io_uring finishing in almost half the time of synchronous buffered I/O. As @Marenz noted, unless you want to use userspace threads, io_uring is the only way to do buffered asynchronous I/O, because Linux AIO (aka libaio/io_submit()) doesn't have the ability to always do buffered asynchronous I/O...
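
For illustration, a minimal buffered read through io_uring with the liburing helper library (link with -luring). The file name is hypothetical and error handling is mostly omitted; io_uring_prep_readv is used because the readv opcode is available from 5.1:

```c
#include <fcntl.h>
#include <liburing.h>
#include <stdio.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    struct io_uring ring;
    io_uring_queue_init(8, &ring, 0);        /* 8-entry submission queue */

    int fd = open("data.bin", O_RDONLY);     /* note: no O_DIRECT required */
    char buf[4096];
    struct iovec iov = { .iov_base = buf, .iov_len = sizeof buf };

    struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
    io_uring_prep_readv(sqe, fd, &iov, 1, 0);
    io_uring_submit(&ring);                  /* queues the read without blocking on the IO */

    /* ... do other work here ... */

    struct io_uring_cqe *cqe;
    io_uring_wait_cqe(&ring, &cqe);          /* reap the completion */
    printf("read %d bytes\n", cqe->res);
    io_uring_cqe_seen(&ring, cqe);

    io_uring_queue_exit(&ring);
    close(fd);
    return 0;
}
```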

Additionally, in the article "Modern storage is plenty fast." Glauber Costa demonstrates how careful use of io_uring with asynchronous direct I/O can improve throughput compared to using io_uring for asynchronous buffered I/O on an Optane device. It required Glauber to have a userspace readahead implementation (without which buffered I/O was a clear winner) but the improvement was impressive.


* The context of this answer is clearly storage (after all, the word "buffered" was mentioned). For network I/O, io_uring has steadily improved in later kernels, to the extent that it can trade blows with the likes of epoll(); if that continues, it will one day be equal or better in all cases.

Anon

I don't think the Linux kernel implementation of asynchronous file I/O is really usable unless you also use O_DIRECT, sorry.
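
To make that constraint concrete, here is roughly what O_DIRECT imposes on the caller. The 4096-byte alignment and file name are illustrative; the real requirement depends on the device and filesystem:

```c
#define _GNU_SOURCE            /* for O_DIRECT */
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

int main(void)
{
    int fd = open("data.bin", O_RDONLY | O_DIRECT);

    /* Buffer, file offset, and transfer size must all be suitably
     * aligned -- typically to the logical block size of the device. */
    void *buf;
    posix_memalign(&buf, 4096, 4096);

    ssize_t n = pread(fd, buf, 4096, 0);   /* bypasses the page cache */

    free(buf);
    close(fd);
    return n < 0;
}
```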

There's more information about the current state of the world here: https://github.com/littledan/linux-aio. It was updated in 2012 by someone who used to work at Google.

cmccabe