3

I am playing with file reading/writing but have difficulty deciding how large to make my read buffer for the "read" system call.

In particular, I am looking at "http://pubs.opengroup.org/onlinepubs/009695399/functions/read.html"

It doesn't seem to state any restrictions on how many bytes I can read at once, other than SSIZE_MAX.

To make matters worse, if I make an array with SSIZE_MAX characters, the program yields:

sh: ./codec: Bad file number

Is there any reasonable way to decide how many bytes to read per read system call? My concern is that this may vary from system to system (I can't just make reads of increasing size until one fails to determine the exact number of bytes I can read, and even if I could, that wouldn't necessarily be any faster than reading fewer bytes).

One idea I had was to check my CPU cache size and try to make my buffer no larger than that, but since I don't know how CPU caches work, I am not sure if this is necessarily correct.

Thanks ahead of time.

Dmytro
  • "Bad file number" does sound related to the buffer size specified, unless something else is going on. Note that SSIZE_MAX is generally [*far too large* for a buffer](http://stackoverflow.com/questions/7349877/what-is-the-reading-limit-of-function-read-in-unistd-h). – user2864740 Feb 21 '16 at 21:23
  • @user2864740 Yeah. `SSIZE_MAX` is 2 gigabytes on a 32 bit system and some 8 exabytes on a 64 bit system. – fuz Feb 21 '16 at 21:25
  • The error code `EBADFD` (bad file number) probably comes from a different source; perhaps you forgot to check whether opening the file you want to read was actually successful. – fuz Feb 21 '16 at 21:26
  • @Dmitry: If you saw [this](http://i.stack.imgur.com/mzRt4.png) performance plot, what would you say a reasonable buffer size is? (Green is read, red is write.) – user541686 Feb 21 '16 at 22:38
  • I am not familiar enough with benchmarking to know what I am looking at. There are no labels. I wanted to know if there is a more systematic, or at least a naive way to determine a reasonable buffer size. I am a bit concerned this is so complicated. – Dmytro Feb 21 '16 at 22:52
  • @Dmitry: Sorry for the lack of labels. The vertical axis is the average speed and the horizontal axis is the buffer size used. If you saw a plot like that, what would you say a reasonable buffer size is? (I'm trying to suggest that this is complicated because there does not exist such a thing as an optimal block size, and that what is "reasonable" is also ultimately a subjective quantity and depends on your latency requirements and the application.) – user541686 Feb 21 '16 at 23:04
  • As a general rule of thumb, the larger the buffer, the higher the throughput (because the system knows more about the task and can therefore schedule the I/O better) but also the higher the latency. Sometimes, depending on the implementation, *very* large buffers can become slower due to other factors, but quite often, by the time you reach that point, your buffer size is already too high for your latency requirements (e.g. you can't provide UI updates to the user or estimate progress with reasonable smoothness in that case). – user541686 Feb 21 '16 at 23:09
  • Also, note that on SSDs, there is a *lot* of noise involved and the speed can vary **dramatically** due to wear leveling and whatnot (even by a factor of 10) so that can make benchmarking difficult too. Anyway, there is no single number the system can give you. If your work is I/O-bound, you might want to implement a controller (this needs a bit of math and statistics to get working well) that asymptotically learns the optimal block size. If your work is CPU-bound, consider memory-mapped I/O to abstract away I/O. If it's in between, just hard-code a reasonable block size like 64K. – user541686 Feb 21 '16 at 23:18
  • @Dmitry: Another way to see why this is so complicated is to realize that different portions of the file may not even be located on the same storage. For example, on RAID (and LVM?) systems, some parts may be scattered across disks, and reading different portions may take different amounts of time due to parallelization or individual disk activity. Or, on a virtual disk (e.g. an incremental VHD), different portions of the file can be on entirely unrelated storage media. So that's another level of complication (that people typically ignore) that illustrates why 1 number isn't sufficient. – user541686 Feb 21 '16 at 23:21

3 Answers

5

I've pondered basically the same question, and I've come to a very simple conclusion:

Use a conservative default or heuristic, but let the user override it easily if they want.

You see, in some cases the user might not want maximum throughput from your utility, but would rather have it do whatever it does in the background. Perhaps the task is just not that important. Personally, on Linux, I often use the nice and ionice utilities to put long-but-not-priority tasks on the back burner, so to speak, so that they don't interfere with my actual work.

Benchmarks within the last decade indicate that block sizes of 128k to 2M (2^17 to 2^21 bytes) consistently work well -- not far from optimal rates in almost all situations -- with the average slowly shifting towards the larger end of that range. Typically, power-of-two sizes seem to work better than non-powers-of-two, although I haven't seen enough benchmarks of various RAID configurations to trust that fully.

Because your utility will almost certainly be recompiled for each new hardware type/generation, I'd prefer to have a default block size, defined at compile time, but have it trivially overridden at run time (via a command-line option, environment variable, and/or configuration file).
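
A minimal sketch of that pattern (the `DEFAULT_BLOCK_SIZE` macro and the `MYTOOL_BLOCK_SIZE` environment variable are made-up names for this example, not anything standard):

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical compile-time default; override at build time with
   e.g. -DDEFAULT_BLOCK_SIZE=262144 for a specific hardware target. */
#ifndef DEFAULT_BLOCK_SIZE
#define DEFAULT_BLOCK_SIZE 131072  /* 128 KiB */
#endif

/* Pick the block size: the hypothetical MYTOOL_BLOCK_SIZE environment
   variable overrides the compiled-in default. */
static size_t block_size(void)
{
    const char *env = getenv("MYTOOL_BLOCK_SIZE");
    if (env != NULL) {
        char *end = NULL;
        unsigned long value = strtoul(env, &end, 10);
        if (end != env && *end == '\0' && value > 0)
            return (size_t)value;
        fprintf(stderr, "Ignoring invalid MYTOOL_BLOCK_SIZE '%s'\n", env);
    }
    return DEFAULT_BLOCK_SIZE;
}
```

Building with `-DDEFAULT_BLOCK_SIZE=262144`, or running with `MYTOOL_BLOCK_SIZE=1048576`, would then change the effective block size without touching the source.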

If your utility is packaged for current POSIXy OSes, the binaries could use a default that best suits the types of tasks typically done on that machine; for example, Raspberry Pis and other SBCs often don't have that much memory to start with, so a smaller default block size (say, 65536 bytes) might work best. Desktop users might not care about memory hogs, so you might use a much larger default block size on current desktop machines.

(On servers, and in high-performance computing (which is where I've pondered this), the block size is basically either benchmarked on the exact hardware and workload, or it is just a barely-informed guess. Typically the latter.)

Alternatively, you could construct a heuristic based on the st_blksizes of the files involved, perhaps multiplied by a default factor, and clamped to some preferred range. However, such heuristics tend to bit-rot fast, as hardware changes.
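
A sketch of what such a heuristic might look like, using an arbitrary multiplier of 16 and clamping to a 64 KiB to 2 MiB range (all example values, not recommendations):

```c
#include <stddef.h>
#include <sys/stat.h>

/* Sketch of the heuristic described above: scale st_blksize by a
   factor and clamp the result to a preferred range.  The multiplier
   and the limits are arbitrary example values. */
static size_t guess_buffer_size(int fd)
{
    const size_t minimum = 64 * 1024;        /* 64 KiB floor  */
    const size_t maximum = 2 * 1024 * 1024;  /* 2 MiB ceiling */
    struct stat st;
    size_t size = minimum;

    if (fstat(fd, &st) == 0 && st.st_blksize > 0)
        size = (size_t)st.st_blksize * 16;   /* example multiplier */

    if (size < minimum)
        size = minimum;
    if (size > maximum)
        size = maximum;
    return size;
}
```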

With heuristics, it is important to remember that the idea is not to always achieve the optimum, but to avoid really poor results. If a user wants to squeeze out the last few percent of performance, they can do some benchmarking within their own workflow, and tune the defaults accordingly. (I personally have, and do.)

Nominal Animal
  • Just one caveat: it doesn't even take bit-rot to make `st_blksize` useless. I've seen way too many instances where someone builds something like a 28-disk RAID-5 array on a disk storage system, with a 1MB per-disk segment size (bigger **must** be better!) leading to a 27MB RAID-5 stripe size. Then they build a file system with a 4kB blocksize on top of it. – Andrew Henle Feb 22 '16 at 10:23
  • @AndrewHenle: Exactly! I did pretty much that myself, years and years ago, when I started learning how to setup computing clusters. Knowing what kind of rates I should be able to get made me scratch my head a bit at first, but eventually I learned why I didn't achieve it, and how to do it right (uh, better). I've used (Linux software) RAID-0 and RAID-1 on desktop for years, but I still need quite a bit of time to benchmark and test larger RAID-5 arrays before I'd be happy with an installation: so many variables (hardware, settings, filesystems, varying needs). – Nominal Animal Feb 22 '16 at 11:52
0

Well, determining the proper buffer size is completely problem dependent. First, let's look at the state of the art in buffer size selection: stdio uses `BUFSIZ` as its buffer size (which was historically the size of one Unix disk block, once fixed at 512, and is probably now somewhere between 1024 and 4096 -- from a disk block size up to a virtual page size). This is far too low for the quantities being moved around here, but it is a well-considered and acceptable value.
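
For reference, a trivial program shows what `BUFSIZ` happens to be on a given platform:

```c
#include <stdio.h>

int main(void)
{
    /* BUFSIZ is the default stdio buffer size on this implementation;
       its value varies between platforms. */
    printf("BUFSIZ = %d bytes\n", BUFSIZ);
    return 0;
}
```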

On the other hand, consider an embedded system with only 8 KiB of memory trying to use a one-megabyte buffer. It sounds somewhat strange to be using virtual memory for buffer storage (if that is even possible).

Suppose you are designing a file copy utility, where picking the best buffer size seems essential. You might think the largest admissible value is a must, but after some testing you will just end up with a lot of misused memory. Suppose your process uses a single thread that acts as both reader and writer: it reads a chunk of data, then writes that data to another file. The first thing you notice is that the amount of memory you use doesn't matter much, as it only affects the ordering of the reads and writes issued by your process. If one read means one disk read (say, one block at a time), then anything above your disk's block size will not save you extra reads for the same data (and this is actually handled at the system level, which buffers your data and makes it feasible to read data in single-byte chunks while the system reads it block by block).
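
For concreteness, a minimal sketch of the single-threaded read/write copy loop described above might look like this (`infd`, `outfd`, and the buffer are assumed to be set up by the caller, with whatever buffer size was chosen):

```c
#include <errno.h>
#include <unistd.h>

/* Minimal single-threaded copy loop: read a chunk, then write it out.
   infd and outfd are already-open file descriptors; buf/bufsize is
   whatever buffer was chosen.  Returns 0 on success, -1 on error. */
static int copy_fd(int infd, int outfd, char *buf, size_t bufsize)
{
    for (;;) {
        ssize_t n = read(infd, buf, bufsize);
        if (n == 0)
            return 0;                 /* end of file */
        if (n < 0) {
            if (errno == EINTR)
                continue;             /* interrupted, retry the read */
            return -1;                /* read error */
        }
        for (ssize_t off = 0; off < n; ) {
            ssize_t w = write(outfd, buf + off, (size_t)(n - off));
            if (w < 0) {
                if (errno == EINTR)
                    continue;         /* interrupted, retry the write */
                return -1;            /* write error */
            }
            off += w;
        }
    }
}
```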

The second approach is to minimize system calls. Everybody knows by now that making a system call is expensive, so if we can arrange to make as few as possible, we should gain something. But after a while you will find that there is no extra performance to be had there, since the system is reading your data block by block anyway and your process is waiting for it, making the system call penalty nearly invisible -- it represents less than 1% of the wait time. Also, the system has to guarantee that your data doesn't change in the meantime by making read calls atomic (this is done by locking the file's inode for the duration of the call), so no other process can actually touch the same file until your call finishes. A very large buffer can also make your process too large to fit in memory and load your system with extra swapping, so be careful before going to large buffers.

Lastly, the extra system call penalty is so small compared with the penalties that arise from using large buffers that avoiding it is of no use at all. On a normal-sized system (say, a laptop or desktop computer with RAM on the order of 2-8 GB), a buffer size of 8 KiB will probably be fine for all scenarios.

One final thing to consider is buffer sizing for audio/video streaming. In that case there is normally a producer (the reader of the data to be played back) that produces data at a varying rate (varying over time with network load, for example) and a consumer that eats the data at a fixed rate (say, 8 kbps for a phone call, 192 kbps for CD playback, etc.). The buffer size must compensate for changes in the data delivery rate so that the buffer never empties, because otherwise there will be a period of silence filling the gap. This is dynamic in nature, and you will have to account for possible network packet loss and retransmission delays. Delaying the consumer while filling some of the buffer with streaming data lets you compensate and keep the consumer happy with no data loss. In this case, a 2 MB streaming buffer is common sense in some scenarios, depending on the probability of data loss, the retransmission time, and the streaming quality you want to afford.

Luis Colorado
-1

Call stat() or fstat() on the file you want to read. The struct stat member st_blksize contains the optimal buffer size you should use for reading from the file you called stat() on.
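
A minimal sketch of what this answer suggests -- querying `st_blksize` via `fstat()` for an already-open descriptor:

```c
#include <stdio.h>
#include <sys/stat.h>

/* Query the preferred I/O block size reported for an open file
   descriptor.  Returns -1 if fstat() fails. */
static long preferred_io_size(int fd)
{
    struct stat st;

    if (fstat(fd, &st) != 0) {
        perror("fstat");
        return -1;
    }
    return (long)st.st_blksize;
}
```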

fuz
  • -1 Uh that's the optimal block size, not the optimal buffer size. It's meant to minimize read-modify-write, not to maximize performance. The more data you want to process the larger your buffer should be (and really, there is no such thing as *the* optimal buffer size). – user541686 Feb 21 '16 at 21:30
  • I'll correct my question to "reasonable buffer size". – Dmytro Feb 21 '16 at 21:31
  • @Dmitry: Okay, but note, I think a reasonable buffer size is much larger than `st_blksize`... e.g. if `st_blksize` is 64K then a reasonable buffer size might be 1M. – user541686 Feb 21 '16 at 21:31
  • @Mehrdad OP asked for what buffer size is correct for the `read()` system call (cf. the first sentence of his question). `st_blksize` is literally [documented](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/sys_stat.h.html) as *“A file system-specific preferred I/O block size for this object. In some file system types, this may vary from file to file.”* Why do you think this is not the right answer? – fuz Feb 21 '16 at 21:32
  • @FUZxxl: Because if you read [this documentation](http://man7.org/linux/man-pages/man2/stat.2.html) it says *"The `st_blksize` field gives the "preferred" **block** size for efficient filesystem I/O. (Writing to a file in smaller chunks may cause an inefficient **read-modify-rewrite**.)"*, i.e. (1) it's meant to avoid wearing down the drive, not to maximize your throughput, and (2) it's a block size, which means your buffer size should merely be a *multiple* of it. – user541686 Feb 21 '16 at 21:33
  • @Mehrdad I don't follow your argument. The member `st_blksize` has always been used with the semantic of “make your IO buffers that large” and that's the intended usage. It's also what almost all stdio implementations do. (Source: personal conversation with a POSIX committee member) There is in fact an upper reasonable bound on IO block size as larger transfers might hit DMA limits and are thus inefficient. On x86, this is typically 128 kB, which is incidentally what FreeBSD returns on my system. – fuz Feb 21 '16 at 21:35
  • @FUZxxl: It should make more sense if you consider that "the optimal buffer size" is not only rather subjective (latency vs. throughput has no global optimum; it's a tradeoff), but it also depends on how much data you're going to process. – user541686 Feb 21 '16 at 21:35
  • @FUZxxl: Well then are you saying sizes larger than `st_blksize` won't be much faster? – user541686 Feb 21 '16 at 21:36
  • @Mehrdad In fact, they might be slower as a read larger than `st_blksize` may be split into multiple DMA transfers instead of being processed at once. If other processes attempt to read at the same time, this might lead to your process being queued for the next partial transfer. If you chose a smaller blocksize (i.e. what the system recommends you), you could process the data you already got while the system lets another process read. – fuz Feb 21 '16 at 21:38
  • @Mehrdad It's hard for me to wrap my head around your argument. The operating system literally gives you a recommendation for the buffer size you should use and you downvote me for recommending to use that recommendation? – fuz Feb 21 '16 at 21:40
  • @FUZxxl: Regarding the downvote, it's because the field is very clearly talking about the **block** size, not **buffer** size. (The `blk` in `st_blksize` does not stand for "buffer".) Do you not see the difference? Regarding the actual argument, have you not considered that on hardware nowadays reading more data than the optimal block size can be faster because (1) more data can be read in parallel, and (2) because every read has overhead (like a system call, among other things)? – user541686 Feb 21 '16 at 21:42
  • @Mehrdad The member name `st_blksize` is slightly misleading. It used to be the size of one sector, but this is only the case on historical systems. As documented in POSIX (I linked and cited the documentation above), it's the preferred IO block size for the object, i.e. the preferred size of data blocks you should transfer between your process and the system by means of `read()` and `write()`. Try it out on your own! For most file systems on Linux, `st_blksize` is 8kB, on other systems, like FreeBSD, the value can be larger, e.g. 128kB. – fuz Feb 21 '16 at 21:45
  • @FUZxxl: What doesn't even make sense to me is how could the value be a constant on an OS the way you're suggesting? Shouldn't it depend on the actual hardware? – user541686 Feb 21 '16 at 21:48
  • @Mehrdad To actually get the block size of the file system, you could use `statvfs`, which nicely illustrates the difference between the file system block size (member `f_frsize`) and the preferred IO request size (member `f_bsize`); see the sketch after this comment thread. The latter is a general suggestion for the file system you call `statvfs` on, but individual files might have different optimums as indicated in the result of the respective `stat()` call. – fuz Feb 21 '16 at 21:48
  • @Mehrdad It's not necessarily a constant, which is why it's part of `struct stat`, so it can be different for each file. Typically, the file system returns the same value for each file depending on how it was created, but some file systems return different values for different files. It also depends on the hardware (see my remark about DMA transfer sizes above) and the operating system might pick an optimal value depending on hardware. – fuz Feb 21 '16 at 21:50
  • @FUZxxl: "It's not necessarily a constant"... well pedantically it's not if you don't define it to be a constant, but what I'm asking you is, do you **actually** get different for different drives on your system? Because on my Ubuntu 15.04, I get back 4096 both on my fast SSD and on my crappy slow flash drive. Are you seriously going to tell me it's the optimal block size for both? (And yes, the files I tried them on were huge.) – user541686 Feb 21 '16 at 21:57
  • @Mehrdad If that's what your system suggests, it's probably what your operating system believes to be optimal. – fuz Feb 21 '16 at 22:03
  • @FUZxxl: Yes, and you're blissfully ignoring the fact that it's nonsense. Why don't you tell me what ***your*** system tells you for your slowest drive and your fastest drive, and then tell me how fast those drives actually are? Go ahead, I'll wait. – user541686 Feb 21 '16 at 22:04
  • @Mehrdad It tells me to use 128k for both hard disks I have in my computer. My Linux systems like 4k buffers and honestly, that's okay. How about we do some benchmarks to demonstrate the time difference? – fuz Feb 21 '16 at 22:08
  • @FUZxxl: Sure, if you can write a benchmark to support your answer, I'll run it and tell you what I get. I'm not going to spend the time to write it though. – user541686 Feb 21 '16 at 22:09
  • @Mehrdad You are the one claiming that the POSIX standard lies. How about you back up your claims? I accept if you can demonstrate that yes, there is a suggestion about the ideal IO block size but you should disregard it and use xyz instead, but I'm a bit pissed that you continue to claim that `st_blksize` means something different that what it does. – fuz Feb 21 '16 at 22:12
  • @FUZxxl: A few notes: (1) *you're* the one claiming the definition is wrong; I'm reading it *quite literally* as the *block* size, so I think the standard is closer to the truth than you seem to think. (2) Standards don't lie; implementations do. (3) You're the one who wrote the answer and needs to support it, not me. I already gave you what I deem was sufficient proof that your claim is nonsense, which was the fact that I cannot see how a 200 MB/s SSD and a 5 MB/s flash drive could possibly both have an optimal buffer size of 4K. If you don't think that's nonsense then go ahead and prove it. – user541686 Feb 21 '16 at 22:20
  • @Mehrdad It *is* a blocksize. The recommended block size for IO transfers. An IO transfer is a read or write system call and a block you transfer is the size of the buffer you give to the system call. The Linux man page is a bit misleading in this regard, but the POSIX page I linked and cited above is fairly clear. – fuz Feb 21 '16 at 22:26
  • When you write a benchmark I'll take you seriously. – user541686 Feb 21 '16 at 22:28
  • @Mehrdad How about you write an answer with your suggestions to be actually helpful to OP? Right now you have only reiterated that I'm wrong without actually presenting what you believe is right. – fuz Feb 21 '16 at 22:29
  • @FUZxxl: When I get the chance I'll do it, don't worry. And I'm sure you will write your benchmark too, right? In the meantime, just for your viewing pleasure, [**here**](http://i.stack.imgur.com/mzRt4.png) is what [FlashBench](http://usbflashspeed.com/static/FlashBench.zip) (a Windows benchmarking program) thinks about the same SSD whose optimal buffer size Linux believes to be 4K. – user541686 Feb 21 '16 at 22:32
  • @Mehrdad Perhaps you should write a bug report against the Linux kernel for returning bogus values in `st_blksize`. – fuz Feb 21 '16 at 22:40
  • @FUZxxl: Perhaps you should stop volunteering me for doing things that you care about and start volunteering yourself instead. – user541686 Feb 21 '16 at 22:42
  • @Mehrdad I have said everything I need to say. I have provided authoritative sources. I have explained the meaning of the sources. I have asked a member of the POSIX committee about the intent of `st_blksize` (he reaffirmed my interpretation). I don't think there is anything more I need to do. – fuz Feb 21 '16 at 22:48
  • Great, I think the discussion is over then. – user541686 Feb 21 '16 at 22:50
  • @Mehrdad, if you downvote, let's upvote FUZxxl then, as he is completely right in what he says (he'll gain more rep than your downvote takes away). You haven't justified your push towards larger buffer sizes. Just say what you expect to gain from a bigger buffer once both the producer and the consumer are blocked in read/write calls, with no actual processing at all, but more swap space wasted on the system. The OP is asking for a good/best buffer size, not for the largest. It seems you simply think that larger is better; if you do, please justify why. – Luis Colorado Feb 24 '16 at 06:59
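
For completeness, the `statvfs()` call mentioned in the comment thread above can be queried like this (a minimal sketch that prints both the fundamental block size and the preferred I/O request size for the file system containing a path):

```c
#include <stdio.h>
#include <sys/statvfs.h>

/* Print the fundamental block size (f_frsize) and the preferred I/O
   request size (f_bsize) for the file system containing the given
   path (defaults to the current directory). */
int main(int argc, char *argv[])
{
    struct statvfs vfs;
    const char *path = (argc > 1) ? argv[1] : ".";

    if (statvfs(path, &vfs) != 0) {
        perror("statvfs");
        return 1;
    }
    printf("f_frsize (fundamental block size): %lu\n",
           (unsigned long)vfs.f_frsize);
    printf("f_bsize  (preferred I/O size):     %lu\n",
           (unsigned long)vfs.f_bsize);
    return 0;
}
```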