Between read() and pread(), which is more efficient?

Question

The following are the declarations of read and pread:

```c
#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
ssize_t pread(int fd, void *buf, size_t count, off_t offset);
```

We all know that they have almost the same functionality, but which one is more efficient?

Added use cases: 1. Scanning a large file. 2. Random reads from one large file.

Well, one function seeks to a specific offset before reading, while the other does not. So you have to decide if using e.g. `lseek` followed by `read` is better or not than using only `pread`. One way to decide is to *benchmark* for your specific use-case. In fact, I would say it's the *only* way to decide, and that this question is unanswerable because we don't know your use-case. — Some programmer dude, Dec 13 '13 at 07:15
Really, you worry about that? The difference, if there is any, is maybe 5-6 clock cycles. A disk access is _25-30 million_ clock cycles... — Damon, Dec 13 '13 at 07:58
If it is a "fill-in-the-bubbles" for your exam to answer brainlessly within 3 seconds i would answer : 1 (Scan large file): read, 2 (Random read) : pread. — philippe lhardy, Dec 15 '13 at 21:57
@Damon - most of the time neither `read` nor `pread` will be accessing the disk, but rather the page cache which is a fully in-memory operation. — BeeOnRope, Jan 30 '17 at 01:42
@BeeOnRope: No. That's outright wrong. Unless of course "large file" means less than 128kB to you. — Damon, Jan 31 '17 at 14:16
@Damon - I'm not following. Caching is highly effective and the page cache is no different. Hit rates of 90% or even 99% are common. Of course there are workloads that are different (lots of writes) - but on most hosts, for most people, the large majority of `read` calls complete without touching the disk (and, mostly, without even talking to the file system). — BeeOnRope, Jan 31 '17 at 14:56
@BeeOnRope: That's what theory says, but theory is a liar. Hit rates upwards of 90% are common for working sets that (a) fit into RAM and (b) have reasonably good locality. OP talks about (1) scanning and (2) reading random locations from one large file, which is neither (a) nor (b). The wording "large file" pretty much rules out caching, since a file that is small enough to be mapped and completely backed by RAM isn't a large file. With a dataset that doesn't fit RAM, readahead is your only companion, but contrary to popular belief, readahead is pretty dumb (no magic!). — Damon, Jan 31 '17 at 15:45
For "scan", readahead will help, but you will still be making 128kB requests (256kB if you tell readahead to be aggressive). For random access, you're just out alone in the dark. You can hint the prefetcher to your random access pattern, but then it will simply not prefetch anything at all. Which isn't really much better. — Damon, Jan 31 '17 at 15:50
Fair enough. I was referring to the general case - we don't know how large the OP's files are, after all (what's large for one person is small to another). With RAM sizes of 10s and 100s of GBs, even "large files" often end up in the page cache these days. I know that when I benchmark things (like a huge build) that would be assumed to be IO bound, I often find _exactly 0_ bytes of read IO because everything is in the page cache. @Damon — BeeOnRope, Jan 31 '17 at 17:27
Also, to clarify, neither scanning nor random access rule out fully cached access. The question is how did those files get there in the first place? Were they written by some other process? Were they copied from another device or did they come in over the network? In any of those cases the file could be fully cached. — BeeOnRope, Jan 31 '17 at 17:31
Possible duplicate of [What is the difference between read and pread in unix?](https://stackoverflow.com/questions/1687275/what-is-the-difference-between-read-and-pread-in-unix) — Ciro Santilli OurBigBook.com, Jul 15 '17 at 10:26

score 9 · Accepted Answer · answered Dec 13 '13 at 08:26

Depends on how you define "efficient". If you're asking about performance the difference is microscopic. So use the one that solves the problem for you. In many cases pread is the only option when you're dealing with threads reading from a database or such. In other cases read is the only sensible option. And the question is a little bit unfair since pread does more than read. A fair comparison would be lseek + read which will definitely be slower than just pread.

Let's look at the differences in implementation of both in an operating system source I had available. I cut out the exact same code from both functions to highlight the differences. There's much more code than this.

This is part of pread:

```c
vp = (struct vnode *)fp->f_data;
if (fp->f_type != DTYPE_VNODE || vp->v_type == VFIFO ||
        (vp->v_flag & VISTTY)) {
    return (ESPIPE);
}

offset = SCARG(uap, offset);
if (offset < 0 && vp->v_type != VCHR)
    return (EINVAL);

return (dofilereadv(p, fd, fp, &iov, 1, 0, &offset, retval));
```

This is the equivalent part of read:

```c
return (dofilereadv(p, fd, fp, &iov, 1, 0, &fp->f_offset, retval));
```

So, pread does some extra checks to make sure that we're not trying to seek on a pipe, fifo, tty, etc. (it returns ESPIPE for those). It also rejects a negative offset with EINVAL unless the file is a character device. Finally, it passes its own local offset to the common read routine, so it doesn't update the offset stored in the file pointer (fp in the code).
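The ESPIPE check can be observed from user space. The following is a minimal sketch, not taken from the kernel source above; the helper name `pread_errno_on_pipe` is made up for illustration, and it assumes a POSIX system where `pread` on a pipe fails with ESPIPE.

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare pread() */
#include <errno.h>
#include <unistd.h>

/* Hypothetical helper: calls pread() on the read end of a fresh pipe.
 * Returns the errno value set by pread(), 0 if pread() unexpectedly
 * succeeded, or -1 if the pipe could not be created. */
int pread_errno_on_pipe(void)
{
    int fds[2];
    char buf[1];
    int err = 0;

    if (pipe(fds) != 0)
        return -1;

    errno = 0;
    if (pread(fds[0], buf, sizeof buf, 0) == -1)
        err = errno;        /* a pipe has no offset to read at */

    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A plain `read` on the same descriptor would not fail this way; it would simply wait for data to arrive on the pipe.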

score 3 · Answer 2 · answered Dec 13 '13 at 07:16

It depends on the operation that you are going to do with read() and pread()

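For example: if you call read() twice on a file, the second call continues where the first one stopped, because read() advances the file offset. pread() reads from the offset you pass in and leaves the file offset unchanged, so two pread() calls with the same offset return the same data.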
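A small sketch of that behaviour, assuming a POSIX system with a writable /tmp. The helper names `open_demo_file` and `offset_after_two_calls` are hypothetical and exist only for this illustration.

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare pread() and mkstemp() */
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: creates an anonymous scratch file containing
 * "abcdef" and returns its descriptor positioned at offset 0, or -1. */
static int open_demo_file(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return -1;
    unlink(path);           /* the file disappears once fd is closed */
    if (write(fd, "abcdef", 6) != 6 || lseek(fd, 0, SEEK_SET) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Performs two 3-byte reads, with read() or with pread() at offset 0,
 * and returns the descriptor's file offset afterwards (-1 on error). */
long offset_after_two_calls(int use_pread)
{
    char buf[3];
    int fd = open_demo_file();
    if (fd == -1)
        return -1;

    for (int i = 0; i < 2; i++) {
        ssize_t n = use_pread ? pread(fd, buf, sizeof buf, 0)
                              : read(fd, buf, sizeof buf);
        if (n != (ssize_t)sizeof buf) {
            close(fd);
            return -1;
        }
    }

    off_t pos = lseek(fd, 0, SEEK_CUR);   /* query the current offset */
    close(fd);
    return (long)pos;
}
```

With `use_pread` set to 0 the two calls consume "abc" and then "def", leaving the offset at 6. With `use_pread` set to 1 both calls return "abc" and the offset stays at 0.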

score 1 · Answer 3 · answered Dec 13 '13 at 07:45

The main advantage of pread is that in a multithreaded program you don't need to serialize I/O in order to prevent another thread from changing the file offset.

Other than that, the difference is most likely in the noise (but measure it for your use case if you're really interested!). As your question is tagged "linux", IIRC in the Linux kernel the lower level I/O reading function has a pread-like interface. For the read() syscall, it looks up the offset from the file table and then calls the lower level reading function with that offset, whereas the pread() syscall uses the offset provided by the caller directly. So I'd guess that pread() would be slightly more efficient than lseek() + read() (one syscall less being the main advantage), but probably not anything worth worrying about in most cases since syscalls are relatively fast on Linux.
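As a rough sketch of the multithreaded case, the following shares one descriptor between several POSIX threads without any locking, each thread reading its own slice with `pread`. The names `read_chunk`, `parallel_pread_ok`, the file contents, and the chunk sizes are all assumptions made for this example.

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare pread() and mkstemp() */
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

enum { NTHREADS = 4, CHUNK = 4 };
static const char CONTENT[] = "AAAABBBBCCCCDDDD";

struct job { int fd; int index; int ok; };

/* Thread body: read this thread's slice at an explicit offset.
 * No lock is needed because pread() never touches the shared offset. */
static void *read_chunk(void *arg)
{
    struct job *j = arg;
    char buf[CHUNK];
    ssize_t n = pread(j->fd, buf, CHUNK, (off_t)j->index * CHUNK);
    j->ok = n == CHUNK &&
            memcmp(buf, CONTENT + j->index * CHUNK, CHUNK) == 0;
    return NULL;
}

/* Returns 1 if every thread read exactly its own slice, 0 otherwise. */
int parallel_pread_ok(void)
{
    char path[] = "/tmp/pread_threads_XXXXXX";
    int fd = mkstemp(path);
    if (fd == -1)
        return 0;
    unlink(path);
    if (write(fd, CONTENT, NTHREADS * CHUNK) != NTHREADS * CHUNK) {
        close(fd);
        return 0;
    }

    pthread_t tids[NTHREADS];
    struct job jobs[NTHREADS];
    int started = 0, ok = 1;

    for (int i = 0; i < NTHREADS; i++) {
        jobs[i] = (struct job){ fd, i, 0 };
        if (pthread_create(&tids[i], NULL, read_chunk, &jobs[i]) != 0)
            break;
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        ok = ok && jobs[i].ok;
    }
    close(fd);
    return ok && started == NTHREADS;
}
```

Doing the same with `lseek` followed by `read` would need a mutex around each pair of calls, since another thread could move the shared offset between them. On older toolchains this may need to be built with `-pthread`.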

score 0 · Answer 4 · answered Dec 13 '13 at 07:29

It's not really a matter of efficiency but of functionality.

If I want to read a file from position x onwards, I would use lseek (if needed) followed by as many read calls as I need.

If I want to read from random places in the file and do a lot of jumping back and forth, then I would use pread instead of lseek + read.

One would expect pread to be slightly less efficient than read but slightly more efficient than lseek + read. The difference is probably not larger than a couple of clock cycles.
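To make the comparison concrete, here is a sketch of the two ways to read at an arbitrary position. The wrapper names `read_at_lseek`, `read_at_pread`, and `helpers_agree` are hypothetical; the sketch says nothing about timing, which would have to be measured for the workload in question.

```c
#define _XOPEN_SOURCE 700   /* ask the headers to declare pread() and mkstemp() */
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Two system calls: move the descriptor's offset, then read from it. */
ssize_t read_at_lseek(int fd, void *buf, size_t count, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, count);
}

/* One system call: the offset travels with the request. */
ssize_t read_at_pread(int fd, void *buf, size_t count, off_t offset)
{
    return pread(fd, buf, count, offset);
}

/* Hypothetical check: both wrappers return the same bytes ("def")
 * from offset 3 of a scratch file containing "abcdef". */
int helpers_agree(void)
{
    char path[] = "/tmp/read_at_XXXXXX";
    char a[3], b[3];
    int fd = mkstemp(path);
    if (fd == -1)
        return 0;
    unlink(path);

    int ok = write(fd, "abcdef", 6) == 6
          && read_at_lseek(fd, a, sizeof a, 3) == 3
          && read_at_pread(fd, b, sizeof b, 3) == 3
          && memcmp(a, "def", 3) == 0
          && memcmp(a, b, 3) == 0;
    close(fd);
    return ok;
}
```

For a sequential scan neither wrapper is needed: repeated `read` calls already continue from where the previous one stopped.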

And as @janneb pointed out in another answer, if the file is shared between threads and the reads are unserialized, you'll prefer `pread`. — rodrigo, Dec 13 '13 at 07:49
