5

The following are the declarations of read and pread:

#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);
ssize_t pread(int fd, void *buf, size_t count, off_t offset); 

We all know that they have almost the same functionality, but which one is more efficient?

To add the use cases: 1. Scanning a large file sequentially. 2. Reading random locations in one large file.

Charles
  • 175
  • 1
  • 8
  • it appears read twice, instead of pread proto – ShinTakezou Dec 13 '13 at 07:12
  • 3
    Well, one function seeks to a specific offset before reading, while the other does not. So you have to decide if using e.g. `lseek` followed by `read` is better or not than using only `pread`. One way to decide is to *benchmark* for your specific use-case. In fact, I would say it's the *only* way to decide, and that this question is unanswerable because we don't know your use-case. – Some programmer dude Dec 13 '13 at 07:15
  • 4
    Really, you worry about that? The difference, if there is any, is maybe 5-6 clock cycles. A disk access is _25-30 million_ clock cycles... – Damon Dec 13 '13 at 07:58
  • If this is a "fill-in-the-bubbles" exam question to answer brainlessly within 3 seconds, I would answer: 1 (scan large file): read; 2 (random read): pread. – philippe lhardy Dec 15 '13 at 21:57
  • @Damon - most of the time neither `read` nor `pread` will be accessing the disk, but rather the page cache which is a fully in-memory operation. – BeeOnRope Jan 30 '17 at 01:42
  • @BeeOnRope: No. That's outright wrong. Unless of course "large file" means less than 128kB to you. – Damon Jan 31 '17 at 14:16
  • @Damon - I'm not following. Caching is highly effective and the page cache is no different. Hit rates of 90% or even 99% are common. Of course there are workloads that are different (lots of writes) - but on most hosts, for most people, the large majority of `read` calls complete without touching the disk (and, mostly, without even talking to the file system). – BeeOnRope Jan 31 '17 at 14:56
  • @BeeOnRope: That's what theory says, but theory is a liar. Hit rates upwards of 90% are common for working sets that (a) fit into RAM and (b) have reasonably good locality. OP talks about (1) scanning and (2) reading random locations from one large file, which is neither (a) nor (b). The wording "large file" pretty much rules out caching, since a file small enough that it could in principle be mapped and completely backed by RAM isn't a large file. With a dataset that doesn't fit in RAM, readahead is your only companion, but contrary to popular belief, readahead is pretty dumb (no magic!). – Damon Jan 31 '17 at 15:45
  • For "scan", readahead will help, but you will still be making 128kB requests (256kB if you tell readahead to be aggressive). For random access, you're just out alone in the dark. You can hint the prefetcher to your random access pattern, but then it will simply not prefetch anything at all. Which isn't really much better. – Damon Jan 31 '17 at 15:50
  • Fair enough. I was referring to the general case - we don't know how large the OP's files are, after all (what's large for one person is small to another). With RAM sizes of 10s and 100s of GBs, even "large files" often end up in the page cache these days. I know that when I benchmark things (like a huge build) that would be assumed to be IO bound, I often find _exactly 0_ bytes of read IO because everything is in the page cache. @Damon – BeeOnRope Jan 31 '17 at 17:27
  • Also, to clarify, neither scanning nor random access rules out fully cached access. The question is how those files got there in the first place. Were they written by some other process? Were they copied from another device or did they come in over the network? In any of those cases the file could be fully cached. – BeeOnRope Jan 31 '17 at 17:31
  • Possible duplicate of [What is the difference between read and pread in unix?](https://stackoverflow.com/questions/1687275/what-is-the-difference-between-read-and-pread-in-unix) – Ciro Santilli OurBigBook.com Jul 15 '17 at 10:26

4 Answers

9

Depends on how you define "efficient". If you're asking about performance the difference is microscopic. So use the one that solves the problem for you. In many cases pread is the only option when you're dealing with threads reading from a database or such. In other cases read is the only sensible option. And the question is a little bit unfair since pread does more than read. A fair comparison would be lseek + read which will definitely be slower than just pread.
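To make the "lseek + read versus pread" comparison concrete, here is a minimal sketch. The helper names `read_at_with_lseek` and `read_at_with_pread` are made up for illustration and are not part of any library; only `lseek`, `read` and `pread` are real calls.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: positional read emulated with two system calls.
   Unlike pread, this changes the file offset of fd. */
ssize_t read_at_with_lseek(int fd, void *buf, size_t count, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, count);
}

/* Hypothetical helper: the same data with a single system call,
   leaving the file offset of fd untouched. */
ssize_t read_at_with_pread(int fd, void *buf, size_t count, off_t offset)
{
    return pread(fd, buf, count, offset);
}
```

Both helpers return the same bytes for a regular file. The first needs two kernel entries per read and is not safe if another thread uses the same descriptor between the `lseek` and the `read`.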

Let's look at the differences in the implementation of both in an operating system source I had available. I cut out the code that is identical in both functions to highlight the differences. There's much more code than this.

This is part of pread:

vp = (struct vnode *)fp->f_data;
if (fp->f_type != DTYPE_VNODE || vp->v_type == VFIFO ||
        (vp->v_flag & VISTTY)) {
    return (ESPIPE);
}

offset = SCARG(uap, offset);
if (offset < 0 && vp->v_type != VCHR)
    return (EINVAL);

return (dofilereadv(p, fd, fp, &iov, 1, 0, &offset, retval));

This is the equivalent part of read:

return (dofilereadv(p, fd, fp, &iov, 1, 0, &fp->f_offset, retval));

So, pread does some extra checks to make sure that we're not trying to seek on a pipe, FIFO, tty, etc. It also rejects a negative offset unless the file is a character device. Finally, it doesn't update the offset in the file pointer (fp in the code): it passes a local copy of the offset, whereas read passes the address of fp->f_offset.
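The pipe check is visible from user space. The sketch below assumes a POSIX system, where `pread` on a pipe is specified to fail with `ESPIPE`; the function name `pread_errno_on_pipe` is invented for this example.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: returns the errno that pread reports on a pipe,
   0 if the call unexpectedly succeeds, or -1 if the setup fails. */
int pread_errno_on_pipe(void)
{
    int fds[2];
    char byte;
    int err = 0;

    if (pipe(fds) == -1)
        return -1;
    /* Put one byte in the pipe so that a plain read would succeed. */
    if (write(fds[1], "x", 1) != 1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }
    if (pread(fds[0], &byte, 1, 0) == -1)
        err = errno;
    close(fds[0]);
    close(fds[1]);
    return err;
}
```

A plain `read` on the same descriptor would return the byte, which is why read remains the only option for pipes, FIFOs and terminals.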

Art
  • 19,807
  • 1
  • 34
  • 60
3

It depends on the operation that you are going to do with read() and pread()

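For example: if you call read() twice on a file, the second call continues where the first one stopped, because read() advances the file offset. pread() leaves the file offset unchanged, so two calls with the same offset argument return the same data.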
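A small sketch of that difference, assuming `fd` refers to a regular file. The two function names are hypothetical; `lseek(fd, 0, SEEK_CUR)` is used only to report the current file offset.

```c
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: calls read() twice from the start of the file and
   returns the resulting file offset, or (off_t)-1 on error. */
off_t offset_after_two_reads(int fd, size_t count)
{
    char buf[64];

    if (count > sizeof buf)
        count = sizeof buf;
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return (off_t)-1;
    if (read(fd, buf, count) < 0 || read(fd, buf, count) < 0)
        return (off_t)-1;
    return lseek(fd, 0, SEEK_CUR);
}

/* Hypothetical helper: calls pread() twice at offset 0 and returns the
   resulting file offset, or (off_t)-1 on error. */
off_t offset_after_two_preads(int fd, size_t count)
{
    char buf[64];

    if (count > sizeof buf)
        count = sizeof buf;
    if (lseek(fd, 0, SEEK_SET) == (off_t)-1)
        return (off_t)-1;
    if (pread(fd, buf, count, 0) < 0 || pread(fd, buf, count, 0) < 0)
        return (off_t)-1;
    return lseek(fd, 0, SEEK_CUR);
}
```

With a file of at least `2 * count` bytes, the first function reports an offset of `2 * count`, while the second reports 0.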

1

The main advantage of pread is that in a multithreaded program you don't need to serialize I/O in order to prevent another thread from changing the file offset.

Other than that, the difference is most likely in the noise (but measure it for your use case if you're really interested!). As your question is tagged "linux", IIRC in the Linux kernel the lower level I/O reading function has a pread-like interface. For the read() syscall, it looks up the offset from the file table and then calls the lower level reading function with that offset, whereas the pread() syscall uses the offset provided by the caller directly. So I'd guess that pread() would be slightly more efficient than lseek() + read() (one syscall less being the main advantage), but probably not anything worth worrying about in most cases since syscalls are relatively fast on Linux.
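As an illustration of the multithreaded case, here is a sketch in which several threads read disjoint chunks through one shared descriptor without any locking. It assumes POSIX threads and a regular file of at least `nthreads * chunk` bytes; `struct chunk_job`, `read_chunk` and `parallel_read` are names invented for this example, and the limit of 16 threads is arbitrary.

```c
#include <pthread.h>
#include <stddef.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical per-thread job description. */
struct chunk_job {
    int fd;          /* descriptor shared by all threads */
    off_t offset;    /* where this thread reads from */
    size_t len;      /* how many bytes it reads */
    char *dst;       /* where the bytes go */
    ssize_t result;  /* return value of pread */
};

static void *read_chunk(void *arg)
{
    struct chunk_job *job = arg;
    /* No lock needed: pread neither uses nor changes the shared file offset. */
    job->result = pread(job->fd, job->dst, job->len, job->offset);
    return NULL;
}

/* Reads nthreads * chunk bytes from fd into out, one chunk per thread.
   Returns 0 if every thread read its full chunk, -1 otherwise. */
int parallel_read(int fd, char *out, size_t chunk, int nthreads)
{
    pthread_t tids[16];
    struct chunk_job jobs[16];
    int started = 0, rc = 0;

    if (nthreads < 1 || nthreads > 16)
        return -1;
    for (int i = 0; i < nthreads; i++) {
        jobs[i].fd = fd;
        jobs[i].offset = (off_t)i * (off_t)chunk;
        jobs[i].len = chunk;
        jobs[i].dst = out + (size_t)i * chunk;
        jobs[i].result = -1;
        if (pthread_create(&tids[i], NULL, read_chunk, &jobs[i]) != 0) {
            rc = -1;
            break;
        }
        started++;
    }
    for (int i = 0; i < started; i++) {
        pthread_join(tids[i], NULL);
        if (jobs[i].result != (ssize_t)chunk)
            rc = -1;
    }
    return rc;
}
```

Doing the same with `lseek` followed by `read` would require a mutex around each pair of calls, since all threads share the descriptor's single file offset. Depending on the toolchain, the program may need to be built with `-pthread`.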

janneb
  • 36,249
  • 2
  • 81
  • 97
0

It's not really a matter of efficiency but of functionality.

If I want to read a file from position x onwards, I would use lseek (if needed) followed by as many read calls as I need.

If I want to read from random places in the file and do a lot of jumping back and forth, then I would use pread instead of lseek + read.

One would expect pread to be slightly less efficient than read but slightly more efficient than lseek + read. The difference is probably not larger than a couple of clock cycles.

Klas Lindbäck
  • 33,105
  • 5
  • 57
  • 82
  • And as @janneb pointed out in other answer, if the file is shared between threads and the reads are unserialized, you'll prefer `pread`. – rodrigo Dec 13 '13 at 07:49