
I have a Linux application that reads 150-200 files (4-10 GB each) in parallel. Each file is read in turn in small, variably sized blocks, typically less than 2 KB each.

I currently need to maintain a combined read rate of over 200 MB/s across the set of files. The disks handle this just fine. There is a projected requirement of over 1 GB/s (which is out of the disks' reach at the moment).

We have implemented two different read systems, both of which make heavy use of posix_fadvise/posix_madvise: the first is an mmap-based read in which we map the entirety of the data set and read on demand. The second is a read()/seek()-based system.

Both work well, but only for moderate cases. The read() method manages our overall file cache much better and can deal well with hundreds of GB of files, but it is badly rate-limited; mmap is able to pre-cache data, making the sustained data rate of over 200 MB/s easy to maintain, but it cannot deal with large total data set sizes.
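
For reference, the kind of hinting both paths lean on looks roughly like this (a simplified sketch with illustrative names, not the actual code):

#include <fcntl.h>
#include <sys/mman.h>

/* fd is one data file, addr/length its mapping, and next_offset/next_len
   the region we expect to need soon (all names are illustrative). */
static void advise_next(int fd, void *addr, size_t length,
                        off_t next_offset, size_t next_len)
{
    /* read()/seek() path: sequential access plus a prefetch hint. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);
    posix_fadvise(fd, next_offset, next_len, POSIX_FADV_WILLNEED);

    /* mmap path: the equivalent hints on the mapped region. */
    posix_madvise(addr, length, POSIX_MADV_SEQUENTIAL);
    posix_madvise((char *)addr + next_offset, next_len, POSIX_MADV_WILLNEED);
}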

So my question comes to these:

A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?

B: Are there systematic ways for mmap to better deal with very large mapped data?

Mmap-vs-reading-blocks is a similar problem to the one I am working on, and it provided a good starting point on this problem, along with the discussions in mmap-vs-read.

Bill N.
  • This is a good example of how optimization/performance-related questions should be. It shows research, includes measured data and has well-defined objectives. After a while one gets tired of "have no idea *how fast it is*, want *faster*". Pity I can only +1 :) – R. Martinho Fernandes Nov 08 '11 at 21:17
  • Any chance you could replace your spinning disks with SSDs? That would avoid the head-seek penalties you are likely paying as you transition from one file to another... – Jeremy Friesner Nov 08 '11 at 21:50
  • Wasn't the point of distributed computing (map-reduce) to solve such problems (use Hadoop)? – Ramadheer Singh Nov 08 '11 at 22:11
  • It is not clear exactly what problem you are having with mapping the entirety of your data set using `mmap()`. `mmap()` should have no problem mapping 100s of GBs (provided you are compiling a 64-bit executable, anyway). – caf Nov 09 '11 at 00:39

3 Answers


Reads back to what? What is the final destination of this data?

Since it sounds like you are completely I/O bound, mmap and read should make no difference. The interesting part is how you get the data to your receiver.

Assuming you're putting this data into a pipe, I recommend you just dump the contents of each file in its entirety into the pipe. To do this with zero copy, try the splice system call. You might also try copying the file manually, or forking an instance of cat or some other tool that can buffer heavily, with the current file as stdin and the pipe as stdout.

pid_t pid;

if ((pid = fork())) {
    /* Parent: wait for the child to finish copying. */
    waitpid(pid, NULL, 0);
} else {
    /* Child: the pipe becomes stdout, the current file becomes stdin. */
    dup2(dest, 1);
    dup2(source, 0);
    execlp("cat", "cat", (char *)NULL);
    _exit(127);  /* only reached if exec fails */
}
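
If you go the splice route instead, a minimal copy loop might look like this (a sketch assuming, as above, that source is the current file's descriptor and dest is the write end of the pipe):

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Repeatedly splice the file into the pipe without copying the data
   through userspace; returns 0 at end of file, -1 on error. */
static int splice_file_to_pipe(int source, int dest)
{
    for (;;) {
        ssize_t n = splice(source, NULL, dest, NULL,
                           64 * 1024, SPLICE_F_MORE | SPLICE_F_MOVE);
        if (n == 0)
            return 0;
        if (n == -1)
            return -1;   /* check errno */
    }
}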

Update0

If your processing is file-agnostic and doesn't require random access, you want to create a pipeline using the options outlined above. Your processing step should accept data from stdin or a pipe.

To answer your more specific questions:

A: Can read()-type file I/O be further optimized beyond the posix_fadvise calls on Linux, or, having tuned the disk scheduler, VMM and posix_fadvise calls, is that as good as we can expect?

That's as good as it gets with regard to telling the kernel what to do from userspace. The rest is up to you: buffering, threading, etc., but it's dangerous and probably unproductive guesswork. I'd just go with splicing the files into a pipe.

B: Are there systematic ways for mmap to better deal with very large mapped data?

Yes. The following options may give you awesome performance benefits (and may make mmap worth using over read, with testing); a brief sketch follows the list:

  • MAP_HUGETLB Allocate the mapping using "huge pages."

    This will reduce the paging overhead in the kernel, which is great if you will be mapping gigabyte sized files.

  • MAP_NORESERVE Do not reserve swap space for this mapping. When swap space is reserved, one has the guarantee that it is possible to modify the mapping. When swap space is not reserved one might get SIGSEGV upon a write if no physical memory is available.

    This will prevent you from running out of memory while keeping your implementation simple, if you don't actually have enough physical memory + swap for the entire mapping.

  • MAP_POPULATE Populate (prefault) page tables for a mapping. For a file mapping, this causes read-ahead on the file. Later accesses to the mapping will not be blocked by page faults.

    This may give you speed-ups with sufficient hardware resources, and if the prefetching is ordered and lazy. I suspect this flag is redundant; the VFS likely does this better by default.
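
A minimal sketch of mapping one data file along those lines (fd and file_len are illustrative; MAP_HUGETLB is left out because it typically only applies to anonymous or hugetlbfs-backed mappings):

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

static void *map_data_file(int fd, size_t file_len)
{
    void *p = mmap(NULL, file_len, PROT_READ,
                   MAP_PRIVATE | MAP_NORESERVE | MAP_POPULATE, fd, 0);
    if (p == MAP_FAILED)
        return NULL;

    /* Extra access-pattern hint, in the spirit of the posix_madvise
       calls already in use. */
    posix_madvise(p, file_len, POSIX_MADV_SEQUENTIAL);
    return p;
}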

Matt Joiner
  • The destination is eventually a network device, though the data has to be processed and reordered before that. – Bill N. Nov 08 '11 at 21:47
  • I don't think `MAP_HUGETLB` is allowed for normal file-backed mappings. I think it's only allowed for anonymous mappings. (Or for files created in `/dev/hugepages`, I guess to allow hugepage shared memory between processes.) I don't think `MAP_NORESERVE` makes sense for shared file-backed mappings. It makes sense for private mappings, because writes to those mappings don't affect the file contents. But shared mappings are already backed by the file on disk, not swap space. – Peter Cordes May 08 '16 at 20:28

Perhaps using the readahead system call might help, if your program can predict in advance the file fragments it wants to read (but this is only a guess; I could be wrong).
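
A rough sketch of what that could look like, assuming you can compute the next region up front (fd, next_offset and next_len are just illustrative names):

#define _GNU_SOURCE
#include <fcntl.h>

/* Ask the kernel to pull the region we expect to need next into the
   page cache ahead of the actual reads. It is only a hint, so errors
   are ignored and we simply fall back to on-demand reads. */
static void prefetch_region(int fd, off64_t next_offset, size_t next_len)
{
    (void)readahead(fd, next_offset, next_len);
}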

And I think you should tune your application, and perhaps even your algorithms, to read data in chunks much bigger than a few kilobytes. Can't that be half a megabyte instead?
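
For instance (a sketch; process_block() stands in for whatever consumes one of the small records):

#include <stdlib.h>
#include <unistd.h>

#define CHUNK_SIZE (512 * 1024)   /* half a megabyte per read() */

extern void process_block(const char *data, size_t len);  /* placeholder */

static void read_in_big_chunks(int fd)
{
    char *buf = malloc(CHUNK_SIZE);
    ssize_t n;

    if (!buf)
        return;
    while ((n = read(fd, buf, CHUNK_SIZE)) > 0) {
        /* Carve the small, variable-sized blocks out of buf here instead
           of issuing a separate read() for each one; for the sketch the
           whole chunk is handed on in one go. */
        process_block(buf, (size_t)n);
    }
    free(buf);
}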

Basile Starynkevitch
  • Good idea. However, and I might be wrong, read(), say, invokes do_generic_read() which to the best of my knowledge automatically performs a readahead. Maybe fine-tuning the read-ahead window or something else could help, but probably this is already done in the background. – gnometorule Nov 08 '11 at 22:20
  • And yes to requesting more. This, say, triggers the automatic read-ahead algorithm. – gnometorule Nov 08 '11 at 22:21

The problem here doesn't seem to be which API is used. It doesn't matter whether you use mmap() or read(); the disk still has to seek to the specified point and read the data (although the OS does help to optimize the access).

mmap() has advantages over read() if you read very small chunks (a couple of bytes) because you don't have to call into the OS for every chunk, which becomes very slow.

I would also advise, as Basile did, reading more than 2 KB consecutively so the disk doesn't have to seek that often.

Tobias Schlegel
  • Well, with mmap, reading from a page which isn't mapped will result in a trap to the OS, which then reads the data from disk, so it's not any cheaper than a read() call. Of course, reading the same page repeatedly is fast with mmap. – janneb Nov 08 '11 at 21:37
  • True; also, the read to populate the mmap is async from the point of view of the application, so unlike a readahead() call, my file thread will not block unless I have exhausted the cached pages and the kernel is in I/O wait. – Bill N. Nov 08 '11 at 21:49