
I have a C program that runs only once a week and reads a large number of files, each of them only once. Since Linux caches everything that is read, these files fill up the cache needlessly, and this slows the system down a lot unless it has an SSD.

So how do I open and read from a file without filling up the disk cache?

Note:

By disk caching I mean that when you read a file twice, the second time it's read from RAM, not from disk. I.e. data once read from the disk is left in RAM, so subsequent reads of the same file will not need to reread the data from disk.

sashoalm
  • You'd think Linux would have some configuration regarding disk caching. Either way, is this really a C problem? You would have the same problem regardless of the programming language, wouldn't you? Have you tried running the program in valgrind? It could be that you have memory leaks. – autistic Mar 07 '13 at 08:22
  • That's true, but otherwise someone might have posted Python code samples :) – sashoalm Mar 07 '13 at 08:24
  • Well, if you hadn't asked for C you would've got more "Linux" answers. Please answer all of my questions: Have you tried running your program in valgrind? – autistic Mar 07 '13 at 08:42
  • OK, I removed the C tag. – sashoalm Mar 07 '13 at 09:05

2 Answers


I believe passing O_DIRECT to open() should help:

O_DIRECT (Since Linux 2.4.10)

Try to minimize cache effects of the I/O to and from this file. In general this will degrade performance, but it is useful in special situations, such as when applications do their own caching. File I/O is done directly to/from user space buffers. The O_DIRECT flag on its own makes an effort to transfer data synchronously, but does not give the guarantees of the O_SYNC flag that data and necessary metadata are transferred. To guarantee synchronous I/O, O_SYNC must be used in addition to O_DIRECT.

There are further detailed notes on O_DIRECT towards the bottom of the man page, including a fun quote from Linus.
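A minimal sketch of what this could look like in C, not a definitive implementation. The helper name `read_file_direct`, the 4096-byte alignment and the 256 KiB buffer size are assumptions made for illustration; the real alignment requirement depends on the kernel and the device, as described in the man page notes. The buffer comes from posix_memalign() because O_DIRECT needs an aligned user buffer, and each read is a multiple of the alignment.

```c
#define _GNU_SOURCE               /* exposes O_DIRECT in <fcntl.h> on glibc */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <unistd.h>

#define ALIGNMENT 4096            /* assumption: covers the logical block size */
#define BUF_SIZE  (256 * 1024)    /* must be a multiple of ALIGNMENT */

/* Hypothetical helper: reads a whole regular file with O_DIRECT.
 * Returns the number of bytes read, or -1 with errno set. */
long long read_file_direct(const char *path)
{
    assert(BUF_SIZE % ALIGNMENT == 0);

    int fd = open(path, O_RDONLY | O_DIRECT);
    if (fd < 0)
        return -1;                /* e.g. EINVAL if the filesystem lacks O_DIRECT */

    void *buf = NULL;
    int rc = posix_memalign(&buf, ALIGNMENT, BUF_SIZE);
    if (rc != 0) {                /* posix_memalign returns the error number */
        close(fd);
        errno = rc;
        return -1;
    }

    long long total = 0;
    for (;;) {
        ssize_t n = read(fd, buf, BUF_SIZE);
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal: retry */
            int saved = errno;
            free(buf);
            close(fd);
            errno = saved;
            return -1;
        }
        /* ... process the n bytes in buf here ... */
        total += n;
        if (n < BUF_SIZE)
            break;                /* short read on a regular file: end of file */
    }

    free(buf);
    close(fd);
    return total;
}
```

Calling `read_file_direct("/path/to/file")` returns the file size on success. Some filesystems reject O_DIRECT with EINVAL, so a caller may want to fall back to a normal open() in that case.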

NPE
  • Just to be sure, since it says "In general this will degrade performance" - it won't actually degrade it for me, right? Because I really am reading the files only once. – sashoalm Mar 07 '13 at 08:18
  • @sashoalm: I think that's exactly what it means. It'll degrade the performance of repeated reads. However, since you're not doing repeated reads, this won't apply to you. If anything, it should improve performance in your case, since you won't be needlessly polluting the cache. – NPE Mar 07 '13 at 08:18
  • IIRC, this will disable read-ahead, which is likely where the performance degradation comes in. – Hasturkun Mar 07 '13 at 08:25
  • Indeed. No caching means if you were to read char by char, then the drive would be seeking once for each char rather than seeking once for 4 KB, which is 4096 times more seeking. I don't think this is what the OP wants. – autistic Mar 07 '13 at 08:53
  • @modifiablelvalue: That's not quite true: you wouldn't even be able to read one char using `O_DIRECT`. This is covered in detail in the man page I've linked to. To give just one quote: *Under Linux 2.4, transfer sizes, and the alignment of the user buffer and the file offset must all be multiples of the logical block size of the file system. Under Linux 2.6, alignment to 512-byte boundaries suffices.* – NPE Mar 07 '13 at 09:02
  • Note that using `O_DIRECT` will reduce performance if the same files are read by other processes without the `O_DIRECT` flag. In that case, both readers suffer a performance penalty. If you truly know that no other process is reading those files and you only need to read the files once, `O_DIRECT` may be a good solution. – Mikko Rantalainen Jan 04 '21 at 08:39

You can use posix_fadvise() with the POSIX_FADV_DONTNEED advice to request that the system free the pages you've already read.
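A rough sketch in C of how that might look, under some assumptions: the helper name `read_file_dontneed` and the 64 KiB buffer are made up for illustration, and the advice is treated as best-effort, so its return value is ignored. The file is read normally through the page cache, and the cached pages are released afterwards. An offset of 0 with a length of 0 means "from the start to the end of the file".

```c
#define _POSIX_C_SOURCE 200112L   /* exposes posix_fadvise() */
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: reads a whole file with ordinary buffered I/O, then
 * asks the kernel to drop the file's cached pages.
 * Returns the number of bytes read, or -1 with errno set. */
long long read_file_dontneed(const char *path)
{
    assert(path != NULL);

    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return -1;

    char buf[64 * 1024];
    long long total = 0;
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) != 0) {
        if (n < 0) {
            if (errno == EINTR)
                continue;         /* interrupted by a signal: retry */
            int saved = errno;
            close(fd);
            errno = saved;
            return -1;
        }
        /* ... process the n bytes in buf here ... */
        total += n;
    }

    /* Advice only: the kernel may ignore it. posix_fadvise() returns an
     * error number rather than setting errno; it is deliberately ignored. */
    (void)posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);

    close(fd);
    return total;
}
```

For very large files it may be better to issue the advice periodically on the range already consumed rather than once at the end, so the cache does not fill up in the meantime.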

Hasturkun
  • Thanks. Wouldn't POSIX_FADV_NOREUSE be more appropriate? Just looked at the link. – sashoalm Mar 07 '13 at 08:26
  • It probably would, but the documentation suggests it is a no-op. – Hasturkun Mar 07 '13 at 08:30
  • I would prefer using fadvise over O_DIRECT. You can even have another program periodically tell the system that it doesn't need to cache certain files. I had it set up like that when parsing large log files with awstats. – Marki555 Mar 07 '13 at 14:49
  • You can use POSIX_FADV_NOREUSE _before_ reading the data and POSIX_FADV_DONTNEED _after_ reading. POSIX_FADV_NOREUSE is currently a no-op, but maybe it will be implemented someday. – lav Apr 22 '16 at 09:30
  • As suggested in https://github.com/jborg/attic/issues/252, the current implementation in Linux will unconditionally purge cached pages, possibly degrading the performance of other applications using the same files. – jan Sep 20 '17 at 07:54