
I am working on something that requires reading from and writing to a large file (or equivalent) but is allowed fairly minimal memory to do it (I don't have the exact spec, but let's call the "large" 15GB and the "minimal" 16K). The file is accessed randomly, usually in chunks of 512 bytes, and it is guaranteed that consecutive reads will sometimes be a significant distance apart - possibly literally opposite ends of the disk (or a small number of MB from either end). Currently I'm using pread/pwrite to hit the locations I want in the file (I was previously using fseek, but abandoned it in favor of pread/pwrite, because reasons).
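For concreteness, the read path currently looks roughly like this (the helper name and the retry loop are illustrative, not my exact code):

```c
#define _XOPEN_SOURCE 700
#include <stdint.h>
#include <unistd.h>

#define CHUNK 512

/* Read one 512-byte chunk at an absolute byte offset, handling short reads. */
static ssize_t read_chunk(int fd, uint64_t off, void *buf)
{
    size_t done = 0;
    while (done < CHUNK) {
        ssize_t n = pread(fd, (char *)buf + done, CHUNK - done,
                          (off_t)(off + done));
        if (n < 0)  return -1;    /* I/O error */
        if (n == 0) break;        /* unexpected EOF */
        done += (size_t)n;
    }
    return (ssize_t)done;
}
```

The writes are the mirror image with pwrite. Consecutive calls can land anywhere in the 15GB, which is where the cost seems to come from.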

Accessing the file this way is (perhaps obviously) slow, and I'm looking for ways to optimise/speed up the performance as much as possible, with as little use of external libraries as possible (read: none).

I don't mean to be too cagey about exactly what we're doing, so it might help to think of it as a driver for a file system. At one end of the disk we're accessing the file and directory tables, and at the other the raw data - so we need to write file information and then skip over to the data. But even within those zones, don't assume anything about the layout. There is no guarantee that multiple files (or even multiple chunks of a single file) will be stored contiguously - or even close together. This also means we can't make assumptions about the order in which data will be read.

A couple of things I have considered include:

  • Opening multiple file descriptors for different parts of the file (but I'm not sure whether there's any state associated with an FD, and hence whether this would even have an impact)
  • A few smarts around caching data that I expect to be accessed several times in a short amount of time (a rough sketch of what I had in mind is below)
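
For the caching idea, I was picturing something like a tiny direct-mapped sector cache that fits comfortably in the 16K budget (16 slots × 512 B is 8 KiB of data plus tags). This is just a sketch to show the shape of it - the names and sizes are illustrative, not real driver code:

```c
#include <stdint.h>
#include <unistd.h>

#define SECTOR 512
#define NSLOTS 16                 /* power of two, so the index is a mask */

struct sector_cache {
    uint64_t tag[NSLOTS];         /* which sector each slot currently holds */
    uint8_t  valid[NSLOTS];
    uint8_t  data[NSLOTS][SECTOR];
};

/* Return a pointer to the cached sector, reading it from fd on a miss. */
static const uint8_t *cache_get(struct sector_cache *c, int fd, uint64_t sector)
{
    unsigned i = (unsigned)(sector & (NSLOTS - 1));
    if (!c->valid[i] || c->tag[i] != sector) {
        c->valid[i] = 0;          /* invalidate before overwriting the buffer */
        if (pread(fd, c->data[i], SECTOR, (off_t)(sector * SECTOR)) != SECTOR)
            return NULL;          /* error or short read: slot stays invalid */
        c->tag[i] = sector;
        c->valid[i] = 1;
    }
    return c->data[i];
}
```

(A write path would obviously need to update or invalidate the matching slot; I've left that out.)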

I was wondering whether others might have been in a similar boat and/or have opinions (or articles they can link) that discuss different strategies to minimise the cost of these reads. I guess I'm also wondering whether pread is even the right choice in this situation... Any thoughts/opinions/criticisms/etc. are more than welcome.

NOTE: The program will always run in a single thread (so options don't need to be thread-safe, but equally pushing the read to the background isn't an option either).

user679560
  • This looks semi-related to a [question I asked](https://stackoverflow.com/q/13171052/1553090) quite some time ago. There are a couple of good answers there, although I don't know if they'll cover anything you haven't already encountered yourself. – paddy Sep 03 '20 at 04:21
  • Optimizing disk seek times is a rather old and well-researched problem. What has your own research turned up? – Some programmer dude Sep 03 '20 at 05:46
  • What kind of disks? Many solid-state disks have no significant seek times, so doing a lot of work to optimize disk access may have little benefit. It's a bigger deal, of course, with magnetic disks. – Kevin Boone Sep 03 '20 at 08:32
  • @paddy - thanks for that - looks useful. @Kevin - the disks are USB drives and/or memory cards (so solid-state), but there definitely appears to be a lag when using higher addresses (although I suppose there might be some other bottleneck in my code, e.g. calculating the offsets). – user679560 Sep 03 '20 at 23:39
  • @Some programmer dude: If my own research had turned up anything useful, I wouldn't be asking the question here... So either my own research is inadequate (maybe I'm searching for the wrong things) or there's far less literature out there than you assert. Perhaps you might like to share your favourite link on the topic? – user679560 Sep 03 '20 at 23:41
  • @user679560 -- I guess if you suspect your offset calculation might be involved, you could try pre-computing a bunch of offsets and then reading them without further computation. I wonder if the device/driver/kernel is less inclined to buffer data from the end of the file, because it's less likely (in general) to be read repeatedly? Problem is, there's an awful lot that could be going on here, that would be fiddly to sort out. – Kevin Boone Sep 04 '20 at 14:49

0 Answers