
In C++, how expensive is it to use the `istream::seekg` operation?

EDIT: How much can I get away with seeking around a file and reading bytes as I go? And which matters more: the frequency of the seeks or the magnitude of each offset?

I have a large file (4 GB) that I am parsing, and I want to know whether it's necessary to try to consolidate some of my `seekg` calls. I would assume that the magnitude of the jump in file position plays a role: if you seek more than a memory page away, it will hurt performance, while small seeks cost nothing. Is this correct?
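Roughly, my access pattern looks like the following sketch (file name, offsets, and read size are made up for illustration):

```cpp
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data.bin", std::ios::binary); // stand-in for my 4 GB file
    std::vector<char> buf(64);

    // Hop to an offset, read a few bytes, repeat. Sometimes the next
    // offset is nearby, sometimes it is hundreds of megabytes away.
    for (std::streamoff off : {1024LL, 1088LL, 500000000LL}) {
        in.seekg(off);
        in.read(buf.data(), buf.size());
    }
}
```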

Brian
  • The position is just a pointer; changing it will not actually read file content. – WiSaGaN Jan 18 '14 at 05:18
  • Ah crap. Okay let me make my question more reasonable... – Brian Jan 18 '14 at 05:26
  • wait...no I had a good reason for asking this.. – Brian Jan 18 '14 at 05:26
  • Wait, nevermind, no I didn't.. – Brian Jan 18 '14 at 05:27
  • So you mean you will actually read a few bytes after each seek? – WiSaGaN Jan 18 '14 at 05:32
  • If you're worried about performance, mitigate the cost by switching to memory-mapped I/O. If that does not suffice, try to group nearby seeks to improve cache hits. Use cachegrind to guide you. – jweyrich Jan 18 '14 at 05:33
  • @jweyrich: [The performance advantages of mmap are a total myth.](http://stackoverflow.com/a/6657136/768469) – Nemo Jan 18 '14 at 05:36
  • @jweyrich That's kind of what I want to know--how many bytes away from my current location would I be able to seek before I get performance degradation from swapping. – Brian Jan 18 '14 at 05:37
  • @Nemo: that affirmation requires a fair amount of data and graphs to proof. I don't see any. – jweyrich Jan 18 '14 at 05:38
  • @jweyrich: Just quoting Linus Torvalds and describing my own experience. (Sorry, no graphs.) Where I work, we routinely process files in the 500-1000 gigabyte range. None of our code uses mmap() because we learned long ago that it is usually the slowest option. – Nemo Jan 18 '14 at 05:41
  • @Nemo: Well, I have to agree that what Torvalds says makes sense, but I believe it depends on *how* you access the data. I suspect the performance hit of a page fault can be amortised very quickly if that page is accessed a lot. Memory-mapping might benefit you with L1/L2/L3 caching. That's something you can't get from `read` AFAIK. You could use `readahead` for this though. – jweyrich Jan 18 '14 at 05:50
  • 2
    @jweyrich: True, how you access it does matter. But the cost of one extra copy of a page from `read` is usually irrelevant, especially if you can then operate on that copy while in cache. The page table and TLB thrashing that `mmap` requires often ends up dominating. (I do have a hunch that this might be different on recent kernels with "transparent hugepages", but I never went back to try it...) The bottom line is there is no substitute for benchmarking. – Nemo Jan 18 '14 at 05:54
  • @Nemo: agreed. +1 to your answer as well. Benchmarking is the definitive answer. If there's a known access pattern, telling the kernel how to perform read-aheads could be of great help too - `fadvise`, `posix_fadvise`. – jweyrich Jan 18 '14 at 06:03

1 Answer


The answer depends heavily on your operating system and disk subsystem.

Obviously, the seek itself will take essentially zero time, since it just updates an offset. Actually reading will pull some data off of disk...

...but how much data depends on many things. Your disk has a cache which may have its own block size and may do some sort of read-ahead. Your RAID controller (if any) will have its own cache, possibly with its own block size and read-ahead.

Your kernel has a page cache (essentially all of your free RAM), and it will also probably do some sort of read-ahead. On Linux this is configurable, and the kernel will adapt it based on how sequential your access patterns appear to be, whether you have called `posix_fadvise`, etc.
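As a sketch (Linux/POSIX-specific; the file name is a placeholder, and whether sequential advice is right depends on your actual access pattern), the hint looks like this:

```cpp
#include <fcntl.h>
#include <unistd.h>

int main() {
    int fd = open("data.bin", O_RDONLY); // placeholder file name
    if (fd < 0) return 1;

    // Advise the kernel that access will be mostly sequential over the
    // whole file (offset 0, length 0 means "to the end"), so it can
    // read ahead more aggressively.
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    // ... read() loop goes here ...

    close(fd);
}
```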

All of these caches mean that if you access some data, then access nearby data later, there is a chance the second access will not actually touch the disk at all.

If you have the option of coding so that you access the file sequentially, that is certainly going to be faster than random reads, especially small random reads. A seek on a single mechanical disk takes something like 10 ms, so you can do the math here: 10,000 scattered reads cost on the order of 100 seconds in seek time alone, while a disk streaming at (say) 100 MB/s reads an entire 4 GB file sequentially in about 40 seconds. (Seeking on a solid state drive is around 100 times faster.)

Large reads are generally better than small reads... although processing data a few kilobytes at a time can be faster than larger blocks if it allows the working set to stay in the CPU caches.
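As a sketch of the knob in question (the 64 KiB buffer is an arbitrary starting point, not a recommendation):

```cpp
#include <fstream>
#include <vector>

int main() {
    std::ifstream in("data.bin", std::ios::binary); // placeholder file name

    // The size to experiment with: large enough to amortize per-call
    // overhead, small enough that processing stays in the CPU caches.
    std::vector<char> buf(64 * 1024);

    while (in.read(buf.data(), buf.size()) || in.gcount() > 0) {
        // process in.gcount() valid bytes of buf here
    }
}
```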

In short, you will need to provide a lot more details about your system and your application to get a proper answer, and even then the most likely answer is "benchmark it".
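If it helps, a benchmark along these lines is the shape I would start from (file name, chunk size, and read count are all placeholders; drop the OS page cache between runs if you want cold-cache numbers):

```cpp
#include <chrono>
#include <fstream>
#include <iostream>
#include <random>
#include <vector>

int main() {
    const std::streamoff file_size = 4LL * 1024 * 1024 * 1024; // assumed size
    const std::streamoff chunk = 4096;   // placeholder read size
    const int n_reads = 10000;           // placeholder read count

    std::ifstream in("data.bin", std::ios::binary);
    std::vector<char> buf(static_cast<std::size_t>(chunk));

    std::mt19937_64 rng(42);
    std::uniform_int_distribution<std::streamoff> pick(0, file_size / chunk - 1);

    auto t0 = std::chrono::steady_clock::now();
    for (int i = 0; i < n_reads; ++i) {
        in.seekg(pick(rng) * chunk);     // comment out to time sequential reads
        in.read(buf.data(), buf.size());
    }
    auto t1 = std::chrono::steady_clock::now();

    std::cout << std::chrono::duration<double>(t1 - t0).count() << " seconds\n";
}
```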

Nemo