6

I have an input file in my application that contains a vast amount of information. Reading over it sequentially, at only a single file offset at a time, is not sufficient for my application's usage. Ideally, I'd like to have two threads, each with its own distinct ifstream reading from a unique file offset in the same file. I can't just open one ifstream and then duplicate it using its copy constructor (since it's non-copyable). So, how do I handle this?

Immediately I can think of two ways,

  1. Construct a new ifstream for the second thread, open it on the same file.
  2. Share a single instance of an open ifstream across both threads (using, for instance, boost::shared_ptr<>), and seek to whichever file offset the current thread is interested in whenever that thread gets a time slice.

Is one of these two methods preferred?

Is there a third (or fourth) option that I have not yet thought of?

Obviously I am ultimately limited by the hard drive having to spin back and forth, but what I am interested in taking advantage of (if possible), is some OS level disk caching at both file offsets simultaneously.

Thanks.

ypnos
J T

5 Answers

12

Two std::ifstream instances will probably be the best option here. Modern HDDs are optimized for a large queue of I/O requests, so reading from two std::ifstream instances concurrently should give quite nice performance.

If you have a single std::ifstream you'll have to worry about synchronizing access to it, plus it might defeat the operating system's automatic sequential access read-ahead caching, resulting in poorer performance.

Cory Nelson
  • This is true only if the original access was random. If the original access was sequential then the random access induced by the two threads would make things worse. – Billy ONeal Jun 02 '11 at 15:47
  • Indeed. He explicitly states that reading it sequentially is not sufficient. In this case where he must perform random access, two concurrent requests are going to be better. – Cory Nelson Jun 02 '11 at 22:24
6

Between the two, I would prefer the second. Having two openings of the same file might cause an inconsistent view of the file between the two streams, depending on the underlying OS.

For a third option, pass a reference or raw pointer into the other thread. So long as the semantics are that one thread "owns" the istream, a raw pointer or reference is fine.

Finally, note that on the vast majority of hardware, the disk is the bottleneck, not the CPU, when loading large files. Using two threads will make this worse, because you're turning sequential file access into random access. Typical hard disks can do maybe 100 MB/s sequentially, but top out at 3 or 4 MB/s for random access.

Billy ONeal
4

Other option:

  • Memory-map the file, create as many memory istream objects as you want. (istrstream is good for this, istringstream is not).
Ben Voigt
  • Does your compiler need to support a certain C++ standard to use this? – J T Jun 02 '11 at 15:41
  • @J T: Memory mapping is not covered by the standard. You'd have to use whatever calls it would take on your platform. On POSIX that'll be `mmap`, on Windows that'll be `CreateFileMapping` + `MapViewOfFile` – Billy ONeal Jun 02 '11 at 15:46
  • Memory mapping is not part of the standard. Boost Interprocess has cross-platform memory mapping support, though. – Cory Nelson Jun 02 '11 at 15:48
  • I thought [std::strstream was deprecated](http://stackoverflow.com/questions/2820221/why-was-stdstrstream-deprecated])? – Flexo Jun 02 '11 at 15:49
  • @awoodland: it is. And it's deprecated in C++0x too. Since deprecated implies "required to be present in any conforming implementation", we're good to go ;-) An alternative is to rewrite the code to operate directly on the mapped array, rather than via a stream, but using `istrstream`, which necessarily is read-only, is pretty harmless. – Steve Jessop Jun 02 '11 at 16:10
1

It really depends on your system. A modern system will generally read ahead; seeking within the file is likely to inhibit this, so it should definitely be avoided.

It might be worth experimenting with how read-ahead works on your system: open the file, then read the first half of it sequentially, and see how long that takes. Then open it, seek to the middle, and read the second half sequentially. (On some systems I've seen in the past, a simple seek, at any time, will turn off read-ahead.) Finally, open it, then read every other record; this will simulate two threads using the same file descriptor. (For all of these tests, use fixed length records, and open in binary mode. Also take whatever steps are necessary to ensure that any data from the file is purged from the OS's cache before starting the test—under Unix, copying a file of 10 or 20 gigabytes to /dev/null is usually sufficient for this.)

That will give you some ideas, but to be really certain, the best solution would be to test the real cases. I'd be surprised if sharing a single ifstream (and thus a single file descriptor), and constantly seeking, won, but you never know.

I'd also recommend system specific solutions like mmap, but if you've got that much data, there's a good chance you won't be able to map it all in one go anyway. (You can still use mmap, mapping sections of it at a time, but it becomes a lot more complicated.)

Finally, would it be possible to get the data already cut up into smaller files? That might be the fastest solution of all. (Ideally, this would be done where the data is generated or imported into the system.)

James Kanze
0

My vote would be a single reader, which hands the data to multiple worker threads.

If your file is on a single disk, then multiple readers will kill your read performance. Yes, your kernel may have some fantastic caching or queuing capabilities, but it is going to be spending more time seeking than reading data.

Nicolas
  • There'll be just as much seeking on the disk when he seeks in the file. And constantly seeking will defeat any read-ahead strategies the OS might be using. – James Kanze Jun 02 '11 at 15:53
  • @james-kanze Reading a file sequentially from 2 different processes or threads will have a whole lot more disk seek time than a single process or thread seeking around the file. An analogy: Think of 2 people reading from the same book. – Nicolas Jun 02 '11 at 15:58
  • Ok, but in your book analogy, if a single person wanted to read chapters 4 and 8 simultaneously, wouldn't he have to flip the pages back and forth just as much? – J T Jun 02 '11 at 15:59
  • @J-t: Yes, but read the OP's post: `..Reading over it sequentially..` Even if he wanted to read from chapters 4 and 8 at the same time, he'll still only be reading from one chapter at any one time. He can decide when to flip to the other chapter. – Nicolas Jun 02 '11 at 16:02
  • "Reading over it sequentially ... is not sufficient" – J T Jun 02 '11 at 16:15