
I need to read four very large (>2 GB) files line by line simultaneously in a C# application. I'm using four different StreamReader objects and their ReadLine() method. Performance is seriously affected while reading lines from the four files at the same time, but improves as each one of them reaches EOF (perf with 4 files < perf with 3 files < perf with 2 files...).

I have this (simplified, assuming only two files for a cleaner example) code:

StreamReader readerOne = new StreamReader(@"C:\temp\file1.txt");
StreamReader readerTwo = new StreamReader(@"C:\temp\file2.txt");

while(readerOne.Peek() >= 0 || readerTwo.Peek() >= 0)
{
    string[] readerOneFields = readerOne.Peek() >= 0 ? 
        readerOne.ReadLine().Split(',') : null;
    string[] readerTwoFields = readerTwo.Peek() >= 0 ? 
        readerTwo.ReadLine().Split(',') : null;

    if (readerOneFields != null && readerTwoFields != null)
    {
        if (readerOneFields[2] == readerTwoFields[2])
        {
            // Do some boring things...
        }
    }
    else if (readerOneFields != null)
    {
        // ...
    }
    else
    {
        // ...
    }
}
readerOne.Close();
readerTwo.Close();

The reason I have to read these files at the same time is that I need to do some comparisons between their lines, and afterwards write the results to a new file.

I've read a lot of questions regarding large file reading using StreamReader, but I couldn't find a scenario like the one I have. Is using the ReadLine() method the proper way to accomplish this? Is StreamReader even the proper class?

UPDATE: things are getting weirder now. Just for testing, I've tried reducing the file sizes to about 10 MB by deleting lines, leaving only 70K records. Furthermore, I have tried with only two files (instead of four) at the same time. And I'm getting the same poor performance while reading from the two files simultaneously! When one of them reaches EOF, performance gets better. I'm setting a StreamReader buffer size of 50 MB.

HiseaSaw
  • What is the medium you're reading from? A hard disk? – user3613916 Jul 04 '14 at 06:57
  • Oh yes, a hard disk. Sorry, I forgot to mention that. – HiseaSaw Jul 04 '14 at 07:00
  • Why do you use Peek? ReadLine will return null at EOF. Using Peek for every line seems a bit inefficient for such large files (a Peek-free sketch follows these comments). – jgauffin Jul 04 '14 at 07:07
  • Then I'm afraid it might be a hardware bottleneck. Hard disks' performance is known to drop severely in situations like the one described - multiple files accessed at once - as the head needs to continuously reposition. If possible, try a different medium, like a solid state drive. – user3613916 Jul 04 '14 at 07:08
  • @jgauffin you're absolutely right :-) Do you think using the Peek evaluation is affecting performance in this case anyway? – HiseaSaw Jul 04 '14 at 07:09
  • Have you compared the performance of just reading through both files and discarding the data, (a) alternating and (b) file-by-file, to your current implementation? A HW bottleneck is most plausible, but if either of those is significantly faster, something can be done. Also, maybe BufferedStream: http://stackoverflow.com/questions/2161895/reading-large-text-files-with-streams-in-c-sharp – peterchen Jul 04 '14 at 07:50
  • Thanks @peterchen. I've tried with BufferedStream, with no luck. In fact, it seemed to be a little bit slower. As soon as it's reading only one file, performance gets significantly faster. Honestly, I no longer think it's a HW bottleneck, since I'm currently testing with only two 10 MB files. – HiseaSaw Jul 04 '14 at 09:28
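
A minimal sketch of the Peek-free loop jgauffin suggests, assuming the same file paths and comma-separated format as in the question; ReadLine() returns null at end of file, so it can drive the loop directly:

using System.IO;

using (StreamReader readerOne = new StreamReader(@"C:\temp\file1.txt"))
using (StreamReader readerTwo = new StreamReader(@"C:\temp\file2.txt"))
{
    // Read one line ahead per file; ReadLine() returns null at end of file.
    string lineOne = readerOne.ReadLine();
    string lineTwo = readerTwo.ReadLine();

    while (lineOne != null || lineTwo != null)
    {
        string[] readerOneFields = lineOne != null ? lineOne.Split(',') : null;
        string[] readerTwoFields = lineTwo != null ? lineTwo.Split(',') : null;

        if (readerOneFields != null && readerTwoFields != null)
        {
            if (readerOneFields[2] == readerTwoFields[2])
            {
                // Do some boring things...
            }
        }
        else if (readerOneFields != null)
        {
            // ...
        }
        else
        {
            // ...
        }

        // Advance only the readers that still have data.
        if (lineOne != null) lineOne = readerOne.ReadLine();
        if (lineTwo != null) lineTwo = readerTwo.ReadLine();
    }
}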

1 Answer


By far the most expensive thing you could ever do with a disk is to force the reader head to move from one track to another. It is a mechanical motion; the typical cost is about 13 milliseconds per track.

You are constantly moving the reader head back and forth from one file to the other. Buffering is required to reduce that cost; in other words, read a lot of data from one file in one gulp. The operating system already does some buffering: it reads a track's worth of data from the file. You need more.

Use one of the StreamReader constructors that allows you to specify the buffer size. With files this large, a buffer size of 50 megabytes is appropriate.
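
For example, a minimal sketch using the StreamReader(path, encoding, detectEncodingFromByteOrderMarks, bufferSize) constructor; the encoding is an assumption, and the 50 MB figure follows the suggestion above:

using System.IO;
using System.Text;

// 50 MB internal buffer per reader, per the suggestion above;
// the UTF-8 encoding is an assumption, not taken from the question.
const int bufferSize = 50 * 1024 * 1024;

StreamReader readerOne = new StreamReader(@"C:\temp\file1.txt", Encoding.UTF8, true, bufferSize);
StreamReader readerTwo = new StreamReader(@"C:\temp\file2.txt", Encoding.UTF8, true, bufferSize);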

Hans Passant
  • Thanks Hans. I've just tried buffer sizes of 32, 50 and even 100 MB, and there's no improvement in performance. However, I completely agree with your response... I'm a little bit "lost" with this. – HiseaSaw Jul 04 '14 at 07:33
  • The files are very large, it just takes a long time to read them from the disk. Minutes. Without any details at all about the kind of disk drive and how long this all takes, I have no idea how to help you further. Get a faster disk, SSDs are *very* nice. And above all, don't wait for the program to finish. A watched pot never boils. – Hans Passant Jul 04 '14 at 07:35
  • I would have said this as well but it looks like the process is bottlenecked on a different resource. Probably CPU. Use a profiler. – usr Jul 04 '14 at 07:58
  • +1. In addition, the writer stream should be buffered as well (see the sketch below). – MarkO Jul 04 '14 at 08:36
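
A minimal sketch of a buffered writer along the lines of MarkO's comment; the output path, encoding, and the 1 MB buffer size are hypothetical:

using System.IO;
using System.Text;

// Hypothetical output path and buffer size; StreamWriter(path, append, encoding, bufferSize)
// buffers writes in memory before they hit the disk.
using (StreamWriter writer = new StreamWriter(@"C:\temp\results.txt", false, Encoding.UTF8, 1024 * 1024))
{
    writer.WriteLine("field1,field2,field3");
}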