2

I'm reading values from a huge file (> 10 GB) using the following code:

FileStream fs = new FileStream(fileName, FileMode.Open);
BinaryReader br = new BinaryReader(fs);

int count = br.ReadInt32();
List<long> numbers = new List<long>(count);
for (int i = count; i > 0; i--)
{
    numbers.Add(br.ReadInt64());
}

Unfortunately, the read speed from my SSD is stuck at a few MB/s. I guess the limit is the IOPS of the SSD, so it might be better to read from the file in chunks.

Question

Does the FileStream in my code really read only 8 bytes from the file every time ReadInt64() is called on the BinaryReader?

If so, is there a transparent way to give the BinaryReader a stream that reads larger chunks from the file, to speed up the procedure?

Test-Code

Here's a minimal example that creates a test file and measures the read performance.

using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.IO;

namespace TestWriteRead
{
    class Program
    {
        static void Main(string[] args)
        {
            System.IO.File.Delete("test");
            CreateTestFile("test", 1000000000);

            Stopwatch stopwatch = new Stopwatch();
            stopwatch.Start();
            IEnumerable<long> test = Read("test");
            stopwatch.Stop();
            Console.WriteLine("File loaded within " + stopwatch.ElapsedMilliseconds + "ms");
        }

        private static void CreateTestFile(string filename, int count)
        {
            FileStream fs = new FileStream(filename, FileMode.CreateNew);
            BinaryWriter bw = new BinaryWriter(fs);

            bw.Write(count);
            for (int i = 0; i < count; i++)
            {
                long value = i;
                bw.Write(value);
            }

            fs.Close();
        }

        private static IEnumerable<long> Read(string filename)
        {
            FileStream fs = new FileStream(filename, FileMode.Open);
            BinaryReader br = new BinaryReader(fs);

            int count = br.ReadInt32();
            List<long> values = new List<long>(count);
            for (int i = 0; i < count; i++)
            {
                long value = br.ReadInt64();
                values.Add(value);
            }

            fs.Close();

            return values;
        }
    }
}
user2033412
  • http://stackoverflow.com/questions/3033771/file-i-o-with-streams-best-memory-buffer-size – Sebastian Schumann Oct 19 '15 at 11:51
  • Just saw that my initial comment was flawed. The number of read commands could really be the cause here (otherwise I would normally think of cache or memory size as the primary cause). Vera's linked question should offer a good starting point for your question. If reading in those chunks does not help, then the bottleneck is somewhere else, and you should also check memory usage and CPU usage and put that into your question. – Thomas Oct 19 '15 at 11:56
  • Can you use File.ReadAllBytes, then make a MemoryStream from the bytes and use BitConverter to read the data? It is faster than seeking. – fhnaseer Oct 19 '15 at 12:50
  • Unfortunately I can't, since File.ReadAllBytes only supports files up to 2 GB. – user2033412 Oct 19 '15 at 13:13
  • @user2033412: What is the value you are reading in your `count` variable? – displayName Oct 19 '15 at 15:00
  • @displayName: The number of values in the file. The file always has the count in the first 32 bits; after that, only long values (64 bits) follow. – user2033412 Oct 19 '15 at 15:53
  • @user2033412: I wish to know how many long values there are in the file. So what's the _value_ of `count`? – displayName Oct 19 '15 at 15:57
  • count is about 500 million to 4 billion (depending on the file). – user2033412 Oct 20 '15 at 05:55
  • I added example code to create a file and to measure the read performance. – user2033412 Oct 20 '15 at 11:54

3 Answers

4

You should configure the stream to use FileOptions.SequentialScan to indicate that you will read the stream from start to finish. It should improve the speed significantly.

Indicates that the file is to be accessed sequentially from beginning to end. The system can use this as a hint to optimize file caching. If an application moves the file pointer for random access, optimum caching may not occur; however, correct operation is still guaranteed.

using (
    var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
        FileOptions.SequentialScan))
{
    var br = new BinaryReader(fs);
    var count = br.ReadInt32();
    var numbers = new List<long>();
    for (int i = count; i > 0; i--)
    {
        numbers.Add(br.ReadInt64());
    }
}

Try reading blocks instead:

using (
    var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read, FileShare.ReadWrite, 8192,
        FileOptions.SequentialScan))
{
    var br = new BinaryReader(fs);
    var numbersLeft = br.ReadInt32(); // the count is a 32-bit integer; the values that follow are 64-bit
    byte[] buffer = new byte[8192];
    var bufferOffset = 0;
    long bytesLeftToReceive = sizeof(long) * (long)numbersLeft; // long, since 8 * count overflows int for large files
    var numbers = new List<long>();
    while (true)
    {
        // Do not read more than possible
        var bytesToRead = (int)Math.Min(bytesLeftToReceive, buffer.Length - bufferOffset);
        if (bytesToRead == 0)
            break;
        var bytesRead = fs.Read(buffer, bufferOffset, bytesToRead);
        if (bytesRead == 0)
            break; //TODO: Continue to read if file is not ready?

        //move forward in read counter
        bytesLeftToReceive -= bytesRead;
        bytesRead += bufferOffset; //include bytes from previous read.

        //decide how many complete numbers we got
        var numbersToCrunch = bytesRead / sizeof(long);

        //crunch them
        for (int i = 0; i < numbersToCrunch; i++)
        {
            numbers.Add(BitConverter.ToInt64(buffer, i * sizeof(long)));
        }

        // move the last incomplete number to the beginning of the buffer.
        var remainder = bytesRead % sizeof(long);
        Buffer.BlockCopy(buffer, bytesRead - remainder, buffer, 0, remainder);
        bufferOffset = remainder;
    }
}

Update in response to a comment:

May I know what's the reason that manual reading is faster than the other one?

I don't know how BinaryReader is actually implemented, so this is just an assumption.

The actual read from the disk is not the expensive part. The expensive part is moving the reader arm into the correct position on the disk.

As your application isn't the only one reading from the hard drive, the disk has to reposition itself every time an application requests a read.

Thus, if the BinaryReader just reads the requested int, it has to wait on the disk for every read (if some other application makes a read in between).

As I read a much larger buffer directly (which is faster), I can process more integers without having to wait for the disk between reads.

Caching will of course speed things up a bit, and that's why it's "just" three times faster.

(future readers: If something above is incorrect, please correct me).

jgauffin
  • Sounds good, unfortunately this didn't change anything either. I also tried different buffer sizes from 4k to 128k without luck :-( – user2033412 Oct 19 '15 at 12:50
  • I added example code to create a file and to measure the read performance. – user2033412 Oct 20 '15 at 11:54
  • See my second example, which reads chunks from the file instead of using the BinaryReader. – jgauffin Oct 20 '15 at 14:11
  • OH MY GOD! Your code is three times faster than my original code! AWESOME!! – user2033412 Oct 24 '15 at 17:55
  • You're welcome :) You can probably make it a bit faster too, for instance by allocating the correct size for the list from the start: `new List(fs.Length / sizeof(long));` and by experimenting with the buffer size (the stream and the byte buffer should have the same size). – jgauffin Oct 25 '15 at 07:39
  • @jgauffin: May I know what's the reason that manual reading is faster than the other one? – displayName Nov 19 '15 at 15:09
2

You can use a BufferedStream to increase the read buffer size.
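
For example, here's a minimal sketch, assuming the same file layout as in the question (an Int32 count followed by Int64 values); the 1 MB buffer size is just an illustrative choice to tune:

using (var fs = new FileStream(fileName, FileMode.Open, FileAccess.Read))
using (var bs = new BufferedStream(fs, 1024 * 1024)) // 1 MB read buffer
using (var br = new BinaryReader(bs))
{
    int count = br.ReadInt32();
    var numbers = new List<long>(count);
    for (int i = 0; i < count; i++)
    {
        numbers.Add(br.ReadInt64());
    }
}

The BinaryReader then serves its 4- and 8-byte reads from the in-memory buffer, and the underlying FileStream is only hit once per buffer fill.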

krafty
  • Sounds good, unfortunately it didn't change anything. I tried different buffer sizes from 4k to 128k; no difference at all :-( – user2033412 Oct 19 '15 at 12:05
  • @user2033412 Since you commented on your original question that a 2 GB file is not sufficient, I would suspect that a 128k buffer is basically negligible. Try using 50-100 MB before rejecting the buffered-stream idea. Maybe bigger. – apokryfos Oct 19 '15 at 14:03
  • I added example code to create a file and to measure the read performance. – user2033412 Oct 20 '15 at 11:54
0

In theory, memory-mapped files should help here. You could load the file into memory in several very large chunks. I'm not sure, though, how relevant this is when using an SSD.
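
As a rough sketch of that idea (assuming .NET 4+ and the file layout from the question: an Int32 count followed by Int64 values; the 8 MB chunk size is just an illustrative choice):

// requires the System.IO.MemoryMappedFiles namespace
using (var mmf = MemoryMappedFile.CreateFromFile(fileName, FileMode.Open))
using (var accessor = mmf.CreateViewAccessor())
{
    int count = accessor.ReadInt32(0);
    var numbers = new List<long>(count);

    var chunk = new long[1024 * 1024]; // copy 1M longs (8 MB) per chunk
    long offset = sizeof(int);         // the longs start right after the 32-bit count
    int done = 0;

    while (done < count)
    {
        int toRead = Math.Min(chunk.Length, count - done);
        accessor.ReadArray(offset, chunk, 0, toRead);
        for (int i = 0; i < toRead; i++)
        {
            numbers.Add(chunk[i]);
        }
        done += toRead;
        offset += (long)toRead * sizeof(long);
    }
}

Whether this actually beats buffered sequential reads on an SSD is something to measure; the point is that the data is copied out in large blocks instead of 8 bytes at a time.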