42

I have to split a huge file into many smaller files. Each of the destination files is defined by an offset and length as the number of bytes. I'm using the following code:

private void copy(string srcFile, string dstFile, int offset, int length)
{
    BinaryReader reader = new BinaryReader(File.OpenRead(srcFile));
    reader.BaseStream.Seek(offset, SeekOrigin.Begin);
    byte[] buffer = reader.ReadBytes(length);

    BinaryWriter writer = new BinaryWriter(File.OpenWrite(dstFile));
    writer.Write(buffer);
}

Considering that I have to call this function about 100,000 times, it is remarkably slow.

  1. Is there a way to make the Writer connected directly to the Reader? (That is, without actually loading the contents into the Buffer in memory.)
Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
ala
  • 7,070
  • 13
  • 47
  • 54
  • Are you splitting the file perfectly, i.e. could you rebuild the large file by just joining all the small files together? If so there are savings to be had there. If not, do the ranges of the small files overlap? Are they sorted in order of offset? – jamie Jan 29 '11 at 05:39

9 Answers9

49

I don't believe there's anything within .NET to allow copying a section of a file without buffering it in memory. However, it strikes me that this is inefficient anyway, as it needs to open the input file and seek many times. If you're just splitting up the file, why not open the input file once, and then just write something like:

public static void CopySection(Stream input, string targetFile, int length)
{
    byte[] buffer = new byte[8192];

    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

This has a minor inefficiency in creating a buffer on each invocation - you might want to create the buffer once and pass that into the method as well:

public static void CopySection(Stream input, string targetFile,
                               int length, byte[] buffer)
{
    using (Stream output = File.OpenWrite(targetFile))
    {
        int bytesRead = 1;
        // This will finish silently if we couldn't read "length" bytes.
        // An alternative would be to throw an exception
        while (length > 0 && bytesRead > 0)
        {
            bytesRead = input.Read(buffer, 0, Math.Min(length, buffer.Length));
            output.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }
}

Note that this also closes the output stream (due to the using statement) which your original code didn't.

The important point is that this will use the operating system file buffering more efficiently, because you reuse the same input stream, instead of reopening the file at the beginning and then seeking.

I think it'll be significantly faster, but obviously you'll need to try it to see...

This assumes contiguous chunks, of course. If you need to skip bits of the file, you can do that from outside the method. Also, if you're writing very small files, you may want to optimise for that situation too - the easiest way to do that would probably be to introduce a BufferedStream wrapping the input stream.

Jon Skeet
  • 1,421,763
  • 867
  • 9,128
  • 9,194
  • I know this is a two year old post, just wondered... is this still the fastest way? (i.e. Nothing new in .Net to be aware of?). Also, would it be faster to perform the `Math.Min` prior to entering the loop? Or better yet, to remove the length parameter as it can be calculated by means of the buffer? Sorry to be picky and necro this! Thanks in advance. – Smudge202 Apr 19 '12 at 09:13
  • 2
    @Smudge202: Given that this is performing IO, the call to Math.Min is certainly *not* going to be relevant in terms of performance. The point of having both the length parameter and the buffer length is to allow you to reuse a possibly-oversized buffer. – Jon Skeet Apr 19 '12 at 09:20
  • Gotcha, and thanks for getting back to me. I'd hate to start a new question when there is likely a good enough answer right here, but would you say, that if you wanted to read the first *x* bytes of a large number of files (for the purpose of grabbing the XMP metadata from a large number of files), the above approach (with some tweaking) would still be recommended? – Smudge202 Apr 19 '12 at 09:33
  • @Smudge202: Well the code above is for *copying*. If you only want to *read* the first x bytes, I'd still loop round, but just read into a right-sized buffer, incrementing the index at which the read will write into the buffer appropriately on each iteration. – Jon Skeet Apr 19 '12 at 09:34
  • Yup, I'm less interested in the writing part, I just wanted to confirm that the fastest way to read one file is also the fastest way to read many files. I imagined being able to P/Invoke for file pointers/offsets and from there being able to scan across multiple files with the same/less streams/buffers, which in my imaginary world of make believe, would possibly be even faster for what I want to achieve (though not applicable to the OP). If I'm not barking mad, probably best I start a new question. If I am, could you let me know so I don't waste even more peoples' time? :-) – Smudge202 Apr 19 '12 at 09:44
  • @Smudge202: Do you actually have a performance *problem* right now? Have you written the simplest code that works and found that it's too slow? Bear in mind that a lot can depend on context - reading in parallel may help if you're using solid state, but not on a normal hard disk, for example. – Jon Skeet Apr 19 '12 at 10:25
  • SO is advising I take this to a chat, I think I'll actually start a new question. Thank you thus far! – Smudge202 Apr 19 '12 at 10:34
30

The fastest way to do file I/O from C# is to use the Windows ReadFile and WriteFile functions. I have written a C# class that encapsulates this capability as well as a benchmarking program that looks at differnet I/O methods, including BinaryReader and BinaryWriter. See my blog post at:

http://designingefficientsoftware.wordpress.com/2011/03/03/efficient-file-io-from-csharp/

David Murdoch
  • 87,823
  • 39
  • 148
  • 191
Bob Bryan
  • 3,687
  • 1
  • 32
  • 45
6

How large is length? You may do better to re-use a fixed sized (moderately large, but not obscene) buffer, and forget BinaryReader... just use Stream.Read and Stream.Write.

(edit) something like:

private static void copy(string srcFile, string dstFile, int offset,
     int length, byte[] buffer)
{
    using(Stream inStream = File.OpenRead(srcFile))
    using (Stream outStream = File.OpenWrite(dstFile))
    {
        inStream.Seek(offset, SeekOrigin.Begin);
        int bufferLength = buffer.Length, bytesRead;
        while (length > bufferLength &&
            (bytesRead = inStream.Read(buffer, 0, bufferLength)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
        while (length > 0 &&
            (bytesRead = inStream.Read(buffer, 0, length)) > 0)
        {
            outStream.Write(buffer, 0, bytesRead);
            length -= bytesRead;
        }
    }        
}
Marc Gravell
  • 1,026,079
  • 266
  • 2,566
  • 2,900
  • 1
    Any reason for the flush at the end? Closing it should do that. Also, I think you want to subtract from length in the first loop :) – Jon Skeet Jun 05 '09 at 13:56
  • Good eyes Jon! The Flush was force of habit; from a lot of code when I pass streams in rather than open/close them in the method - it is convenient (if writing a non-trivial amount of data) to flush it before returning. – Marc Gravell Jun 05 '09 at 14:12
3

You shouldn't re-open the source file each time you do a copy, better open it once and pass the resulting BinaryReader to the copy function. Also, it might help if you order your seeks, so you don't make big jumps inside the file.

If the lengths aren't too big, you can also try to group several copy calls by grouping offsets that are near to each other and reading the whole block you need for them, for example:

offset = 1234, length = 34
offset = 1300, length = 40
offset = 1350, length = 1000

can be grouped to one read:

offset = 1234, length = 1074

Then you only have to "seek" in your buffer and can write the three new files from there without having to read again.

schnaader
  • 49,103
  • 10
  • 104
  • 136
3

Have you considered using the CCR since you are writing to separate files you can do everything in parallel (read and write) and the CCR makes it very easy to do this.

static void Main(string[] args)
    {
        Dispatcher dp = new Dispatcher();
        DispatcherQueue dq = new DispatcherQueue("DQ", dp);

        Port<long> offsetPort = new Port<long>();

        Arbiter.Activate(dq, Arbiter.Receive<long>(true, offsetPort,
            new Handler<long>(Split)));

        FileStream fs = File.Open(file_path, FileMode.Open);
        long size = fs.Length;
        fs.Dispose();

        for (long i = 0; i < size; i += split_size)
        {
            offsetPort.Post(i);
        }
    }

    private static void Split(long offset)
    {
        FileStream reader = new FileStream(file_path, FileMode.Open, 
            FileAccess.Read);
        reader.Seek(offset, SeekOrigin.Begin);
        long toRead = 0;
        if (offset + split_size <= reader.Length)
            toRead = split_size;
        else
            toRead = reader.Length - offset;

        byte[] buff = new byte[toRead];
        reader.Read(buff, 0, (int)toRead);
        reader.Dispose();
        File.WriteAllBytes("c:\\out" + offset + ".txt", buff);
    }

This code posts offsets to a CCR port which causes a Thread to be created to execute the code in the Split method. This causes you to open the file multiple times but gets rid of the need for synchronization. You can make it more memory efficient but you'll have to sacrifice speed.

HasaniH
  • 8,232
  • 6
  • 41
  • 59
  • 1
    Remember with this (or any threading solution) you can hit a stage where you will max out your IO: you will have hit your best throughput(ie if attempting to write hundreds/thousands of small files at the same time, several large files etc). I have always found that if I can make one file read/write efficiently there is little I can do to improve on that by parallelising (Assembly can help a lot, make read/writes in assembler and it can be spectacular, up to the IO limits, however it can be a pain to write, and you need to be sure you want direct hardware or BIOS level access to your devices – GMasucci Apr 09 '13 at 10:26
1

The first thing I would recommend is to take measurements. Where are you losing your time? Is it in the read, or the write?

Over 100,000 accesses (sum the times): How much time is spent allocating the buffer array? How much time is spent opening the file for read (is it the same file every time?) How much time is spent in read and write operations?

If you aren't doing any type of transformation on the file, do you need a BinaryWriter, or can you use a filestream for writes? (try it, do you get identical output? does it save time?)

JMarsch
  • 21,484
  • 15
  • 77
  • 125
1

Using FileStream + StreamWriter I know it's possible to create massive files in little time (less than 1 min 30 seconds). I generate three files totaling 700+ megabytes from one file using that technique.

Your primary problem with the code you're using is that you are opening a file every time. That is creating file I/O overhead.

If you knew the names of the files you would be generating ahead of time, you could extract the File.OpenWrite into a separate method; it will increase the speed. Without seeing the code that determines how you are splitting the files, I don't think you can get much faster.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
mcauthorn
  • 598
  • 2
  • 6
  • 15
0

No one suggests threading? Writing the smaller files looks like text book example of where threads are useful. Set up a bunch of threads to create the smaller files. this way, you can create them all in parallel and you don't need to wait for each one to finish. My assumption is that creating the files(disk operation) will take WAY longer than splitting up the data. and of course you should verify first that a sequential approach is not adequate.

TheSean
  • 4,516
  • 7
  • 40
  • 50
  • Threading may help, but his bottleneck is surely on the I/O -- the CPU is probably spending a lot of time waiting on the disk. That's not to say that threading wouldn't make any difference (for example, if the writes are to different spindles, then he might get a better performance boost than he would if it were all on one disk) – JMarsch Jun 05 '09 at 16:17
-1

(For future reference.)

Quite possibly the fastest way to do this would be to use memory mapped files (so primarily copying memory, and the OS handling the file reads/writes via its paging/memory management).

Memory Mapped files are supported in managed code in .NET 4.0.

But as noted, you need to profile, and expect to switch to native code for maximum performance.

Richard
  • 106,783
  • 21
  • 203
  • 265
  • 1
    Memory mapped files are page aligned so they are out. The problem here is more likely disc access time, and memory mapped files wouldnt help with that anyway. The OS is going to manage caching files whether they are memory mapped or not. – jamie Jan 29 '11 at 05:34