29

I'd like to know how I can split a large file without using too many system resources. I'm currently using this code:

public static void SplitFile(string inputFile, int chunkSize, string path)
{
    byte[] buffer = new byte[chunkSize];

    using (Stream input = File.OpenRead(inputFile))
    {
        int index = 0;
        while (input.Position < input.Length)
        {
            using (Stream output = File.Create(path + "\\" + index))
            {
                int chunkBytesRead = 0;
                while (chunkBytesRead < chunkSize)
                {
                    int bytesRead = input.Read(buffer, 
                                               chunkBytesRead, 
                                               chunkSize - chunkBytesRead);

                    if (bytesRead == 0)
                    {
                        break;
                    }
                    chunkBytesRead += bytesRead;
                }
                output.Write(buffer, 0, chunkBytesRead);
            }
            index++;
        }
    }
}

The operation takes 52.370 seconds to split a 1.6GB file into 14MB files. I'm not concerned about how long the operation takes; I'm more concerned about the system resources used, as this app will be deployed to a shared hosting environment. Currently this operation maxes out my system's HDD IO usage at 100% and slows my system down considerably. CPU usage is low; RAM ramps up a bit, but seems fine.

Is there a way I can restrict this operation from using too many resources?

Thanks

Bruce Adams

5 Answers

34

It seems odd to assemble each output file in memory; I suspect you should be running an inner buffer (maybe 20k or something) and calling Write more frequently.

Ultimately, if you need IO, you need IO. If you want to be courteous to a shared hosting environment you could add deliberate pauses - maybe short pauses within the inner loop, and a longer pause (maybe 1s) in the outer loop. This won't affect your overall timing much, but may help other processes get some IO.

Example of a buffer for the inner-loop:

public static void SplitFile(string inputFile, int chunkSize, string path)
{
    const int BUFFER_SIZE = 20 * 1024;
    byte[] buffer = new byte[BUFFER_SIZE];

    using (Stream input = File.OpenRead(inputFile))
    {
        int index = 0;
        while (input.Position < input.Length)
        {
            using (Stream output = File.Create(path + "\\" + index))
            {
                int remaining = chunkSize, bytesRead;
                while (remaining > 0 && (bytesRead = input.Read(buffer, 0,
                        Math.Min(remaining, BUFFER_SIZE))) > 0)
                {
                    output.Write(buffer, 0, bytesRead);
                    remaining -= bytesRead;
                }
            }
            index++;
            Thread.Sleep(500); // experimental; perhaps try it
        }
    }
}
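
The example above pauses only in the outer loop, between chunks. For the shorter pauses within the inner loop mentioned earlier, the write loop could look something like this (the 10ms value is just a guess to experiment with, not a measured recommendation):

int remaining = chunkSize, bytesRead;
while (remaining > 0 && (bytesRead = input.Read(buffer, 0,
        Math.Min(remaining, BUFFER_SIZE))) > 0)
{
    output.Write(buffer, 0, bytesRead);
    remaining -= bytesRead;
    Thread.Sleep(10); // brief pause after each write to let other IO through
}

Either way, you'd be trading total elapsed time for a lower sustained IO load, which sounds like exactly the trade you want.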
Marc Gravell
7

I have modified the code in the question a bit in case you wanted to split by chunks while making sure each chunk ends on a line ending:

private static void SplitFile(string inputFile, int chunkSize, string path)
{
    byte[] buffer = new byte[chunkSize];
    List<byte> extraBuffer = new List<byte>();

    using (Stream input = File.OpenRead(inputFile))
    {
        int index = 0;
        while (input.Position < input.Length)
        {
            using (Stream output = File.Create(path + "\\" + index + ".csv"))
            {
                int chunkBytesRead = 0;
                while (chunkBytesRead < chunkSize)
                {
                    int bytesRead = input.Read(buffer,
                                               chunkBytesRead,
                                               chunkSize - chunkBytesRead);

                    if (bytesRead == 0)
                    {
                        break;
                    }

                    chunkBytesRead += bytesRead;
                }

                // Look at the last byte actually read; the final chunk
                // may be shorter than chunkSize.
                byte extraByte = buffer[chunkBytesRead - 1];
                while (extraByte != '\n')
                {
                    int flag = input.ReadByte();
                    if (flag == -1)
                        break;
                    extraByte = (byte)flag;
                    extraBuffer.Add(extraByte);
                }

                output.Write(buffer, 0, chunkBytesRead);
                if (extraBuffer.Count > 0)
                    output.Write(extraBuffer.ToArray(), 0, extraBuffer.Count);

                extraBuffer.Clear();
            }
            index++;
        }
    }
}
Michael Bahig
0

This is a problem for your host, not you. Assuming this is absolutely the thing you need to do, then you are pretty much doing it the most efficient way you can. It's up to them to manage resources according to load, priority, SLA, etc., in the same way your Hypervisor/VM/OS/App Server/whatever does.

Split files away and use the facilities you have paid for!

LoztInSpace
0

Currently this operation maxes out my system's HDD IO usage at 100%.

This is logical - the IO is going to be your limiting factor, and your system probably has the same crappy IO as most computers (one slow disc, not a RAID 10 of high-performance discs).

You can use a decent chunk size (1MB upward) to reduce small reads and writes, but in the end that is all you CAN do. Or get a faster disc subsystem.
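
For illustration, a minimal sketch of a copy loop using a 1MB buffer (the size is just the lower bound suggested above, and inputFile/outputFile are placeholder paths):

const int BUFFER_SIZE = 1024 * 1024; // 1MB reads and writes
byte[] buffer = new byte[BUFFER_SIZE];

// Explicit internal buffer sizes on the streams as well
using (var input = new FileStream(inputFile, FileMode.Open,
        FileAccess.Read, FileShare.Read, BUFFER_SIZE))
using (var output = new FileStream(outputFile, FileMode.Create,
        FileAccess.Write, FileShare.None, BUFFER_SIZE))
{
    int bytesRead;
    while ((bytesRead = input.Read(buffer, 0, BUFFER_SIZE)) > 0)
    {
        output.Write(buffer, 0, bytesRead);
    }
}

Fewer, larger requests give the disc a chance to do sequential work instead of seeking.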

TomTom
  • Ah. No. Most hosters ignore the IO side. RAID maybe, but then cheap discs. Good performance is expensive. I get about 400MB/s stable IO - on 10 (!) Velociraptors. The discs alone cost nearly 3000 USD ;) – TomTom Oct 19 '10 at 20:20
0

An option you have is to throttle the operation. If you, for example, bring the buffer back down to a smaller size (somewhere between 4K and 1MB) and put a Thread.Sleep between the operations, you will use fewer resources at any given moment.
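
A rough sketch of that idea, throttling the copy to a fixed target rate (the 64K buffer and 10MB/s target are arbitrary starting points to tune, not recommendations):

const int BUFFER_SIZE = 64 * 1024;                      // between the 4K and 1MB above
const long TARGET_BYTES_PER_SECOND = 10L * 1024 * 1024; // example target rate
byte[] buffer = new byte[BUFFER_SIZE];
long written = 0;
var watch = System.Diagnostics.Stopwatch.StartNew();

int bytesRead;
while ((bytesRead = input.Read(buffer, 0, BUFFER_SIZE)) > 0)
{
    output.Write(buffer, 0, bytesRead);
    written += bytesRead;

    // How long the copy *should* have taken so far at the target rate;
    // if we're ahead of schedule, sleep off the difference.
    long expectedMs = written * 1000 / TARGET_BYTES_PER_SECOND;
    long aheadMs = expectedMs - watch.ElapsedMilliseconds;
    if (aheadMs > 0)
        Thread.Sleep((int)aheadMs);
}

This keeps the average IO rate bounded regardless of buffer size, rather than relying on a fixed sleep value.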

Pieter van Ginkel