12

I am writing a .NET application running on Windows Server 2016 that does an HTTP GET on a bunch of pieces of a large file. This dramatically speeds up the download process since you can download them in parallel. Unfortunately, once they are downloaded, it takes a fairly long time to piece them all back together.

There are between 2,000 and 4,000 files that need to be combined. The server this will run on has PLENTY of memory, close to 800GB. I thought it would make sense to use MemoryStreams to store the downloaded pieces until they can be sequentially written to disk, BUT I am only able to consume about 2.5GB of memory before I get a System.OutOfMemoryException. The server has hundreds of GB available, and I can't figure out how to use them.

Maximilian Burszley
Josh Dayberry
  • `ConcatenatedStream` from [How do I concatenate two System.Io.Stream instances into one?](https://stackoverflow.com/a/3879231) might meet your needs, as long as you don't need random seeking. – dbc Nov 07 '18 at 16:13
  • And, make sure you are compiled to x64 (or Any CPU without the "prefer 32 bit" flag, and running on x64) – Flydog57 Nov 07 '18 at 16:22
  • You could create a file of the target size and make each chunk HTTP request download directly into that file. That way there is no need to combine at all (see the sketch after this comment thread). – usr Nov 07 '18 at 20:49
  • A few comments here, so I'll reply to them in order. 1. I don't see what that would buy me other than simplifying the combination logic with the provided class. – Josh Dayberry Nov 07 '18 at 21:49
  • 2. It is x64; I used dumpbin to validate that large addresses were supported. – Josh Dayberry Nov 07 '18 at 21:49
  • 3. This could work, but since it is a parallel download, the logic to reassemble the pieces needs to support them arriving out of order. How do you suggest I handle that? – Josh Dayberry Nov 07 '18 at 21:50
  • Here is a simple program to recreate my issue. `static void Main(string[] args) { MemoryStream ms1 = new MemoryStream((int)Math.Pow(1024, 3)); MemoryStream ms2 = new MemoryStream((int)Math.Pow(1024, 3)); MemoryStream ms3 = new MemoryStream((int)(Math.Pow(1024, 3)*.95)); MemoryStream ms4 = new MemoryStream((int)Math.Pow(1024, 3)); //this errors out with an out of memory error }` – Josh Dayberry Nov 08 '18 at 17:36
  • The third MemoryStream is interesting. I multiplied the total number of allowed bytes by .95 and it will work; if I go up to .96 or higher it won't. The 4th MemoryStream causes the out-of-memory error if the third doesn't. – Josh Dayberry Nov 08 '18 at 17:39
  • @JoshDayberry, can you please try the following code (space optimized :D) and tell me how much could be allocated? `static void Main(){List<byte[]> d = new List<byte[]>(); Random r = new Random(); while (true) { try { var a = new byte[1024 * 1024]; r.NextBytes(a); d.Add(a); Console.WriteLine($"{d.Count} MB allocated"); } catch { Console.WriteLine("Further allocation failed.");}}}` If you cannot allocate more than 4GB with this code, you can be almost 100% sure that you are running an x86 version of your code (possibly compiled as _Any CPU_ with _Prefer 32-bit_ set). – Markus Safar Nov 09 '18 at 22:08
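
A minimal sketch of usr's suggestion above, assuming the final file size is known up front; the file name and sizes are hypothetical, and each parallel chunk gets its own FileStream handle:

using System.IO;
using System.Threading.Tasks;

class PreSizedFileSketch
{
    static void Main()
    {
        const long totalSize = 1L * 1024 * 1024 * 1024; // assumed: known final size
        const int chunkSize = 16 * 1024 * 1024;
        int chunkCount = (int)(totalSize / chunkSize);

        // Pre-size the target file once.
        using (var fs = new FileStream("target.bin", FileMode.Create))
            fs.SetLength(totalSize);

        // Each parallel "download" opens its own handle and writes at its
        // own offset, so chunks can arrive and land in any order.
        Parallel.For(0, chunkCount, i =>
        {
            using (var fs = new FileStream("target.bin", FileMode.Open,
                                           FileAccess.Write, FileShare.ReadWrite))
            {
                fs.Seek((long)i * chunkSize, SeekOrigin.Begin);
                byte[] chunk = new byte[chunkSize]; // stand-in for downloaded bytes
                fs.Write(chunk, 0, chunk.Length);
            }
        });
    }
}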

4 Answers

13

MemoryStreams are built around byte arrays. Arrays cannot be larger than 2GB currently.

The current implementation of System.Array uses Int32 for all its internal counters etc, so the theoretical maximum number of elements is Int32.MaxValue.

There's also a 2GB max-size-per-object limit imposed by the Microsoft CLR.

As you try to put the content in a single MemoryStream, the underlying array gets too large, hence the exception.

Try to store the pieces separately and write them directly to the FileStream (or whatever you use) when ready, without first trying to concatenate them all into one object.
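
For example, here is a minimal sketch of that approach, assuming the pieces are indexed and may arrive out of order; the piece count, sizes, and file name are hypothetical stand-ins:

using System.Collections.Concurrent;
using System.IO;
using System.Threading;

class AssemblerSketch
{
    static void Main()
    {
        var pieces = new ConcurrentDictionary<int, byte[]>();
        int totalPieces = 8; // stand-in; the real app has 2,000-4,000

        // Simulate parallel downloads completing out of order.
        for (int i = totalPieces - 1; i >= 0; i--)
            pieces[i] = new byte[16]; // stand-in for downloaded bytes

        // Write each piece to disk as soon as the next one in order is
        // available, then drop it so the memory can be reclaimed.
        using (var output = new FileStream("combined.bin", FileMode.Create))
        {
            for (int next = 0; next < totalPieces; next++)
            {
                byte[] piece;
                while (!pieces.TryRemove(next, out piece))
                    Thread.Sleep(10); // next piece not downloaded yet
                output.Write(piece, 0, piece.Length);
            }
        }
    }
}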

Marcell Toth
  • There is an application setting which allows the creation of arrays larger than 2GB. See https://learn.microsoft.com/en-us/dotnet/framework/configure-apps/file-schema/runtime/gcallowverylargeobjects-element (a config sketch follows this comment thread) – ckuri Nov 07 '18 at 17:08
  • @ckuri I believe that using that you still won't be able to create **byte** arrays larger than 2GB, as they are still bounded by the maximum index in a single dimension. Read the remarks section: "The maximum index in any single dimension is 2,147,483,591 (0x7FFFFFC7) for byte arrays..." – Marcell Toth Nov 07 '18 at 17:20
  • I tried it, and you are right. It's not possible to initialize an array with a length larger than 2^31. So `new long[2_000_000_000]` successfully created a 16 GB array, but `new byte[3_000_000_000]` threw an OverflowException. – ckuri Nov 07 '18 at 17:41
  • I tried using multiple MemoryStreams that were all smaller than 2GB and put them in a Queue. Regardless of the size I used, I could create as many as would fill about 2.5 GB, and I would start getting out-of-memory errors when I tried to create another. – Josh Dayberry Nov 07 '18 at 21:46
  • @JoshDayberry If you post a complete, reproducible test case as another question, I'd be more than happy to try to help you. I have no other good guess right now. Side note: why are you storing `MemoryStream`s exactly? I would think you could just store the `byte[]`s directly. – Marcell Toth Nov 07 '18 at 21:57
  • I'm storing MemoryStreams because I have a ton of memory available, and I can't write to disk as fast as I can download the files. I'm running two async processes: one downloads the chunks in parallel and the other pieces them together serially. Previously I was just downloading the chunks as files and using the serial background process to piece them together into one large file. When I was doing that, the download would finish in a third of the time the piecing together took, which is why I thought I'd move half the IO to MemoryStreams. – Josh Dayberry Nov 07 '18 at 22:07
  • @MarcellTóth here is a super simple example that seems to recreate my issue. `static void Main(string[] args) { MemoryStream ms1 = new MemoryStream((int)Math.Pow(1024, 3)); MemoryStream ms2 = new MemoryStream((int)Math.Pow(1024, 3)); MemoryStream ms3 = new MemoryStream((int)(Math.Pow(1024, 3)*.95)); MemoryStream ms4 = new MemoryStream((int)Math.Pow(1024, 3)); //this errors out with an out of memory error }` – Josh Dayberry Nov 08 '18 at 17:34
  • @JoshDayberry You are 99% running it as *x86* then. What I'm guessing is that your app is set to **AnyCPU (prefer 32 bit)**. This will make your code run as a 32-bit assembly too; see this: https://stackoverflow.com/a/12066861/10614791 Set it explicitly to x64 (there is no point in AnyCPU if it will outright crash on a 32-bit system) and it will work. I reproduced your issue, and this solves it. – Marcell Toth Nov 08 '18 at 22:41
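
For reference, the application setting ckuri mentions is a runtime element in app.config (it only has an effect in 64-bit processes); a minimal sketch:

<configuration>
  <runtime>
    <!-- Allow arrays whose total size exceeds 2 GB (per-dimension index limits still apply). -->
    <gcAllowVeryLargeObjects enabled="true" />
  </runtime>
</configuration>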
3

According to the source code of the MemoryStream class, you will not be able to store more than 2 GB of data in one instance of this class. The reason is that the maximum length of the stream is set to Int32.MaxValue and the maximum index of an array is set to 0x7FFFFFC7, which is 2,147,483,591 decimal (≈ 2 GB).

Snippet from MemoryStream:

private const int MemStreamMaxLength = Int32.MaxValue;

Snippet from Array:

// We impose limits on maximum array lenght in each dimension to allow efficient 
// implementation of advanced range check elimination in future.
// Keep in sync with vm\gcscan.cpp and HashHelpers.MaxPrimeArrayLength.
// The constants are defined in this method: inline SIZE_T MaxArrayLength(SIZE_T componentSize) from gcscan
// We have different max sizes for arrays with elements of size 1 for backwards compatibility
internal const int MaxArrayLength = 0X7FEFFFFF;
internal const int MaxByteArrayLength = 0x7FFFFFC7;

The question More than 2GB of managed memory was already discussed a long time ago on the Microsoft forum, and it references a blog article about BigArray, a way of getting around the 2 GB array size limit.

Update

I suggest using the following code, which should be able to allocate more than 4 GB on an x64 build but will fail below 4 GB on an x86 build:

using System;
using System.Collections.Generic;

private static void Main(string[] args)
{
    List<byte[]> data = new List<byte[]>();
    Random random = new Random();

    while (true)
    {
        try
        {
            // Allocate 1 MB and fill it with random data so the pages
            // are actually committed rather than just reserved.
            var tmpArray = new byte[1024 * 1024];
            random.NextBytes(tmpArray);
            data.Add(tmpArray);
            Console.WriteLine($"{data.Count} MB allocated");
        }
        catch (OutOfMemoryException)
        {
            Console.WriteLine("Further allocation failed.");
            break; // stop once the runtime refuses further allocations
        }
    }
}
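
On an x64 build this should keep climbing well past 4,096 MB (bounded only by physical memory and the pagefile); on an x86 build it should stop in the 2-3 GB range.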
Markus Safar
  • I wasn't storing more than 2GB in a stream. I was, however, using multiple streams, one per chunk, to hold the data. No matter what size I made the chunks, it always seemed to peak at about 2.5GB of RAM usage before it would give out-of-memory errors. I was storing the many MemoryStreams in a Queue. – Josh Dayberry Nov 07 '18 at 21:43
  • @JoshDayberry, I see... What is the configuration of your pagefile? I know there can be some issues with memory allocation if you have the size of the pagefile configured the wrong way. I just searched for the article and found it again: [Pushing the limits of windows physical memory](https://blogs.technet.microsoft.com/markrussinovich/2008/07/21/pushing-the-limits-of-windows-physical-memory/) and [Pushing the limits of windows virtual memory](https://blogs.technet.microsoft.com/markrussinovich/2008/11/17/pushing-the-limits-of-windows-virtual-memory/). The last one may help you... – Markus Safar Nov 08 '18 at 01:14
  • The server has around 800GB of memory and we are getting stuck at around 3 GB. Page file utilization is steady at 0%. Why would the pagefile be relevant? – Josh Dayberry Nov 08 '18 at 17:50
1

As has already been pointed out, the main problem here is the nature of MemoryStream being backed by a byte[], which has a fixed upper size.

The option of using an alternative Stream implementation has been noted. Another alternative is to look into "pipelines", the new IO API. A "pipeline" is based around discontiguous memory, which means it isn't required to use a single contiguous buffer; the pipelines library will allocate multiple slabs as needed, which your code can process. I have written extensively on this topic; part 1 is here. Part 3 probably has the most code focus.
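
A minimal, self-contained sketch of the Pipe API (assuming the System.IO.Pipelines package; the sizes and fill pattern here are hypothetical):

using System;
using System.Buffers;
using System.IO.Pipelines;
using System.Threading.Tasks;

class PipeSketch
{
    static async Task Main()
    {
        var pipe = new Pipe();

        // Producer: the pipe hands out memory slabs, so no single
        // contiguous buffer ever has to hold the whole payload.
        Task writing = Task.Run(async () =>
        {
            for (int i = 0; i < 4; i++)
            {
                Memory<byte> slab = pipe.Writer.GetMemory(64 * 1024);
                slab.Span.Fill((byte)i);        // stand-in for downloaded bytes
                pipe.Writer.Advance(64 * 1024); // commit exactly 64 KB
                await pipe.Writer.FlushAsync();
            }
            pipe.Writer.Complete();
        });

        // Consumer: read whatever (possibly discontiguous) segments are ready.
        while (true)
        {
            ReadResult result = await pipe.Reader.ReadAsync();
            ReadOnlySequence<byte> buffer = result.Buffer;
            Console.WriteLine($"Read {buffer.Length} bytes");
            pipe.Reader.AdvanceTo(buffer.End); // mark everything as consumed
            if (result.IsCompleted) break;
        }
        pipe.Reader.Complete();
        await writing;
    }
}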

Marc Gravell
0

Just to confirm that I understand your question: you're downloading a single very large file in multiple parallel chunks, and you know how big the final file is? If you don't, then this does get a bit more complicated, but it can still be done.

The best option is probably to use a MemoryMappedFile (MMF). Create the destination file via the MMF, have each thread create a view accessor to that file and write to it in parallel, and close the MMF at the end. This essentially gives you the behavior you wanted with MemoryStreams, but Windows backs the file with disk. One of the benefits of this approach is that Windows manages flushing the data to disk in the background, so you don't have to, and it should result in excellent performance.
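
A minimal sketch of that idea; the file name, sizes, and dummy chunk data are hypothetical stand-ins:

using System.IO;
using System.IO.MemoryMappedFiles;
using System.Threading.Tasks;

class MmfSketch
{
    static void Main()
    {
        const long totalSize = 1L * 1024 * 1024 * 1024; // assumed: final size known up front
        const int chunkSize = 16 * 1024 * 1024;
        int chunkCount = (int)(totalSize / chunkSize);

        // Create the destination file at its final size, backed by the OS page cache.
        using (var mmf = MemoryMappedFile.CreateFromFile(
            "output.bin", FileMode.Create, null, totalSize))
        {
            // Each "download" writes its chunk at its own offset,
            // so pieces can complete in any order.
            Parallel.For(0, chunkCount, i =>
            {
                using (var view = mmf.CreateViewStream((long)i * chunkSize, chunkSize))
                {
                    byte[] chunk = new byte[chunkSize]; // stand-in for downloaded bytes
                    view.Write(chunk, 0, chunk.Length);
                }
            });
        } // disposing flushes; Windows writes dirty pages out in the background
    }
}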

Aaron Lieberman