
I have many large gzip files (approximately 10 MB to 200 MB) that I downloaded from FTP and need to decompress.

I googled around and found this solution for gzip decompression:

    static byte[] Decompress(byte[] gzip)
    {
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            const int size = 4096;
            byte[] buffer = new byte[size];
            using (MemoryStream memory = new MemoryStream())
            {
                int count = 0;
                do
                {
                    count = stream.Read(buffer, 0, size);
                    if (count > 0)
                    {
                        memory.Write(buffer, 0, count);
                    }
                }
                while (count > 0);
                return memory.ToArray();
            }
        }
    }

It works well for any file below 50 MB, but once the input is larger than 50 MB I get a System.OutOfMemoryException. The last position and length of the memory stream before the exception is 134217728 (exactly 128 MB). I don't think this is related to my physical memory; I understand that I can't have an object larger than 2 GB since I'm running 32-bit.

I also need to process the data after decompressing the files. I'm not sure if a memory stream is the best approach here, but I don't really like writing to a file and then reading the file back again.

My questions

  • Why did I get a System.OutOfMemoryException?
  • What is the best approach to decompress gzip files and do some text processing afterwards?
William Calvin
  • You are loading the entire contents of the stream into memory and returning it as a byte array. What else would you expect *other* than an out of memory exception? You should not be loading it all into memory like this -- what do you ultimately intend to do with the array? Write it to a file? Whatever you intend, it should be stream-based, and not array-based. – Kirk Woll May 03 '12 at 01:06
  • Well, the exception occurs on memory.Write and gets stuck at 134217728. I'm not familiar with memory management, so please bear with me. Later I will save all the processed data into a database; the file inside each gzip archive is a CSV file. – William Calvin May 03 '12 at 01:08
  • Sure, but your design would be better if you processed it *while* you are unzipping it. That way you wouldn't have to allocate an enormous chunk of memory to handle it (for example, by throwing your gzip stream directly into a `StreamReader`; see the sketch after these comments). – Kirk Woll May 03 '12 at 01:09
  • Probably the mistake is most easily spotted in your function's prototype: `static byte[] Decompress(byte[] gzip)`. You want to take a _stream_ as a parameter, not an array. – sarnold May 03 '12 at 01:10
  • Thanks for the suggestion. I will try using a stream. – William Calvin May 03 '12 at 01:20
  • any final solution with full source code sample? – Kiquenet Dec 12 '12 at 21:35
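
A minimal sketch of the stream-based approach suggested in the comments, assuming the usual `System.IO` and `System.IO.Compression` usings: wrap the GZipStream in a `StreamReader` and process the CSV line by line, so no large array is ever allocated. `ProcessCsvLine` is a hypothetical placeholder for the actual processing step.

    static void ProcessGzipFile(string path)
    {
        using (FileStream file = File.OpenRead(path))
        using (GZipStream gzip = new GZipStream(file, CompressionMode.Decompress))
        using (StreamReader reader = new StreamReader(gzip))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                ProcessCsvLine(line); // hypothetical: parse the CSV row / insert into the database
            }
        }
    }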

4 Answers


MemoryStream's memory allocation strategy is not friendly to huge amounts of data.

Since the contract for MemoryStream is to keep a contiguous array as the underlying storage, it has to reallocate that array repeatedly as a large stream grows (capacity doubles, so roughly log2(size_of_stream) reallocations). Side effects of such reallocation are:

  • long copy delays on each reallocation
  • the new array must fit into free address space that is already heavily fragmented by previous allocations
  • the new array lands on the large object heap (LOH), which has its own quirks (no compaction, collected only during Gen 2 GC)

As a result, handling a large (100 MB+) stream through a MemoryStream will likely cause an out-of-memory exception on x86 systems. In addition, the most common pattern for returning the data is to call ToArray, as you do, which requires about the same amount of space again as the last array buffer used by the MemoryStream.

Approaches to solve:

  • The cheapest way is to pre-grow the MemoryStream to approximately the size you need (preferably slightly larger); see the sketch after this list. You can pre-compute the required size by reading into a fake stream that does not store anything (a waste of CPU, but you will be able to read it). Consider also returning the stream instead of a byte array (or returning the MemoryStream's buffer along with its length).
  • Another option, if you need the whole stream or byte array, is to use a temporary file stream instead of a MemoryStream to store the large amount of data.
  • A more complicated approach is to implement a stream that chunks the underlying data into smaller (e.g. 64 KB) blocks, avoiding LOH allocation and data copying when the stream needs to grow.
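
A minimal sketch of the first option, assuming .NET 4 for `Stream.CopyTo` (on earlier versions, use a manual read loop as in the question); `GetDecompressedLength` is a hypothetical helper name:

    // Throwaway measuring pass: count the decompressed bytes without storing them.
    static long GetDecompressedLength(byte[] gzip)
    {
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            byte[] buffer = new byte[4096];
            long total = 0;
            int count;
            while ((count = stream.Read(buffer, 0, buffer.Length)) > 0)
                total += count;
            return total;
        }
    }

    static MemoryStream Decompress(byte[] gzip)
    {
        // Pre-size the MemoryStream so it never reallocates, and return the
        // stream itself instead of calling ToArray(), which would need the
        // same amount of memory a second time.
        MemoryStream memory = new MemoryStream(checked((int)GetDecompressedLength(gzip)));
        using (GZipStream stream = new GZipStream(new MemoryStream(gzip), CompressionMode.Decompress))
        {
            stream.CopyTo(memory);
        }
        memory.Position = 0;
        return memory;
    }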
Alexei Levenkov
  • Yes, thanks for clarifying this for me. I kind of understand now: a memory stream was not a good friend for me in this case. I thought it would improve performance, but instead it gave me more headaches. Thanks – William Calvin May 03 '12 at 03:30

You can try a test like the following to get a feel for how much you can write to a MemoryStream before getting an OutOfMemoryException:

    const int bufferSize = 4096;
    byte[] buffer = new byte[bufferSize];

    int fileSize = 1000 * 1024 * 1024;
    int total = 0;

    try
    {
        using (MemoryStream memory = new MemoryStream())
        {
            while (total < fileSize)
            {
                memory.Write(buffer, 0, bufferSize);
                total += bufferSize;
            }
        }

        MessageBox.Show("No errors");
    }
    catch (OutOfMemoryException)
    {
        MessageBox.Show("OutOfMemory around size : " + (total / (1024m * 1024m)) + "MB");
    }

You may have to unzip to a temporary physical file first, then re-read it in small chunks and process as you go; a sketch follows.
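
A minimal sketch of that temp-file approach, assuming the usual `System`, `System.IO`, and `System.IO.Compression` usings and that the gzipped payload is a text (CSV) file as the OP mentions (`processLine` is a hypothetical callback):

    static void ProcessGzipViaTempFile(string gzipPath, Action<string> processLine)
    {
        string tempPath = Path.GetTempFileName();
        try
        {
            // Spill the decompressed data to disk so it never sits in memory at once.
            using (GZipStream gzip = new GZipStream(File.OpenRead(gzipPath), CompressionMode.Decompress))
            using (FileStream temp = File.Create(tempPath))
            {
                byte[] buffer = new byte[4096];
                int count;
                while ((count = gzip.Read(buffer, 0, buffer.Length)) > 0)
                    temp.Write(buffer, 0, count);
            }

            // Re-read in small chunks: here, one line at a time.
            using (StreamReader reader = new StreamReader(tempPath))
            {
                string line;
                while ((line = reader.ReadLine()) != null)
                    processLine(line);
            }
        }
        finally
        {
            File.Delete(tempPath);
        }
    }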

Side point: interestingly, on a Windows XP PC, the test code above gives "OutOfMemory around size : 256MB" when targeting .NET 2.0, and "OutOfMemory around size : 512MB" on .NET 4.

Moe Sisko
  • I already specified above: it got stuck at 134217728, which is roughly 128 MB if I'm correct. I'm not sure why this happens so early, but I guess choosing a memory stream was my first mistake. Thanks for your answer – William Calvin May 03 '12 at 03:28
  • Can confirm I've hit the EXACT same limit. – Kris Jan 11 '17 at 04:49

Do you happen to be processing files on multiple threads? That would consume a large amount of your address space. OutOfMemory errors usually aren't related to physical memory, so a MemoryStream can run out far earlier than you'd expect. Check this discussion: http://social.msdn.microsoft.com/Forums/en-AU/csharpgeneral/thread/1af59645-cdef-46a9-9eb1-616661babf90. If you switched to a 64-bit process, you'd probably be more than fine for the file sizes you're dealing with.

In your current situation, though, you could work with memory-mapped files to get around the address-space limits. .NET 4.0 provides a native wrapper for the underlying Windows functions: http://msdn.microsoft.com/en-us/library/dd267535.aspx. A sketch follows.
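
A minimal sketch of that idea, assuming .NET 4.0's System.IO.MemoryMappedFiles (plus System.IO and System.IO.Compression): decompress straight into a disk-backed memory-mapped file, then read it back through a view stream. The map name and capacity below are illustrative placeholders, not values the answer specifies:

    static void DecompressToMemoryMappedFile(string gzipPath, string mapPath, long capacity)
    {
        // Create a persisted memory-mapped file large enough for the decompressed data.
        using (MemoryMappedFile mmf = MemoryMappedFile.CreateFromFile(mapPath, FileMode.Create, "gzipData", capacity))
        using (MemoryMappedViewStream view = mmf.CreateViewStream())
        using (GZipStream gzip = new GZipStream(File.OpenRead(gzipPath), CompressionMode.Decompress))
        {
            gzip.CopyTo(view); // Stream.CopyTo is available from .NET 4
        }
    }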

Michael Yoon

I understand that I can't have object more than 2GB since I use 32-bit

That is incorrect. You can have as much memory as you need; the 32-bit limitation only means you get 4 GB of virtual address space (and the OS takes half of it). Virtual address space is not memory. Here is a nice read.

Why did I get a System.OutOfMemoryException?

Because the allocator could not find contiguous address space for your object, or because it happens too fast and the address space clogs up (most likely the first one).

What is the best approach to decompress gzip files and do some text processing afterwards?

Write a script that downloads the files, then uses a tool like gzip or 7-Zip to decompress them, and then processes them. Depending on the kind of processing, the number of files, and their total size, you will have to save the data to disk at some point to avoid this kind of memory problem. Save the files after unzipping and process about 1 MB at a time; a sketch follows.
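
A minimal sketch of that script idea, assuming System.Diagnostics and System.IO usings, a `gzip` executable on the PATH, and archives ending in ".gz" (the per-line processing is a placeholder):

    static void DecompressWithExternalTool(string gzipPath)
    {
        // "gzip -d file.gz" replaces file.gz with the decompressed file.
        ProcessStartInfo psi = new ProcessStartInfo
        {
            FileName = "gzip",
            Arguments = "-d \"" + gzipPath + "\"",
            UseShellExecute = false
        };
        using (Process process = Process.Start(psi))
        {
            process.WaitForExit();
        }

        // Then process the decompressed CSV in small chunks, e.g. line by line.
        string csvPath = gzipPath.Substring(0, gzipPath.Length - ".gz".Length);
        foreach (string line in File.ReadLines(csvPath)) // File.ReadLines needs .NET 4
        {
            // hypothetical per-line processing / database insert goes here
        }
    }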

Lukasz Madon
  • [The OP is correct about the 2GB *array-size* limit](http://stackoverflow.com/questions/1087982/single-objects-still-limited-to-2-gb-in-size-in-clr-4-0). Also, I think suggesting an external tool such as 7-Zip completely misses the spirit of this question. – Kirk Woll May 03 '12 at 01:22