
I wrote a piece of test code that uses AsParallel to read big files concurrently. It appears to cause a memory leak: the GC doesn't seem to reclaim the unused objects as expected. Please see the code snippet.

        static void Main(string[] args)
        {
            int[] array = new[] { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10 };

            //foreach (var element in array) // No leaks
            //{
            //    ReadBigFile();
            //}

            array.AsParallel().ForAll(l => ReadBigFile()); // Memory leak

            Console.ReadLine();
        }

        private static void ReadBigFile()
        {
            List<byte[]> lst = new List<byte[]>();
            var file = "<BigFilePath>"; // 600 MB
            lst.Add(File.ReadAllBytes(file));
        }

I tried this both sequentially and in parallel. The sequential foreach runs fine, with no memory leak. But when I use AsParallel to read the file concurrently, the memory usage grows to about 6 GB and never goes back down.

Please help me identify the root cause. And what is the right way to do this if I want to complete the same task concurrently? Thank you.

PS: The issue happens on both .NET Framework 4.6.1 and .NET 6.0.

  • FYI: [Understanding garbage collection in .NET](https://stackoverflow.com/questions/17130382/understanding-garbage-collection-in-net). I wonder if it's simply not progressing the list to the final generation as quickly as you expect, so the objects aren't garbage collected. Perhaps you could experiment with [`AddMemoryPressure`](https://learn.microsoft.com/en-us/dotnet/api/system.gc.addmemorypressure?view=net-6.0) to see if it causes your data to be disposed faster? Also, unless you really need to, I would suggest streaming the file data rather than loading it all into memory. – ProgrammingLlama Apr 13 '22 at 02:27
  • @DiplomacyNotWar Thanks for the info. But the total memory goes up to 6 GB, and I also tried pushing it to 10 GB, which is 80% of the system memory, so I think it should definitely be time for the GC to recycle. Also, when I don't use AsParallel, the memory is recycled immediately. I didn't use streaming because this is just a test: I wanted to quickly reproduce high memory usage and see whether the GC works as expected. Doing it this way makes the behavior evident. – Frank Feng Apr 13 '22 at 02:40
  • I don't really see how there could be a memory leak here, to be honest. The place where you'd get a memory leak is in unmanaged code, but `File.ReadAllBytes` wraps that up for you. Everything else is managed, so it should be garbage collected correctly. If you run the same `array.AsParallel().` code a second time, does the memory further increase or does it remain? I tested it myself, and the application went from 1.6 MB to 18 GB after the first run, and then 18 GB again after the second execution of that same method. – ProgrammingLlama Apr 13 '22 at 02:45
  • `array.AsParallel().ForAll(l => ReadBigFile()); Console.WriteLine(System.GC.GetTotalMemory(false)); System.GC.Collect(); Console.WriteLine(System.GC.GetTotalMemory(false));` first wrote 18886915096 (bytes) and then 555768 (bytes). – ProgrammingLlama Apr 13 '22 at 02:51
  • @DiplomacyNotWar When I added GC.Collect() after the parallel read, the memory got recycled. Now I realize it may not be a memory leak, but there are still questions here. Why do I have to explicitly call GC.Collect()? What should I do so that the memory is released automatically, as it is for the sequential execution? – Frank Feng Apr 13 '22 at 03:11
  • @DiplomacyNotWar I initially thought it was a memory leak only because the memory wasn't released automatically, even after 10 GB had been allocated. Automatic release is quite important to me, because this code pattern exists in our production code, a Web API project. If it doesn't release the memory automatically, it can be as harmful as a real memory leak. – Frank Feng Apr 13 '22 at 03:17
  • Personally, I would suggest reevaluating your approach to solving whatever problem this solves. Is there a way to do it that won't require so much memory to be allocated simultaneously? You could call `GC.Collect()`, but that seems like a bandaid to me rather than a solution. When to call it is discussed [here](https://stackoverflow.com/questions/478167/when-is-it-acceptable-to-call-gc-collect). – ProgrammingLlama Apr 13 '22 at 03:30
  • @DiplomacyNotWar This code is actually a simulator that demonstrates the situation in our production code. I just want to know how to handle large objects concurrently without worrying about retained memory. – Frank Feng Apr 13 '22 at 05:07
  • I'd say there is no one-size-fits-all way to deal with large objects concurrently without caring about memory use. As always, memory will be freed when more memory is needed. In parallel operations, loading large amounts of stuff into memory concurrently will mean that all memory is referenced/in scope so it can't just be disposed of. Once the parallel operations have finished and the object is no longer referenced, the memory will be freed as more memory is required. – ProgrammingLlama Apr 13 '22 at 05:30
  • Frank what do you mean with *"No leaks"* in the commented `foreach` code block? Do you mean that printing the `GC.GetTotalMemory(false)` after the `foreach` shows a number close to zero? – Theodor Zoulias Apr 13 '22 at 05:31
  • @TheodorZoulias I mean that after the `foreach` the memory usage is very low (50 MB), but after `AsParallel().ForAll` the memory usage is up to 6 GB and is not recycled automatically. – Frank Feng Apr 13 '22 at 05:40
  • Well, those 50 MB are not recycled automatically either, so you can't say that the synchronous `foreach` *"is working perfectly"*. Actually I tried to reproduce your tests, and immediately after the `foreach` the reported memory usage was 600 MB, not 50. – Theodor Zoulias Apr 13 '22 at 05:59
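Expanding ProgrammingLlama's measurement from the comments into a standalone sketch (the file path below is a placeholder, and the exact numbers will differ per machine):

    using System;
    using System.Collections.Generic;
    using System.IO;
    using System.Linq;

    class MemoryProbe
    {
        // Placeholder path; substitute any large file on your machine.
        private const string BigFilePath = @"C:\temp\bigfile.bin";

        static void Main()
        {
            int[] array = Enumerable.Range(1, 10).ToArray();

            array.AsParallel().ForAll(_ => ReadBigFile());

            // Managed heap size while the dead byte[] objects still await a Gen 2 collection.
            Console.WriteLine(GC.GetTotalMemory(false));

            GC.Collect(); // full blocking collection of all generations, including the LOH

            // Heap size after the large arrays have been reclaimed.
            Console.WriteLine(GC.GetTotalMemory(false));
        }

        private static void ReadBigFile()
        {
            var lst = new List<byte[]> { File.ReadAllBytes(BigFilePath) };
        }
    }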

2 Answers

3

The objects allocated to hold your 600 MB file are considered "large objects" and as such are allocated on the Large Object Heap (LOH).

To clean up these objects, a Generation 2 collection needs to occur. This doesn't happen as often as Generation 0 collections for short-lived objects. A refresher course on how the Garbage Collector works is a good idea to understand this.
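A quick way to see this in action is to check which generation the runtime reports for an array right after allocation; arrays at or above the large-object threshold (roughly 85,000 bytes) go to the LOH, which is collected together with generation 2 (a minimal sketch, sizes chosen arbitrarily):

    using System;

    class LohDemo
    {
        static void Main()
        {
            var small = new byte[1_000];   // ordinary small-object-heap allocation
            var large = new byte[100_000]; // above the ~85,000-byte threshold => LOH

            Console.WriteLine(GC.GetGeneration(small)); // typically 0 (a fresh Gen 0 object)
            Console.WriteLine(GC.GetGeneration(large)); // typically 2 (LOH objects report Gen 2)
        }
    }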

The reason GC.Collect() "frees" this memory is that calling it with no arguments performs a full collection of every generation, including the Large Object Heap in Gen 2.
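If you do decide to force a collection (the caveats in the comments above apply), a full blocking collection, optionally with LOH compaction, looks roughly like this; `GCSettings.LargeObjectHeapCompactionMode` is available from .NET Framework 4.5.1 onward:

    using System;
    using System.Runtime;

    class ForcedCollection
    {
        static void CollectEverything()
        {
            // By default the LOH is swept but not compacted; opt in to compaction
            // for the next blocking full collection if fragmentation is a concern.
            GCSettings.LargeObjectHeapCompactionMode = GCLargeObjectHeapCompactionMode.CompactOnce;

            GC.Collect();                  // full, blocking collection of all generations
            GC.WaitForPendingFinalizers(); // let any pending finalizers run
            GC.Collect();                  // reclaim objects released by those finalizers
        }
    }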

To address your concerns about memory in production - you should consider streaming these files in if possible. If not, you will need to carefully batch your files because crunching through hundreds of half-gig files in parallel is likely to cripple you in both CPU and IO depending on the environment. The runtime can only ask for so much memory from the OS.
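As a sketch of the streaming-plus-batching idea (the directory, file pattern, and degree of parallelism below are illustrative assumptions, not recommendations for any particular workload), PLINQ lets you cap concurrency explicitly while each worker reads its file in small chunks:

    using System.IO;
    using System.Linq;

    class BatchedStreamingReader
    {
        static void Main()
        {
            // Placeholder input; in production the paths would come from your own source.
            string[] files = Directory.GetFiles(@"C:\temp\input", "*.bin");

            files.AsParallel()
                 .WithDegreeOfParallelism(2) // process at most 2 files at a time
                 .ForAll(path =>
                 {
                     using (var stream = File.OpenRead(path))
                     {
                         // An 80 KB buffer stays below the large-object threshold.
                         var buffer = new byte[81_920];
                         int read;
                         while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                         {
                             // Process buffer[0..read] here instead of keeping the whole file in memory.
                         }
                     }
                 });
        }
    }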

Simon Whitehead
  • I use such a big file just for the simulation, so I can see the result quickly. In our production it's not large files but big objects (300 KB per request), so streaming is not applicable. What I want to confirm is: if I use AsParallel, is the GC able to recycle the memory automatically in time? What I see now is that with a synchronous foreach the GC works perfectly and the memory allocation stays around 50 MB. But with AsParallel the memory goes up to 6 GB, and I never see it go down until I explicitly call GC.Collect(). – Frank Feng Apr 13 '22 at 03:47
  • If the memory gets freed up when you call GC.Collect, then it's not a leak. It will get collected normally when the GC runs. – Ilian Apr 13 '22 at 04:05
  • @Ilian But I don't actually see it get collected. And why does the synchronous call get collected normally? – Frank Feng Apr 13 '22 at 04:40
  • @Simon For the synchronous foreach execution, it's the same large objects. They should fall into Generation 2 too, right? So why does the GC automatically work perfectly in that case? – Frank Feng Apr 13 '22 at 05:03
  • @FrankFeng Think about it: if you do it in parallel, you're doing it in parallel, so all objects are required in memory at the same time. You can't just dispose of one mid-way through. If you do it sequentially then you'll have no references to the object from the previous loop, which makes it available for moving to the next generation and ultimately garbage collection. Side note: the default capacity of a list is 4 items, and it doubles/copies each time you need more. I'd expect it to double ~28 times for a list containing 600 MB's worth of individual bytes. – ProgrammingLlama Apr 13 '22 at 05:22
  • You can be explicit with the list's initial capacity (e.g. `var lst = new List<byte>(fileSize);`) which will mean that all this resizing isn't necessary. Another thing is that `ReadAllBytes` already returns an array, so you might not actually need a list in the first place. – ProgrammingLlama Apr 13 '22 at 05:24
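Taking the last two comments together, a trimmed-down helper might skip the wrapping list entirely, or pre-size it when several results really do need to be collected (a sketch; `ReadAll` and its parameters are illustrative):

    using System.Collections.Generic;
    using System.IO;

    static class FileReading
    {
        // File.ReadAllBytes already returns a byte[]; no wrapping list is required.
        public static byte[] ReadBigFile(string path) => File.ReadAllBytes(path);

        // If several results genuinely need to be kept, size the list up front so its
        // backing array never has to grow and be copied while items are added.
        public static List<byte[]> ReadAll(IReadOnlyList<string> paths)
        {
            var results = new List<byte[]>(paths.Count);
            foreach (var path in paths)
                results.Add(ReadBigFile(path));
            return results;
        }
    }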
0

@DiplomacyNotWar I like your previous comment:

> If you do it sequentially then you'll have no references to the object from the previous loop, which makes it available for moving to the next generation and ultimately garbage collection.

Then I modified the code as follows:

        int[] array = new[] { 1, 2 };
        for (int i = 0; i < 5; i++)
        {
            array.AsParallel().ForAll(l => ReadBigFile()); 
        }

Now I can see the memory allocation is only 1.1 GB, which should be the memory needed for the last round of the loop. So now I'm convinced it's just a matter of GC timing and not a real memory leak. Thank you @DiplomacyNotWar very much!
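To make the "only the last round stays resident" observation measurable, the loop can print the managed heap size after each round (a sketch reusing `ReadBigFile` from the question; the absolute numbers will vary):

        int[] array = new[] { 1, 2 };
        for (int i = 0; i < 5; i++)
        {
            array.AsParallel().ForAll(l => ReadBigFile());

            // Earlier rounds' byte[] objects become eligible for collection, so the
            // reported heap size reflects (roughly) only the most recent round.
            Console.WriteLine($"Round {i}: {GC.GetTotalMemory(false):N0} bytes");
        }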

Frank Feng