
UPDATE: The bug described below appears to be resolved in PowerShell 5. It remains in 3 and 4, so don't process any huge files with the pipeline unless you're running PowerShell 2 or 5.


Consider the following code snippet:

function Get-DummyData() {
    for ($i = 0; $i -lt 10000000; $i++) {
        "This is freaking huge!! I'm a ninja! More words, yay!"
    }
}

Get-DummyData | Out-Null

This will cause PowerShell memory usage to grow uncontrollably. After executing Get-DummyData | Out-Null a few times, I have seen PowerShell memory usage get all the way up to 4 GB.

According to ANTS Memory Profiler, we have a whole lot of things sitting around in the garbage collector's finalization queue. When I call [GC]::Collect(), the memory goes from 4 GB to a mere 70 MB. So we don't have a memory leak, strictly speaking.

Now, it's not good enough for me to be able to call [GC]::Collect() when I'm finished with a long-lived pipeline operation. I need garbage collection to happen during a pipeline operation. However, if I try to invoke [GC]::Collect() while the pipeline is executing...

function Get-DummyData() {
    for ($i = 0; $i -lt 10000000; $i++) {
        "This is freaking huge!! I'm a ninja! More words, yay!"

        if ($i % 1000000 -eq 0) {
            Write-Host "Prompting a garbage collection..."
            [GC]::Collect()
        }
    }
}

Get-DummyData | Out-Null

... the problem remains. Memory usage grows uncontrollably again. I have tried several variations of this, such as adding [GC]::WaitForPendingFinalizers(), Start-Sleep -Seconds 10, etc. I have tried changing garbage collector latency modes and forcing PowerShell to use server garbage collection to no avail. I just can't get the garbage collector to do its thing while the pipeline is executing.
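
For illustration, one of those variations looked roughly like this (the exact combination of calls and the latency mode shown are just examples of the kind of thing I tried):

function Get-DummyData() {
    for ($i = 0; $i -lt 10000000; $i++) {
        "This is freaking huge!! I'm a ninja! More words, yay!"

        if ($i % 1000000 -eq 0) {
            # Force a blocking collection and wait for finalizers, too
            [GC]::Collect()
            [GC]::WaitForPendingFinalizers()
        }
    }
}

# Also tried a different GC latency mode before running the pipeline
[System.Runtime.GCSettings]::LatencyMode = [System.Runtime.GCLatencyMode]::Batch
Get-DummyData | Out-Null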

This isn't a problem at all in PowerShell 2.0. It's also interesting to note that $null = Get-DummyData seems to work without memory issues. So it seems tied to the pipeline, rather than to the fact that we're generating tons of strings.

How can I prevent my memory from growing uncontrollably during long pipelines?

Side note:

My Get-DummyData function is only for demonstration purposes. My real-world problem is that I'm unable to read through large files in PowerShell using Get-Content or Import-Csv. No, I'm not storing the contents of these files in variables. I'm strictly using the pipeline like I'm supposed to. Get-Content .\super-huge-file.txt | Out-Null produces the same problem.

Phil
  • Sounds a bit like http://stackoverflow.com/q/30918020/258523. – Etan Reisner Jul 24 '15 at 22:38
  • The memory exhaustion part sounds like a bug. You can significantly reduce CPU time by avoiding piping/enumerating 10 million objects by using assignment, casting or property enumeration – Mathias R. Jessen Jul 24 '15 at 22:55
  • I cannot reproduce the problem with the provided code snippet. – Roman Kuzmin Jul 25 '15 at 05:52
  • @RomanKuzmin Are you using PowerShell 2.0? – Phil Jul 27 '15 at 22:30
  • I'm seeing > 2 GBs of memory usage during the execution of the second example on both V4 and V5 build 5.0.10240.16384. – Keith Hill Jul 28 '15 at 02:54
  • [Addressing the PowerShell Garbage Collection bug](http://www.jhouseconsulting.com/2017/09/25/addressing-the-powershell-garbage-collection-bug-1825) at J House Consulting points to this question and suggests the answer is to include `[System.GC]::GetTotalMemory($true) | out-null` in your loop/whatever. – Ross Patterson Jun 12 '18 at 19:10

2 Answers


A couple of things to point out here. First, GC calls do work in the pipeline. Here's a pipeline script that only invokes the GC:

1..10 | Foreach {[System.GC]::Collect()}

Here's the perfmon graph of GCs during the time the script ran:

[Figure: perfmon graph of GC activity while the script ran]

However, just because you invoke the GC, it doesn't mean the private memory usage will return to the value it had before your script started. A GC collect will only collect memory that is no longer used. If there is a rooted reference to an object, it is not eligible to be collected (freed). So while GC systems typically don't leak in the C/C++ sense, they can have memory hoards that hold onto objects longer than perhaps they should.
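
To make the rooted-reference point concrete, here's a tiny illustrative sketch (unrelated to the pipeline issue itself, just to the collection rule):

# $rooted stays reachable from the session (a rooted reference),
# so the strings it refers to survive a collection.
$rooted = 1..100000 | ForEach-Object { "still reachable $_" }
[GC]::Collect()    # frees unreferenced garbage, but not what $rooted holds

$rooted = $null    # drop the root...
[GC]::Collect()    # ...now those strings are eligible to be collected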

In looking at this with a memory profiler it seems the bulk of the excess memory is taken up by a copy of the string with parameter binding info:

[Figure: memory profiler view showing most of the excess memory held by string copies with parameter binding info]

The root path for these strings looks like this:

[Figure: memory profiler view of the GC root path for the retained strings]

I wonder if there is some logging feature that is causing PowerShell to hang onto a string-ized form of pipeline-bound objects?

BTW in this specific case, it is much more memory efficient to assign to $null to ignore the output:

$null = Get-DummyData

Also, if you need to simply edit a file, check out the Edit-File command in the PowerShell Community Extensions 3.2.0. It should be memory efficient as long as you don't use the SingleString switch parameter.

Keith Hill
  • I bugged this on Connect. Vote for it there if you want - https://connect.microsoft.com/PowerShell/feedback/details/1599091/event-logging-memory-hoard-when-processing-a-large-number-of-pipeline-objects – Keith Hill Jul 28 '15 at 03:18
  • While it doesn't exactly solve my problem, I think this sheds light on the fact that it's a bug that only MS can fix. Thanks for digging so much into it. – Phil Aug 06 '15 at 02:11
  • No problem. I've espoused the benefits of streaming data through the pipeline instead of storing everything in a variable - not realizing that PowerShell is essentially doing just that - to some degree. – Keith Hill Aug 06 '15 at 03:32

It's not at all uncommon to find that the native cmdlets fall short when you're doing something unusual like processing a massive text file. Personally, I've found that working with large files in PowerShell goes much better when you script it with System.IO.StreamReader:

$SR = New-Object -TypeName System.IO.StreamReader -ArgumentList 'C:\super-huge-file.txt'
# Compare against $null explicitly so blank lines don't end the loop early;
# ReadLine() returns $null only at end of file.
while ($null -ne ($line = $SR.ReadLine())) {
    Do-Stuff $line
}
$SR.Close()   # Close() returns nothing, so piping to Out-Null isn't needed

Note that you should use an absolute path in the ArgumentList. With a relative path, it always seems to assume you're in your home directory.
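
If you'd rather not hard-code the full path, one possible workaround (just a sketch; Convert-Path is used here only to expand the relative PowerShell path before .NET sees it):

# Resolve the relative path to a full filesystem path first, since
# .NET APIs don't follow PowerShell's current location.
$path = Convert-Path '.\super-huge-file.txt'
$SR = New-Object -TypeName System.IO.StreamReader -ArgumentList $path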

Get-Content is simply meant to read the entire file into memory as an array and then output it. I think it just calls System.IO.File.ReadAllLines().

I don't know of any way to tell PowerShell to discard items from the pipeline immediately upon completion, or to tell it that a function may return items asynchronously; instead, it preserves order. It may not allow it because it has no natural way to tell that an object isn't going to be used later on, or that later objects won't need to refer to earlier objects.

The other nice thing about PowerShell is that you can often adopt the C# answers. I've never tried File.ReadLines, but it looks like it might be pretty easy to use, too.
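
For what it's worth, a rough sketch of what that could look like from PowerShell (Do-Stuff is the same placeholder as above; File.ReadLines yields lines lazily rather than loading the whole file):

# File.ReadLines returns a lazy IEnumerable[string], so only one line
# needs to be in memory at a time.
foreach ($line in [System.IO.File]::ReadLines('C:\super-huge-file.txt')) {
    Do-Stuff $line
}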

Bacon Bits
  • Even with the StreamReader approach, the fact that you're pushing strings through the pipeline causes the problem. Also, I don't think Get-Content returns a simple array of strings. I have used it in the past in PowerShell 2.0 to process hundreds of megabytes with negligible memory usage. – Phil Jul 27 '15 at 22:32
  • @Phil The key with the StreamReader approach is that you're not using a pipeline at all. You're reading the file line by line instead of reading the whole file and piping the contents. You're doing everything you need where I have `Do-Stuff $line;`. The problem is that you can't access two lines at once and performance may be worse since your IO may bottleneck, but in return you use basically no memory. A Google search will reveal that many people have memory problems with Get-Content, however. `Get-Content | [...]` has different memory usage than `$x = Get-Content`, and that's not clear – Bacon Bits Jul 28 '15 at 12:03
  • another way to do this is `foreach ($line in [system.io.file]::readlines('path to file')) {do-stuff}` – Robert Cotterman Apr 22 '22 at 01:42