If you use streaming and pipelining, you should reduce the problem with (3) a lot, because when you stream, each object is passed along the pipeline as and when it is available and does not take up much memory, and you should be able to process millions of files (though it will take time).
Get-ChildItem $directory -recurse | Measure-Object -property length -sum
I don't believe @Stej's statement, "Get-ChildItem probably reads all entries in the directory and then begins pushing them to the pipeline.", is true. Pipelining is a fundamental concept of PowerShell (provided the cmdlets, scripts, etc. support it). It ensures that processed objects are passed along the pipeline one by one, as and when they are available, and only when they are needed. Get-ChildItem is not going to behave differently.
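If you want to see this one-by-one behaviour for yourself, here is a minimal sketch (the Trace-Stream function name is mine, purely for illustration): put a tracing stage in the middle of the pipeline and each item is reported the moment Get-ChildItem emits it, long before the recursive scan has finished.

function Trace-Stream {
    process {
        # runs once per incoming object, as soon as it arrives
        Write-Host ("received: {0}" -f $_.FullName)
        $_    # hand the object on to the next stage of the pipeline
    }
}

Get-ChildItem C:\Windows -Recurse | Trace-Stream | Measure-Object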
A great example of this is given in Understanding the Windows PowerShell Pipeline.
Quoting from it:
The Out-Host -Paging command is a useful pipeline element whenever you
have lengthy output that you would like to display slowly. It is
especially useful if the operation is very CPU-intensive. Because
processing is transferred to the Out-Host cmdlet when it has a
complete page ready to display, cmdlets that precede it in the
pipeline halt operation until the next page of output is available.
You can see this if you use the Windows Task Manager to monitor CPU
and memory use by Windows PowerShell.
Run the following command: Get-ChildItem C:\Windows -Recurse.

Compare the CPU and memory usage to this command: Get-ChildItem C:\Windows -Recurse | Out-Host -Paging.
A benchmark using Get-ChildItem on c:\ (about 179,516 files - not millions, but good enough):
Memory usage after running $a = gci c:\ -recurse (and then doing $a.count) was 527,332K.

Memory usage after running gci c:\ -recurse | measure-object was 59,452K and never went above around 80,000K.

(Memory - Private Working Set - from Task Manager, watching the memory of the powershell.exe process. Initially, it was about 22,000K.)
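If you want to reproduce this kind of comparison yourself, a rough sketch is to read the working set of the current PowerShell process before and after the pipeline runs. (The numbers will differ per machine, and WorkingSet64 is not exactly the Private Working Set column Task Manager shows, so treat the figures as ballpark.)

# Working set of this PowerShell process before the pipeline runs
$before = (Get-Process -Id $PID).WorkingSet64

# Stream the listing straight into Measure-Object instead of a variable
gci c:\ -Recurse -ErrorAction SilentlyContinue | Measure-Object

# Working set after the pipeline has completed
$after = (Get-Process -Id $PID).WorkingSet64

'{0:N0}K -> {1:N0}K' -f ($before / 1KB), ($after / 1KB)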
I also tried with two million files (it took me a while to create them!) and ran a similar experiment:

Memory usage after running $a = gci c:\ -recurse (and then doing $a.count) was 2,808,508K.

Memory usage while running gci c:\ -recurse | measure-object was 308,060K and never went above around 400,000K. After it finished, I had to do a [GC]::Collect() for it to return to the 22,000K levels.
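For reference, that cleanup step is just a forced garbage collection followed by another look at the working set; something along these lines (the exact figure you get back will of course vary):

# Force a collection once the pipeline has finished, then check the
# working set again; this is what brought the process back towards
# its starting level in the run above.
[GC]::Collect()
[GC]::WaitForPendingFinalizers()
'{0:N0}K after collection' -f ((Get-Process -Id $PID).WorkingSet64 / 1KB)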
I am still convinced that Get-ChildItem and pipelining can get you great memory improvements even for millions of files.