8

I'm trying to write a script that would go through 1.6 million files in a folder and move them to the correct folder based on the file name.

The reason is that NTFS can't handle a large number of files within a single folder without performance degrading.

The script calls "Get-ChildItem" to get all the items within that folder, and as you might expect, this consumes a lot of memory (about 3.8 GB).

I'm curious if there are any other ways to iterate through all the files in a directory without using up so much memory.

T.Ho

3 Answers

14

If you do

$files = Get-ChildItem $dirWithMillionsOfFiles
#Now, process with $files

you WILL face memory issues.

Use PowerShell piping to process the files:

Get-ChildItem $dirWithMillionsOfFiles | %{ 
    #process here
}

The second way consumes less memory because each file is handed down the pipeline as it is enumerated, so memory usage should ideally not grow beyond a certain point.
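
For example, a pipelined move might look like the sketch below (the $destRoot variable and the rule of bucketing files by the first two characters of their name are assumptions for illustration, since the question doesn't say how a file name maps to a folder):

Get-ChildItem $dirWithMillionsOfFiles | %{
    # Assumed rule: the first two characters of the file name pick the target subfolder
    $target = Join-Path $destRoot $_.Name.Substring(0, 2)
    if (-not (Test-Path $target)) { New-Item -ItemType Directory -Path $target | Out-Null }
    Move-Item -LiteralPath $_.FullName -Destination $target
}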

manojlds
  • Thanks for the nice and simple solution. I had always thought pipelining in PowerShell returned the entire result before processing the next function. – T.Ho Sep 05 '12 at 04:58
  • 2
    This actually still requires `O(n)` memory, but if it solves the problem then I agree it's the best solution. – latkin Sep 05 '12 at 16:53
13

If you need to reduce the memory footprint, you can skip using Get-ChildItem and instead use a .NET API directly. I'm assuming you are on PowerShell v2; if so, first follow the steps here to enable .NET 4 to load in PowerShell v2.

In .NET 4 there are some nice APIs for enumerating files and directories, as opposed to returning them in arrays.

[IO.Directory]::EnumerateFiles("C:\logs") | %{ <# move file $_ here #> }

By using this API, instead of [IO.Directory]::GetFiles(), only one file name will be processed at a time, so the memory consumption should be relatively small.
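
For example, filling in the placeholder with an actual move (a rough sketch; $destFolder is an assumed destination, and note that $_ here is a plain path string, not a FileInfo object):

[IO.Directory]::EnumerateFiles("C:\logs") | %{
    # $_ is the full path as a string; Move-Item accepts it directly
    Move-Item -LiteralPath $_ -Destination $destFolder
}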

Edit

I was also assuming you had tried a simple pipelined approach like Get-ChildItem |ForEach { process }. If this is enough, I agree it's the way to go.

But I want to clear up a common misconception: In v2, Get-ChildItem (or really, the FileSystem provider) does not truly stream. The implementation uses the APIs Directory.GetDirectories and Directory.GetFiles, which in your case will generate a 1.6M-element array before any processing can occur. Once this is done, then yes, the remainder of the pipeline is streaming. And yes, this initial low-level piece has relatively minimal impact, since it is simply a string array, not an array of rich FileInfo objects. But it is incorrect to claim that O(1) memory is used in this pattern.
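
To make the difference concrete, here is a small illustration (not part of the original answer; C:\logs is a placeholder path):

# GetFiles materializes the entire array of names before returning
$allNames = [IO.Directory]::GetFiles("C:\logs")

# EnumerateFiles yields names lazily, one at a time
[IO.Directory]::EnumerateFiles("C:\logs") | Select-Object -First 5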

Powershell v3, in contrast, is built on .NET 4, and thus takes advantage of the streaming APIs I mention above (Directory.EnumerateDirectories and Directory.EnumerateFiles). This is a nice change, and helps in scenarios just like yours.

latkin
  • I think using the pipeline with Get-ChildItem as manojlds suggested would achieve the same thing, but thanks for showing me how to use .NET with PowerShell! :) – T.Ho Sep 05 '12 at 05:01
  • Yep, get-childitem | foreach-object { ... } will also process only one passed item at a time. – x0n Sep 05 '12 at 14:35
  • 2
    See my edit. `get-childitem | foreach {...}` is only pseudo-streaming, it technically still requires `O(n)` memory. – latkin Sep 05 '12 at 16:49
  • Thanks for the clarification. I'm actually using PowerShell v3 on Windows 8, so perhaps that's why it is enumerating through the directory and actually returns a FileInfo object for the $input variable. So then, is it true that if I were to run this script on PowerShell v2 (Windows 7), the $input variable would not be a FileInfo object, but rather an n-size string array? – T.Ho Sep 05 '12 at 17:21
  • 1
    No, not quite. All versions will output `FileInfo` objects. What I'm saying is that in v2, internally a big string array is generated up front, and elements from that array are used one at a time to build `FileInfo` objects to send down the pipeline. Only one `FileInfo` is in memory at once, but the array of file paths is stored in its entirety, internally. In v3, no internal array is created at all, everything is fully streamed. – latkin Sep 05 '12 at 17:28
  • Just curious, is it documented somewhere that `Get-ChildItem` uses EnumerateFiles in v3? – Andy Arismendi Sep 05 '12 at 21:49
  • I read it in some email or blog or something, don't recall. This (http://blogs.msdn.com/b/powershell/archive/2009/11/04/why-is-get-childitem-so-slow.aspx) is pretty close to what I read. But you can see it explicitly by viewing `System.Management.Automation.dll` in Reflector or ildasm.exe. Look at `Microsoft.Powershell.Commands.FileSystemProvider` method `Dir`. – latkin Sep 05 '12 at 22:04
  • @latkin, your way looks awesome; however, I'm new to PowerShell and don't know .NET at all. I am facing the same issue asked here: https://stackoverflow.com/questions/48190354/compare-files-using-powershell-and-create-only-non-existing/48190635?noredirect=1#comment83366334_48190635. Since there are millions of files, is there a way to streamline it using .NET? Much appreciated. – heavyguidence Jan 11 '18 at 18:16
0

This is how I implemented it without using .NET 4.0, only PowerShell 2.0 and the old-fashioned DIR command:

It's just 2 lines of (easy) code:

cd <source_path>
# dir /B emits bare file names one per line, which stream through the pipeline
cmd /c "dir /B" | % { Move-Item $_ -Destination "<dest_folder>" }

My PowerShell process only uses 15 MB. No changes needed on the old Windows 2008 server!

Cheers!