
I'm trying to build a high-performance PowerShell script that will scan all files & folders on an NTFS\SAN share. I don't need the file data, just metadata such as attributes, timestamps, and permissions.

I can scan files & folders via a simple PowerShell script:

$path = "\\SomeNetworkServer\Root"
Get-ChildItem -Path $path -Recurse | % {

   #Do some work like (Get-ACL -Path ($_.FullName))...

}

The problem with this approach is that it can take hours or even days on very large storage systems with millions of folders & files. I figured parallelism might help better utilize storage, CPU, and network IO to process things faster. I'm not sure where (if at all) a bottleneck exists in Get-ChildItem; I assume it just runs sequentially through each directory.

My approach was to build a function that uses recursion with the System.IO namespace to enumerate all directories and files, and RSJobs (the PoshRSJob module) to maintain a pool of actively running jobs, one per directory.

Something like:

function ExploreShare {
    param(
        [parameter(Mandatory = $true)]
        [string] $path, 
        
        [parameter(Mandatory = $true)]
        [int] $threadCount)
    Begin {}
    Process {
        [System.IO.Directory]::EnumerateDirectories($path) | % {
            $curDirectory = $_;
            
            #Do work on current directory, output to pipline....
            

            #Determine if some of the jobs finished
            $jobs = Get-RSJob
            $jobs | ? { $_.State -eq "Completed" } | Receive-RSJob
            $jobs | ? { $_.State -eq "Completed" } | Remove-RSJob
            $running = $jobs | ? { $_.State -eq "Running" };
            
            #If we exceed our threadCount quota run as normal recursion
            if ($running.count -gt $threadCount -or ($threadCount -eq 0)) {
                ExploreShare -path $curDirectory -threadCount $threadCount
            }
            else {
                #Create a new Job for the directory and its nested contents
                Start-RSJob -InputObject $curDirectory -ScriptBlock {
                    ExploreShare -path $_ -threadCount $using:threadCount
                } -FunctionsToImport @("ExploreShare") | Out-Null
            }
        }

        #Process all files in current directory in current thread\job
        [System.IO.Directory]::EnumerateFiles($path) | % {
            $curFile = $_;

            #Do work with current file, output to pipeline....
        }
    }
    End {
        #End of invocation wait for jobs to be finished and flush them out?
        $jobs = Get-RSJob
        $jobs | Wait-RSJob | Receive-RSJob
        $jobs | Remove-RSJob
    }
}

Then call it like so: ExploreShare -path $path -threadCount 5

A few design challenges I'm struggling to overcome:

  1. When calling Start-RSJob, I believe the nested job has no awareness of the parent's jobs, resulting in an ever-growing list of RSJobs per directory as ExploreShare is recursively called.

    • I tried doing something like $jobs = Get-RSJob and passing $jobs into Start-RSJob & ExploreShare to build awareness of running jobs in a child job (not sure it actually works that way); it just caused my process memory to skyrocket.
  2. If I keep the parallelism to just the enumerated root directories, I can get into a situation where most of the root directories are empty except for a few that contain all of the nested files & folders (like a Department folder).


My goal is that there are always only X jobs running in parallel when ExploreShare is called, and if a nested directory is being enumerated when a previous job has completed, then that directory (and its files) is processed as a new job instead of on the current thread.

Hopefully this makes sense.

  • Recursive function calls are expensive; you should use a queue or concurrent queue instead if going the threading route. – Santiago Squarzon Jun 20 '23 at 19:48
  • How would I be able to list all the nested files and directories without a recursive call? – The Unique Paul Smith Jun 20 '23 at 19:53
  • @TheUniquePaulSmith, I'm guessing Santiago is thinking of processing a folder, the root being the first, queueing all the subfolders, then getting the next folder in the queue, processing it and queueing all of its subfolders, and repeating this in a loop until the queue is empty. – Darin Jun 20 '23 at 19:59
  • What version of PowerShell are you using? If you can use PowerShell 7+ this is much easier with `ForEach-Object -Parallel` – Santiago Squarzon Jun 20 '23 at 20:03
  • @Darin, if I went a queued processing route, wouldn't that be the same as sequential with Get-ChildItem -Recurse? Also, I'm currently stuck using PS 5.1 and can't guarantee 7+ will be available where this will run. Unless you are assuming we would launch multiple processes against the queue (like a decoupled architecture for scanning & processing)? – The Unique Paul Smith Jun 20 '23 at 20:12
  • @TheUniquePaulSmith, as far as running PowerShell 7, you can take a look at [this answer](https://stackoverflow.com/a/73494335/4190564) where I demonstrated running the Zip version of PS 7 on a remote system. I have run this version of PS 7 in a Windows PE environment, so it appears to run on just about any version of Windows. As for `Get-ChildItem -Recurse`, you need to direct that question to Santiago. I'm not sure how the performance would differ, but I would bet money the pipeline is what kills its performance. – Darin Jun 20 '23 at 20:41
  • Speed is going to depend on the number of cores in the processor and the type of drive. When performing parallel operations, your gain increases until all the cores are in use and then decreases; the OS is multitasking, and once every core is busy, switching between parallel tasks doesn't gain you as much. With a hard drive that has moving heads, you will lose gains once all heads are in use. A solid-state drive has no moving heads, so parallel operations will neither slow the algorithm down nor speed it up. – jdweng Jun 20 '23 at 23:28
  • If these are SMB shares from machines with PowerShell installed, have you tried PowerShell Remoting? `Invoke-Command -ComputerName ...` – lit Jun 21 '23 at 13:49

2 Answers


I would recommend using the ThreadJob module if doing multi-threading in Windows PowerShell 5.1:

Install-Module ThreadJob -Scope CurrentUser
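
If you want to quickly confirm the cmdlet is available after installing, a trivial job (nothing specific to your share) should return 2:

Start-ThreadJob { 1 + 1 } | Receive-Job -Wait -AutoRemoveJob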

Using a ConcurrentQueue<T>, this is how your code would look. Note, however, that I'm certain processing this in parallel will be slower than linear enumeration; for linear enumeration, see the second example.

$threadLimit = 6 # How many threads should run concurrently ?
$queue = [System.Collections.Concurrent.ConcurrentQueue[System.IO.DirectoryInfo]]::new()
$item = Get-Item '\\SomeNetworkServer\Root'
$queue.Enqueue($item)

$target = $null

# assign the loop expression to a variable here if you need to capture output
# i.e.: `$result = while (....)`
while ($queue.TryDequeue([ref] $target)) {
    try {
        $enum = $target.EnumerateFileSystemInfos()
    }
    catch {
        # if we can't enumerate this folder, go to the next one
        Write-Warning $_
        continue
    }

    $jobs = foreach ($item in $enum) {
        Start-ThreadJob {
            $queue = $using:queue
            $item = $using:item

            # do stuff with `$item` here (Get-Acl or something...)
            # it can be a file or directory at this point
            $item

            # if `$item` is a directory
            if ($item -is [System.IO.DirectoryInfo]) {
                # enqueue it
                $queue.Enqueue($item)
            }
        } -ThrottleLimit $threadLimit
    }

    # wait for them before processing the next set of stuff
    $jobs | Receive-Job -Wait -AutoRemoveJob
}

For linear processing, which in my opinion will certainly be faster than the example above, you can use a normal Queue<T>, as there is no need to handle thread safety. In addition, this method is also faster than using Get-ChildItem -Recurse.

$queue = [System.Collections.Generic.Queue[System.IO.DirectoryInfo]]::new()
$item = Get-Item '\\SomeNetworkServer\Root'
$queue.Enqueue($item)

# assign the loop expression to a variable here if you need to capture output
# i.e.: `$result = while (....)`
while ($queue.Count) {
    $target = $queue.Dequeue()
    try {
        $enum = $target.EnumerateFileSystemInfos()
    }
    catch {
        # if we can't enumerate this folder, go to the next one
        Write-Warning $_
        continue
    }

    foreach ($item in $enum) {
        # do stuff with `$item` here (Get-Acl or something...)
        # it can be a file or directory at this point
        $item

        # if `$item` is a directory
        if ($item -is [System.IO.DirectoryInfo]) {
            # enqueue it
            $queue.Enqueue($item)
        }
    }
}
Santiago Squarzon
  • Thank you for this, I tried your linear enumeration against a test share with 2k folders + 15k files and it's about the same time (7.5 minutes) as Get-ChildItem -Recurse. It's the enumeration part I'm trying to speed up. I have a partially working modified version of my ExploreShare function that creates a new job only for folders at the root, and it executes in 2.5 minutes instead of 7. But I can't predict which folders have many nested items, so I'm trying to implement rolling parallelism in some form. – The Unique Paul Smith Jun 20 '23 at 21:11

Thanks to Santiago-Squarzon for pointing me in the right direction. Recursive function calls are not a good fit here and are subject to stack\depth overflow.

I adapted their sample so that top-level directories are processed in parallel, which achieved better results.

As another commenter mentioned, using PowerShell 7 with ForEach-Object -Parallel also helped, and it will be a requirement for this approach to work.

Take the following example:

function Explore-Shares {
    param(
        [Parameter(Mandatory = $true)]
        [String] $Path,

        [Parameter(Mandatory = $false)]
        [Int] $ThrottleLimit = 6
    )
    Begin {
        $topDirectories = [System.Collections.Generic.List[string]]::new()
        
        # Get all top level directories
        $topDirectories.AddRange([System.IO.Directory]::EnumerateDirectories($path));
        
        # Process ACL for top level directories
        $topDirectories | ForEach-Object {
            # Single-thread process top level directories
        }
        
        #Process top level files (if any)
        [System.IO.Directory]::EnumerateFiles($path) | ForEach-Object {
            # Single-thread process top level files
        }
    }
    Process {
        #Process each top level folder as parallel job
        $topDirectories | ForEach-Object -Parallel {
            $target = $null
            $queue = [System.Collections.Generic.Queue[string]]::new()
            $queue.Enqueue($_);

            while ($queue.TryDequeue([ref] $target)) {
                try {
                    [System.IO.Directory]::EnumerateDirectories($target) | ForEach-Object {
                        $curDirectory = $_;
                        #Do some activities against the folder

                        #Enqueue for depth processing
                        $queue.Enqueue($curDirectory)
                    }
            
                    [System.IO.Directory]::EnumerateFiles($target) | ForEach-Object {
                        $curFile = $_;
                        # Do some file activities
                    }
                }
                catch {
                    # if we can't enumerate this item, go to the next one
                    Write-Warning $_
                    continue
                }
            }
        } -ThrottleLimit $ThrottleLimit
    }
}

function Explore-Shares-Legacy {
    param(
        [Parameter(Mandatory=$true)]
        [string] $Path
    )
    Process {
        Get-ChildItem -Path $Path -Recurse | ForEach-Object {
            
            # Do some activities
        }
    }
}

Running them against a sample data set shows the improvement:

[screenshot: timing comparison of Explore-Shares vs Explore-Shares-Legacy]
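
For reference, a comparison like this can be timed with Measure-Command; the share path and throttle value below are just the placeholders used in the examples above:

$path = '\\SomeNetworkServer\Root'

(Measure-Command { Explore-Shares -Path $path -ThrottleLimit 6 }).TotalMinutes
(Measure-Command { Explore-Shares-Legacy -Path $path }).TotalMinutes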

The downside to this is that if a network share concentrates a heavyweight folder structure in a small subset of root folders (which most probably do), then the whole process takes about the same time as a single-threaded sequential scan.

I wish there were a way to query the content depth of a given folder, so it could be decided whether to launch it as a separate threaded job. Spinning up a thread per folder (especially for empty or leaf folders) is actually slower and negates the performance gain.
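
One way to mitigate the imbalance without knowing folder depth up front is to seed the parallel loop with directories a couple of levels deep instead of only the root's immediate children, so a single heavy root folder is split into many smaller work items. Below is a rough sketch of that idea (PowerShell 7+, same placeholder share path as above); it is not a drop-in replacement for Explore-Shares, and files at the shallow levels would still need their own pass:

$root          = '\\SomeNetworkServer\Root'
$throttleLimit = 6
$seedDepth     = 2   # how many levels to expand single-threaded before going parallel

# Expand the tree breadth-first (single-threaded, cheap) down to $seedDepth
$seeds = @($root)
for ($depth = 0; $depth -lt $seedDepth; $depth++) {
    $seeds = @($seeds | ForEach-Object {
        try { [System.IO.Directory]::EnumerateDirectories($_) }
        catch { Write-Warning $_ }
    })
}

# Walk each seed directory with the same queue-based loop as above; the work
# items are now finer grained than "one job per root folder"
$seeds | ForEach-Object -Parallel {
    $queue = [System.Collections.Generic.Queue[string]]::new()
    $queue.Enqueue($_)
    while ($queue.Count) {
        $dir = $queue.Dequeue()
        try {
            foreach ($sub in [System.IO.Directory]::EnumerateDirectories($dir)) {
                $queue.Enqueue($sub)
            }
            foreach ($file in [System.IO.Directory]::EnumerateFiles($dir)) {
                # do per-file work here (Get-Acl, timestamps, ...)
            }
            # do per-directory work on $dir here
        }
        catch {
            Write-Warning $_
        }
    }
} -ThrottleLimit $throttleLimit

This keeps the number of running threads capped at the throttle limit while making the unit of work small enough that one deep Department-style folder no longer dominates a single thread.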