I'm trying to build a high-performance PowerShell script that will scan all files & folders on an NTFS/SAN share. I don't need the file data, just the metadata: attributes, timestamps, and permissions.
I can scan files & folders via a simple PowerShell script:
$path = "\\SomeNetworkServer\Root"
Get-ChildItem -Path $path -Recurse | ForEach-Object {
    # Do some work like (Get-Acl -Path $_.FullName)...
}
The problem with this approach is that it can take hours or even days on very large storage systems with millions of folders & files. I figured parallelism might better utilize storage, CPU, and network I/O so the scan finishes faster. I'm not sure where (if at all) the bottleneck exists in Get-ChildItem; I assume it just runs sequentially through each directory.
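To get a rough feel for whether the enumeration itself or the per-item ACL work dominates, I was planning to time both on a representative subtree. A sketch of that (assume $path points at a smaller test folder, not the whole share):

# Rough timing comparison (sketch): enumeration alone vs. enumeration plus ACL lookups
$enumOnly = Measure-Command {
    Get-ChildItem -Path $path -Recurse | Out-Null
}
$withAcls = Measure-Command {
    Get-ChildItem -Path $path -Recurse | ForEach-Object { Get-Acl -Path $_.FullName | Out-Null }
}
"Enumeration only: {0:n1}s, with ACLs: {1:n1}s" -f $enumOnly.TotalSeconds, $withAcls.TotalSeconds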
My approach was to build a function that recurses using the System.IO namespace to enumerate all directories and files, and uses RSJobs (the PoshRSJob module) to maintain a pool of actively running jobs, one per directory.
Something like:
function ExploreShare {
    param(
        [parameter(Mandatory = $true)]
        [string] $path,
        [parameter(Mandatory = $true)]
        [int] $threadCount)
    Begin {}
    Process {
        [System.IO.Directory]::EnumerateDirectories($path) | ForEach-Object {
            $curDirectory = $_
            # Do work on the current directory, output to pipeline...

            # Determine if some of the jobs finished
            $jobs = Get-RSJob
            $jobs | Where-Object { $_.State -eq "Completed" } | Receive-RSJob
            $jobs | Where-Object { $_.State -eq "Completed" } | Remove-RSJob
            $running = $jobs | Where-Object { $_.State -eq "Running" }

            # If we exceed our threadCount quota, run as normal recursion
            if ($running.Count -gt $threadCount -or ($threadCount -eq 0)) {
                ExploreShare -path $curDirectory -threadCount $threadCount
            }
            else {
                # Create a new job for the directory and its nested contents
                Start-RSJob -InputObject $curDirectory -ScriptBlock {
                    ExploreShare -path $_ -threadCount $using:threadCount
                } -FunctionsToImport @("ExploreShare") | Out-Null
            }
        }
        # Process all files in the current directory on the current thread/job
        [System.IO.Directory]::EnumerateFiles($path) | ForEach-Object {
            $curFile = $_
            # Do work with the current file, output to pipeline...
        }
    }
    End {
        # End of invocation: wait for remaining jobs to finish and flush them out
        $jobs = Get-RSJob
        $jobs | Wait-RSJob | Receive-RSJob
        $jobs | Remove-RSJob
    }
}
Then call it like so:
ExploreShare -path $path -threadCount 5
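For context, the "Do work" placeholders above are metadata-only, roughly like this sketch (the helper name Get-ItemMetadata and the chosen properties are just illustrative):

# Illustrative helper (name and properties are placeholders): metadata-only output per path
function Get-ItemMetadata {
    param([string] $fullName)
    $info = Get-Item -LiteralPath $fullName -Force   # FileInfo or DirectoryInfo
    [pscustomobject]@{
        FullName      = $info.FullName
        Attributes    = $info.Attributes
        CreationTime  = $info.CreationTimeUtc
        LastWriteTime = $info.LastWriteTimeUtc
        Owner         = (Get-Acl -LiteralPath $fullName).Owner
    }
}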
A few design challenges I'm struggling to overcome:
- When calling Start-RSJob, I believe the nested job has no awareness of the parent's jobs, resulting in an ever-growing list of RSJobs per directory as ExploreShare is called recursively.
- I tried doing something like $jobs = Get-RSJob and passing $jobs into Start-RSJob & ExploreShare to give a child job awareness of the running jobs (not sure it actually works that way); it just caused my process memory to skyrocket.
- If I keep the parallelism to just the enumerated root directories (roughly the sketch after this list), I can get into a situation where most folders at the root are empty except a few that contain all the nested files & folders (like a Department folder).
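By "parallelism to just enumerated root directories" I mean roughly the following sketch, where each root subtree is walked sequentially inside its own job (the per-item work is a placeholder):

# Sketch: parallelize only at the root level, throttled to $threadCount jobs
[System.IO.Directory]::EnumerateDirectories($path) |
    Start-RSJob -Throttle $threadCount -ScriptBlock {
        # $_ is one root-level directory; everything under it runs sequentially in this job
        Get-ChildItem -LiteralPath $_ -Recurse | ForEach-Object {
            # metadata-only work here, e.g. Get-Acl -LiteralPath $_.FullName
        }
    } | Wait-RSJob | Receive-RSJob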
My goal is for there to be at most X jobs running in parallel at any point while ExploreShare is running, and if a job slot has freed up while a nested directory is being enumerated, that directory (and its files) should be handed to a new job instead of being processed on the current thread.
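In other words, the decision I'm trying to enforce at every directory boils down to this isolated sketch (same Get-RSJob filtering as in the function above):

# Desired invariant (sketch): never more than $threadCount RSJobs running at once
$running = @(Get-RSJob | Where-Object { $_.State -eq "Running" })
if ($running.Count -lt $threadCount) {
    # a slot is free: hand this directory (and its files) to a new RSJob
}
else {
    # no free slot: keep processing on the current thread via plain recursion
}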
Hopefully this makes sense.