
I am a programming enthusiast and novice. I am using PowerShell to try to solve the following need:

  1. I need to extract the full paths of files with the .img extension inside a folder containing roughly 900,000 subfolders and about a million files, of which roughly 900,000 are .img files.
  2. Each .img file must be processed by an .exe, with the list of files read from a file.
    Which is better: storing the result of Get-ChildItem in a variable or in a file?

I would greatly appreciate your guidance and support to optimize this and/or find the best way to balance processing speed against resource consumption. Thank you in advance!!
This is the code I am currently using:

$PSDefaultParameterValues['*:Encoding'] = 'Ascii'
$host.ui.RawUI.WindowTitle = "DICOM IMPORT IN PROGRESS"
#region SET WINDOW FIXED WIDTH
$pshost = get-host
$pswindow = $pshost.ui.rawui
$newsize = $pswindow.buffersize
$newsize.height = 3000
$newsize.width = 150
$pswindow.buffersize = $newsize
$newsize = $pswindow.windowsize
$newsize.height = 50
$newsize.width = 150
$pswindow.windowsize = $newsize
#endregion
#
$out = ("$pwd\log_{0:yyyyMMdd_HH.mm.ss}_import.txt" -f (Get-Date))
cls
"`n" | tee -FilePath  $out -Append
"*****************" | tee -FilePath  $out -Append
"**IMPORT SCRIPT**" | tee -FilePath  $out -Append
"*****************" | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
#
# SET SEARCH FOLDERS #
"Working Folder" | tee -FilePath  $out -Append
$path1 = Read-Host "Enter folder location" | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
#
#
# SET & SHOW HOSTNAME
"SERVER NAME" | tee -FilePath  $out -Append
$ht = hostname | tee -FilePath $out -Append
Write-Host $ht
Start-Sleep -Seconds 3
"`n" | tee -FilePath  $out -Append
#
#
# GET FILES
"`n" | tee -FilePath  $out -Append
#"SEARCHING IMG FILES, PLEASE WAIT..." | tee -FilePath  $out -Append
$files = $path1 | Get-ChildItem -recurse -file -filter *.img | ForEach-Object { $_.FullName }
# SHOW Get-ChildItem PROCESS ON CONSOLE
Out-host -InputObject $files 
"`n" | tee -FilePath  $out -Append
"$(($files | Measure-Object).Count) IMG FILES FOUND TO PUSH" | tee -FilePath  $out -Append
# DUMP Get-ChildItem output into a file
$files > $pwd\pf
Start-Sleep -Seconds 5

# TIMESTAMP
"`n" | tee -FilePath  $out -Append
"IMPORT START" | tee -FilePath  $out -Append
("{0:yyyy/MM/dd HH:mm:ss}" -f (Get-Date)) | tee -FilePath $out -Append
"********************************" | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
#
#
#SET TOOL
$ir = $Env:folder_tool
$pt = "utils\tool.exe"
#
#PROCESSING FILES
$n = 1
$pe = foreach ($file in Get-Content $pwd\pf ) {
    $tb = (Get-Date -f HH:mm:ss) | tee -FilePath  $out -Append
    $fp = "$n. $file" | tee -FilePath  $out -Append
    #
    $ep = & $ir$pt -c $ht"FIR" -i $file | tee -FilePath  $out -Append
    $as = "`n" | tee -FilePath  $out -Append
    # PRINT CONSOLE IMG FILES PROCESS
    Write-Host $tb
    Write-Host $fp
    Out-host -InputObject $ep
    Write-Host $as
    $n++
}  
#
#TIMESTAMP
"********************************" | tee -FilePath  $out -Append
"IMPORT END" | tee -FilePath  $out -Append
("{0:yyyy/MM/dd HH:mm:ss}" -f (Get-Date)) | tee -FilePath  $out -Append
"`n" | tee -FilePath  $out -Append
AHL7
  • With millions of files, I suggest dropping `Get-ChildItem` and avoiding the pipeline, which is notoriously slow. Instead, use the `[IO.Directory]::EnumerateFiles()` method with a `foreach` loop [as shown by this answer](https://stackoverflow.com/a/67531923/7571258); a sketch follows these comments. – zett42 Aug 23 '22 at 10:25
  • "*speed up processes vs. resource consumption*" which way? Note that each way would have a different approach. In case you want to save resources: ***do* (correctly) use the mighty [PowerShell Pipeline](https://learn.microsoft.com/powershell/module/microsoft.powershell.core/about/about_pipelines)** – iRon Aug 23 '22 at 10:53
  • What is `$pe` doing? In case you want to save memory, you should consider piping each item directly into the next cmdlet (as in @Mathias' answer) rather than potting up everything in `$pe` (aka memory) – iRon Aug 23 '22 at 11:02
  • @iRon Using `[IO.Directory]::EnumerateFiles()` with a `foreach` loop you can have both: a speed-up and less resource consumption. The speed-up comes from not using the pipeline, and memory usage is lower because it returns an _enumerator_, which gets the next item only when requested (unless you actively store its output in an array). – zett42 Aug 23 '22 at 11:24
  • @zett42, good point, I presume that also works for the `ForEach` method: `[IO.Directory]::EnumerateFiles(...).ForEach{ $_ }` – iRon Aug 23 '22 at 11:52
  • @iRon Unfortunately that doesn't hold true for `.ForEach()`. I just did a test, recursively enumerating all files of a large folder. With `foreach( $path in [IO.Directory]::EnumerateFiles(...) ) { $path }`, output starts immediately. However, with `[IO.Directory]::EnumerateFiles(...).ForEach{ $_ }`, there is a big delay before output starts. Apparently `.ForEach{}` collects all output into an array first. :( – zett42 Aug 23 '22 at 12:22
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/247480/discussion-between-iron-and-zett42). – iRon Aug 23 '22 at 12:47
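
For reference, a minimal sketch of the streaming enumeration the comments recommend (the `$path1` variable and `*.img` filter come from the question; everything else here is illustrative):

# Lazily enumerate *.img files recursively: no FileInfo objects are created and
# memory stays flat, because the enumerator yields one path at a time.
foreach ($path in [IO.Directory]::EnumerateFiles($path1, '*.img', [IO.SearchOption]::AllDirectories)) {
    $path  # the full path, emitted as soon as it is found
}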

2 Answers


Try processing the files in parallel with PoshRSJob. Replace the Start-Process call in Process-File with your own code, and note that the jobs have no access to the console. Process-File must return a string. Adjust $JobCount and $inData to your environment.

The main idea is to load the whole file list into a ConcurrentQueue, start $JobCount background jobs, and wait for them to exit. Each job takes a value from the queue, passes it to Process-File, and repeats until the queue is empty.


NOTE: If you stop the script, the RS jobs will keep running until they finish or PowerShell closes. Use Get-RSJob | Stop-RSJob and Get-RSJob | Remove-RSJob to stop the background work; see the snippet below.
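
If PoshRSJob is not installed yet, it can be pulled from the PowerShell Gallery (this assumes PowerShellGet is available); the cleanup commands from the note look like this in practice:

# One-time install from the PowerShell Gallery:
Install-Module PoshRSJob -Scope CurrentUser

# If a run was aborted, stop and remove any leftover runspace jobs:
Get-RSJob | Stop-RSJob
Get-RSJob | Remove-RSJob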


Import-Module PoshRSJob

Function Process-File
{
    [CmdletBinding()] # enables common parameters such as the -EA (ErrorAction) used below
    Param(
       [String]$FilePath
    )
    $process = Start-Process -FilePath 'ping.exe' -ArgumentList '-n 5 127.0.0.1' -PassThru -WindowStyle Hidden
    $process.WaitForExit();
    return "Processed $FilePath"
}

$JobCount = [Environment]::ProcessorCount - 2 
$inData = [System.Collections.Concurrent.ConcurrentQueue[string]]::new(
    # AllDirectories makes the enumeration recursive, as the question requires
    [System.IO.Directory]::EnumerateFiles('S:\SCRIPTS\FileTest', '*.img', [System.IO.SearchOption]::AllDirectories)
    )
 
$JobScript = [scriptblock]{
    $inQueue = [System.Collections.Concurrent.ConcurrentQueue[string]]$args[0]
    $outBag = [System.Collections.Concurrent.ConcurrentBag[string]]$args[1]
    $currentItem = $null
    while($inQueue.TryDequeue([ref] $currentItem) -eq $true)
    {
        try
        {
            # Add result to OutBag
            $result = Process-File -FilePath $currentItem -EA Stop
            $outBag.Add( $result )
        }
        catch
        {
            # Catch error
            Write-Output $_.Exception.ToString()
        }
    }
}
 

 
$resultData = [System.Collections.Concurrent.ConcurrentBag[string]]::new()
 
$i_cur = $inData.Count
$i_max = $i_cur
 
# Start jobs
$jobs = @(1..$JobCount) | % { Start-RSJob -ScriptBlock $JobScript -ArgumentList @($inData, $resultData) -FunctionsToImport @('Process-File') }
 
# Wait queue to empty
while($i_cur -gt 0)
{
    Write-Progress -Activity 'Doing job' -Status "$($i_cur) left of $($i_max)" -PercentComplete (100 - ($i_cur / $i_max * 100)) 
    Start-Sleep -Seconds 3 # Update frequency
    $i_cur = $inData.Count
}
 
# Wait jobs to complete
$logs = $jobs | % { Wait-RSJob -Job $_ } | % { Receive-RSJob -Job $_  } 
$jobs | % { Remove-RSJob -Job $_ }
$Global:resultData = $resultData
$Global:logs = $logs

$Global:resultData is an array of the strings returned by Process-File.
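
A possible way to persist the collected results afterwards (the output file name is illustrative):

# Write the per-file results to a log file next to the script:
$Global:resultData | Set-Content -Path "$pwd\import_results.txt"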

filimonic

Which is better: storing the result of Get-ChildItem in a variable or in a file?

If you're hoping to keep memory utilization low, the best solution is to not store the results at all - simply consume the output from Get-ChildItem directly:

# Assumes $path1, $ir, $pt, $ht and $out are initialized as in the question's script
$n = 1
$pe = Get-ChildItem -Path $path1 -Recurse -File -Filter *.img | ForEach-Object {
    $file = $_.FullName
    $tb = (Get-Date -f HH:mm:ss) | tee -FilePath  $out -Append
    $fp = "$n. $file" | tee -FilePath  $out -Append
    #
    $ep = & $ir$pt -c $ht"FIR" -i $file | tee -FilePath  $out -Append
    $as = "`n" | tee -FilePath  $out -Append
    # PRINT CONSOLE IMG FILES PROCESS
    Write-Host $tb
    Write-Host $fp
    Out-host -InputObject $ep
    Write-Host $as
    $n++
}
Mathias R. Jessen
  • `Get-ChildItem` also creates a `FileInfo` object for each file and reads file information. As the OP does not need anything except FullName, `[IO.Directory]::EnumerateFiles()` is better. – filimonic Aug 23 '22 at 12:12
  • @filimonic I highly doubt instantiating the `[FileInfo]` objects is going to be the major bottleneck here, but sure - feel free to post another answer below :) – Mathias R. Jessen Aug 23 '22 at 12:14
  • For an SSD, listing 1M files is ~60 times faster - 0.7 s vs 52 s using GCI - so IMO this matters (a way to reproduce such a timing is sketched below). – filimonic Aug 23 '22 at 14:07
  • @filimonic does it, though? If the `tool.exe` takes 50 ms to execute per file then you're looking at 12 hours 30 minutes and 0.7 seconds instead of 12 hours 30 minutes and 52 seconds :) – Mathias R. Jessen Aug 23 '22 at 14:09
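
For anyone wanting to reproduce such a timing comparison, a minimal sketch using the built-in Measure-Command (the folder path is hypothetical; point both commands at the same large tree):

# Time the Get-ChildItem pipeline versus the lazy .NET enumerator:
Measure-Command { Get-ChildItem 'D:\dicom' -Recurse -File -Filter *.img | ForEach-Object FullName }
Measure-Command { foreach ($f in [IO.Directory]::EnumerateFiles('D:\dicom', '*.img', [IO.SearchOption]::AllDirectories)) { } }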