
Below is part of my code for a script that tries to find a certain number in a folder containing thousands of XML files. The only problem is that running the script takes over an hour, which is not ideal. So either I have to find a way to make the script run faster, or pre-filter the directory somehow. Any help is greatly appreciated!

This is the code I used for my script. I already tried speeding it up by using the `ForEach-Object` pipeline instead of `foreach ($file in $files)`, and by using the XML reader functionality.

$files | ForEach-Object{ # <- loop through the files

    $filename = $_.FullName

    [xml]$doc = Get-Content $filename -ReadCount 0 # <- Read content as XML 

    #$batch_name = Select-Xml -Xml $doc -XPath "//Batch[1]/@BatchID" # <- Use XPath to get batch id from file, assign it to variable for comparison
    $batch_name = $doc.SelectSingleNode("//Batch[1]/@BatchID").Value
    
    if ($batch_id -eq $batch_name){ #<- Compare given batch id with each file

        Write-Host "Found file '$filename' with '$batch_name'" #<- Write the full path to the file found
        $filename | Out-File $outfile -Append
    }

    Write-Progress -PercentComplete($counter*100 / $files_count) -Activity "Files searched $counter/$files_count" -Status 'Working...'  #<- Code for progress bar
    $counter++
}

Adalwolf
  • `Get-Content` is known to be inefficient. Try using the .NET techniques discussed in the answer here: [PowerShell Get-Content with basic manipulations so slow](https://stackoverflow.com/questions/47349306/powershell-get-content-with-basic-manipulations-so-slow) – boxdog Aug 03 '23 at 06:35
  • Also, `Get-Content` is the wrong way of loading XML, as it doesn't understand XML encodings. Using `$doc = [xml]::new(); $doc.Load(( Convert-Path $filename ))` might already speed things up (see the combined sketch after these comments). – zett42 Aug 03 '23 at 06:48
  • You might read [PowerShell scripting performance considerations](https://learn.microsoft.com/powershell/scripting/dev-cross-plat/performance/script-authoring-considerations). `$filename | Out-File $outfile -Append` should go outside the loop and be part of the stream: `$files | Get-Content -Raw | Foreach-Object { } | Set-Content $outfile`. But I guess this is just the inner loop and the outer loop iterates through a list of `$batch_name`s. In that case check: https://stackoverflow.com/a/74997412/1701026 – iRon Aug 03 '23 at 06:49
  • Once you have a good pipeline set up, you might also look into [`ForEach-Object -Parallel`](https://devblogs.microsoft.com/powershell/powershell-foreach-object-parallel-feature/) ([note that this requires PowerShell 7](https://stackoverflow.com/a/74268662/1701026)); a sketch follows these comments. – iRon Aug 03 '23 at 06:54
  • I tried using [System.IO.StreamReader] and the method zett42 suggested; neither increased the speed at all. It's still going at around 2.5 files per second, which is the fastest I've got it to run. – Adalwolf Aug 03 '23 at 07:16
  • Are you sure that the performance issue is due to the file reader? Or is it the XPath query: `$doc.SelectSingleNode("//Batch[1]/@BatchID").Value`? In other words, what is the performance if you comment some of the inner logic out? – iRon Aug 03 '23 at 07:58
  • I tested without anything except the reader, same result. That has to be the bottleneck somehow. For the record, the XML files in question are about 50 - 300 KB. – Adalwolf Aug 03 '23 at 08:00
  • Btw. [`Write-Progress` is also quite slow](https://stackoverflow.com/q/21304282/1701026); consider something like `if (!($counter % 100)) { Write-Progress ... }` to show the progress only every 100 iterations. – iRon Aug 03 '23 at 08:11
  • A faster alternative to the DOM-based XML processing with the `[xml]` (aka `XmlDocument`) class can be the _streaming_ `XmlReader` class (a PowerShell sketch follows these comments). Using it from PowerShell might not be much faster, though, as long as the inner loop is coded in PowerShell; to reach its full potential, you would probably have to code the inner loop in [(embedded) C#](https://www.byteinthesky.com/powershell/how-to-run-c-sharp-code-from-powershell). – zett42 Aug 03 '23 at 08:43
  • I think that @zett42 has a point: stream the XML until you find what you are looking for (although I don't think you will need C# for that). If you do want to investigate this further, I recommend creating a new, specific "*How to stream an XML file in PowerShell*" question with an `Xml` example. – iRon Aug 03 '23 at 10:47
  • Anyway, if `$files | ` is slow, don't repeat it; it should be your (most) outer loop. But then it is unclear how `$batch_name` is set, or better: *how **often** it is set*. In other words, if you do this multiple times for a list of `$batch_name`s, they should be provided as an array which is iterated within this loop. – iRon Aug 03 '23 at 10:47
  • "$batch_name" (dumb name) represents the batch id found in the xml files. I wrote the script as such that it checks the batch id for each file, assigns it to the variable called "$batch_name" and compares that to the batch id the user inputted, to find the corresponding file. – Adalwolf Aug 03 '23 at 11:35
  • Sorry for the confusion, I indeed meant `$batch_id`. As it is a single item entered by the user, I can only think of [caching](https://en.wikipedia.org/wiki/Cache_(computing)) all the `$batch_name`s in a separate file somehow (sketched below): meaning storing a (serialized) file with details like `@{ $outfile = @{ Date = $outfile.ModifiedDate; BatchIds = [string[]]$BatchIds } }` and only rescanning the `xml` files when the date of a file has changed (and updating the cache file accordingly). – iRon Aug 03 '23 at 12:16
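
Pulling a few of the suggestions above together (zett42's `$doc.Load()` in place of `Get-Content`, writing the matches out once instead of calling `Out-File -Append` inside the loop, and iRon's throttled `Write-Progress`), the main loop could look roughly like the sketch below. It keeps the question's variables (`$files`, `$files_count`, `$batch_id`, `$outfile`) and is only one possible arrangement, not a tested drop-in replacement.

$counter = 0
$found = foreach ($file in $files) {
    $counter++
    if (-not ($counter % 100)) {   # update the progress bar only every 100 files
        Write-Progress -PercentComplete ($counter * 100 / $files_count) -Activity "Files searched $counter/$files_count" -Status 'Working...'
    }

    $doc = [xml]::new()
    $doc.Load($file.FullName)      # Load() respects XML encodings, unlike Get-Content

    if ($doc.SelectSingleNode("//Batch[1]/@BatchID").Value -eq $batch_id) {
        $file.FullName             # emit the match; collected into $found
    }
}
$found | Set-Content $outfile      # one write instead of appending per match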
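
As a rough illustration of the streaming approach zett42 describes, `System.Xml.XmlReader` can stop reading each file as soon as the first `Batch` element has been seen, instead of building a full DOM. This sketch assumes, as the question's XPath does, that `BatchID` is an attribute on a `Batch` element; `$files`, `$batch_id` and `$outfile` are again the question's variables.

$found = foreach ($file in $files) {
    $reader = [System.Xml.XmlReader]::Create($file.FullName)
    try {
        # ReadToFollowing() advances to the first <Batch> element and stops there,
        # so the rest of the file is never parsed
        if ($reader.ReadToFollowing('Batch') -and
            $reader.GetAttribute('BatchID') -eq $batch_id) {
            $file.FullName
        }
    }
    finally {
        $reader.Close()            # always release the file handle
    }
}
$found | Set-Content $outfile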
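
On PowerShell 7, the same per-file work can be spread over several threads with `ForEach-Object -Parallel`, as iRon suggests. This is only a sketch: `$using:` is required to make the caller's variables visible inside the parallel script block, and the throttle limit is an assumed value to be tuned.

$found = $files | ForEach-Object -Parallel {
    $doc = [xml]::new()
    $doc.Load($_.FullName)
    if ($doc.SelectSingleNode("//Batch[1]/@BatchID").Value -eq $using:batch_id) {
        $_.FullName                # emit matching paths from the parallel block
    }
} -ThrottleLimit 8                 # assumed value; tune to cores and disk speed
$found | Set-Content $outfile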
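
iRon's caching idea could be prototyped by scanning the folder once, storing a path-to-BatchID map, and re-reading only the files whose timestamps have changed since the last scan. The cache file name below is hypothetical, and `Export-Clixml`/`Import-Clixml` is just one convenient serialization choice.

$cacheFile = Join-Path $PSScriptRoot 'batchid-cache.clixml'   # hypothetical location
$cache = if (Test-Path $cacheFile) { Import-Clixml $cacheFile } else { @{} }

foreach ($file in $files) {
    $entry = $cache[$file.FullName]
    if (-not $entry -or $entry.LastWriteTime -ne $file.LastWriteTimeUtc) {
        $doc = [xml]::new()
        $doc.Load($file.FullName)                 # re-parse only new or changed files
        $cache[$file.FullName] = @{
            LastWriteTime = $file.LastWriteTimeUtc
            BatchID       = $doc.SelectSingleNode("//Batch[1]/@BatchID").Value
        }
    }
}
$cache | Export-Clixml $cacheFile -Depth 3        # refresh the cache for the next run

# Looking up a batch id is now a dictionary scan instead of re-parsing every file
$cache.GetEnumerator() |
    Where-Object { $_.Value.BatchID -eq $batch_id } |
    ForEach-Object { $_.Key } |
    Set-Content $outfile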

0 Answers