
I'm reading the contents of a file and keeping just the lines that match a regex or are empty. But writing the results, i.e. a smaller amount of data, is taking ages... Here is the code in question (I've added a few lines for debugging/measuring):

$original = Get-Content "$localDir\$ldif_file"
(Measure-Command -Expression { $original | Out-File "$localDir\Original-$ldif_file" }).TotalSeconds
$lines = ($original | Measure-Object -Line).Lines
"lines of `$original = $lines"

# Just keep lines of interest:
$stripped = $original | select-string -pattern '^custom[A-Z]','^$' -CaseSensitive
$lines = ($stripped | Measure-Object -Line).Lines
"lines of `$stripped = $lines"
(Measure-Command -Expression { $stripped | Out-File "$localDir\Stripped-$ldif_file" }).TotalSeconds

"done"

Problem: writing the smaller ($stripped) data to a file takes 342 seconds - about 30 times longer than writing the $original data! See the output below:

11.5371677
lines of $original = 188715
lines of $stripped = 126404
342.6769547
done

Why is the Out-File of $stripped so much slower than the one of $original? How to improve it?

Thanks!

Chris
  • Interesting that select-string's matchinfo objects and out-file take so much longer. I guess out-file does some extra rendering. – js2010 Sep 06 '21 at 23:54

3 Answers


To complement Mathias' helpful answer:

  • In PowerShell 7+, Select-String supports the -Raw switch, which outputs just the matching lines as strings and should greatly speed up the command.

    • In Windows PowerShell, less efficiently, you can enclose the Select-String call in (...).Line to get the lines as strings only.

    • Also note that Select-String will be much faster if you directly pass it a file path (so that it reads the file itself) rather than piping individual lines via Get-Content.

  • Generally, for writing objects that are already strings, Set-Content is the better - and faster - choice compared to Out-File.

    • See this answer for background information and the bottom section of this answer for a performance comparison.

    • Character-encoding caveat (see this answer for background):

      • In Windows PowerShell, Set-Content defaults to ANSI encoding, whereas Out-File defaults to "Unicode" (UTF-16LE); use -Encoding as needed.
      • Fortunately, PowerShell [Core] 6+ uses a consistent default, namely UTF-8 without BOM.
  • Passing collections through the pipeline can be slow; for collections already fully in memory, it is noticeably faster to pass them as a whole, as an argument instead - assuming the target cmdlet supports that; Set-Content's -Value parameter does.

To put it all together:

# *PowerShell 7*: Use -Raw to directly get the lines as strings.
$stripped = $original | 
  Select-String -Raw -Pattern '^custom[A-Z]','^$' -CaseSensitive

# *Windows PowerShell*: Use (...).Line to get the lines as strings.
$stripped = ($original | 
  Select-String -Pattern '^custom[A-Z]','^$' -CaseSensitive).Line

$lines = $stripped.Count # Simply count the array elements == number of lines.
"lines of `$stripped = $lines"

(Measure-Command -Expression { 
  Set-Content "$localDir\Stripped-$ldif_file" -Value $stripped
}).TotalSeconds
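As a further variation on the file-path tip above, here is a sketch (PowerShell 7+ only, reusing the question's $localDir / $ldif_file variables) that skips Get-Content entirely and lets Select-String read the file itself:

```powershell
# PowerShell 7+: read + filter in one step; -Raw emits plain strings.
$stripped = Select-String -Path "$localDir\$ldif_file" `
                          -Pattern '^custom[A-Z]', '^$' -CaseSensitive -Raw

# Write all lines at once via -Value (faster than piping them one by one).
Set-Content "$localDir\Stripped-$ldif_file" -Value $stripped
```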
mklement0
  • Thanks for your complete answer. I need to stick to Windows PowerShell for now, but with your recommendations (.Line and Set-Content) the file is now written in under a second! – Chris Mar 23 '20 at 09:13

You're really comparing apples and oranges here.

$original contains 189K strings, but $stripped contains 126K MatchInfo objects that will have to be converted to strings one-by-one in the pipeline.
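You can see the type difference directly with a minimal sketch ('customAttr: x' is just an illustrative input line):

```powershell
# Select-String wraps each hit in a MatchInfo object, not a plain string:
$m = 'customAttr: x' | Select-String -Pattern '^custom[A-Z]' -CaseSensitive
$m.GetType().Name   # MatchInfo
$m.Line             # the original string, 'customAttr: x'

# A plain -cmatch filter keeps the original [string] values:
$s = 'customAttr: x' | Where-Object { $_ -cmatch '^custom[A-Z]' }
$s.GetType().Name   # String
```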

Use the -cmatch operator instead to retain the original string input values and you'll find it's much faster to output them to a file:

$original = Get-Content "$localDir\$ldif_file"
(Measure-Command -Expression { $original | Out-File "$localDir\Original-$ldif_file" }).TotalSeconds
$lines = ($original | Measure-Object -Line).Lines
"lines of `$original = $lines"

# Just keep lines of interest:
$stripped = $original |Where-Object {$_ -cmatch '^custom[A-Z]' -or $_ -like ''}
$lines = ($stripped | Measure-Object -Line).Lines
"lines of `$stripped = $lines"
(Measure-Command -Expression { $stripped | Out-File "$localDir\Stripped-$ldif_file" }).TotalSeconds

"done"
Mathias R. Jessen
  • Thanks, the file writing is indeed much faster but the pattern matching in this case is taking much longer. There's no real improvement overall. – Chris Mar 23 '20 at 09:15

"Burst Mode"

To speed up the Set-Content and Out-File cmdlets by roughly another factor of 2 on top of what is already answered here, try this Create-Batch cmdlet:

Install-Script -Name Create-Batch

$lines |Create-Batch |Set-Content .\lines.txt

By default this creates a single batch (array) containing all the items, so the result of the statement above is the same as that of $lines |Set-Content .\lines.txt - but note that it appears (for an as-yet-unknown reason) to be about twice as fast, even if you partially limit the memory usage by setting e.g. -Size 10000.
See: #18070 Possible Set/Add-Content performance improvement

iRon