PowerShell's switch
statement has powerful, but little-known features that allow you to iterate over the lines of a file (-file
) and match lines by regular expressions (-regex
).
Not only is switch -file
convenient, it is also much faster than using cmdlets in a pipeline (see bottom section).
[double] $sum = 0
switch -regex -file file.txt {
# Note: The string to the left of each script block below ({ ... }),
# e.g., '^D', is the regex to match each line against.
# Inside the script blocks, $_ refers to the input line at hand.
# Extract number, add to sum, output the line.
'^D' { $sum += $_.Substring(2, 11); $_; continue }
# Summary line: place sum at character position 10, with 0-padding
# Note: `-replace ',', '.'` is only needed if your culture uses "," as the
# decimal mark.
'^S' { $_.Substring(0, 9) + '{0:000000000000000.00}' -f $sum -replace ',', '.'; continue }
# All other lines: pass them through.
default { $_ }
}
Note:
continue
in the script blocks short-circuits further matching for the line at hand; by contrast, if you used break
, no further lines would be processed.
- Based on a later comment, I'm assuming you want an 18-character
0
-left-padded number on the S
line at character position 10
.
With your sample file, the above yields:
F123trial text
DA00000038.95==xxx11
DA00000018.95==yyy11
DA00000018.95==zzzyy
S 000000000000076.85
Optional reading: Comparing the performance of switch -file ...
to Get-Content ... | ForEach-Object ...
Running the following test script:
& {
# Create a sample file with 100K lines.
1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
(Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds,
(Measure-Command { get-content $tmpFile | % { $_ } }).TotalSeconds
Remove-Item $tmpFile
}
yields the following timings on my machine, for instance (the absolute numbers aren't important, but their ratio should give you a sense):
0.0578924 # switch -file
6.0417638 # Get-Content | ForEach-Object
That is, the pipeline-based solution is about 100 (!) times slower than the switch -file
solution.
Digging deeper:
Frode F. points out that Get-Content
is slow with large files - though its convenience makes it a popular choice - and mentions using the .NET Framework directly as an alternative:
Using [System.IO.File]::ReadAllLines()
; however, given that it reads the entire file into memory, that is only an option with smallish files.
Using [System.IO.StreamReader]
's ReadLine()
method in a loop.
However, use of the pipeline in itself, irrespective of the specific cmdlets used, introduces overhead. When performance matters - but only then - you should avoid it.
Here's an updated test that includes commands that use the .NET Framework methods, with and without the pipeline (the use of intrinsic .ForEach()
method requires PSv4+):
& {
# Create a sample file with 100K lines.
1..1e5 > ($tmpFile = [IO.Path]::GetTempFileName())
(Measure-Command { switch -file ($tmpFile) { default { $_ } } }).TotalSeconds
(Measure-Command { foreach ($line in [IO.File]::ReadLines((Convert-Path $tmpFile))) { $line } }).TotalSeconds
(Measure-Command {
$sr = [IO.StreamReader] (Convert-Path $tmpFile)
while(-not $sr.EndOfStream) { $sr.ReadLine() }
$sr.Close()
}).TotalSeconds
(Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)).ForEach({ $_ }) }).TotalSeconds
(Measure-Command { [IO.File]::ReadAllLines((Convert-Path $tmpFile)) | % { $_ } }).TotalSeconds
(Measure-Command { Get-Content $tmpFile | % { $_ } }).TotalSeconds
Remove-Item $tmpFile
}
Sample results, from fastest to slowest:
0.0124441 # switch -file
0.0365348 # [System.IO.File]::ReadLine() in foreach loop
0.0481214 # [System.IO.StreamReader] in a loop
0.1614621 # [System.IO.File]::ReadAllText() with .ForEach() method
0.2745749 # (pipeline) [System.IO.File]::ReadAllText() with ForEach-Object
0.5925222 # (pipeline) Get-Content with ForEach-Object
switch -file
is the fastest by a factor of around 3, followed by the no-pipeline .NET solutions; using .ForEach()
adds another factor of 3.
Simply introducing the pipeline (ForEach-Object
instead of .ForEach()
) adds another factor of 2; finally, using the pipeline with Get-Content
and ForEach-Object
adds another factor of 2.