I would like to know the difference between the below two nearly identical statements:

Get-Content c:\temp\myfile.txt -ReadCount 10 | ForEach-Object { $_ -match 'my_string' }
Get-Content c:\temp\myfile.txt | ForEach-Object { $_ -match 'my_string' }

The first one returns the lines from myfile.txt in which the matched substring exists whereas the second statement returns true or false based on the match for each line passed to ForEach.

Why is this behaviour not intuitive (especially the first one)? Without running the code I would have gone with the true/false for the output for both statements.

Thanks.

Steve
  • 3
    In the first case you do not pass one line at a time to the pipeline. Instead there are ten lines at a time. That's why the `-match` operator switches to its filter mode and outputs the filtered lines. – Olaf Aug 14 '21 at 21:36
  • 2
    You may read more about it in the help topic [about_Comparison_Operators](https://learn.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_comparison_operators?view=powershell-7.1#matching-operators) – Olaf Aug 14 '21 at 21:38
  • 1
    Thanks Olaf. I thought -ReadCount in Get-Content pertains only to file I/O (reading 10 lines in one go, as opposed to reading one line at a time, which would otherwise result in slow I/O). I didn't know that all 10 lines are passed as a single array/collection into ForEach-Object; I was thinking that even with 10 lines read in one go, each of them is passed to ForEach-Object one at a time. Does that mean that internally Get-Content passes 10 lines as an array into ForEach-Object using the comma operator with the syntax ,$TempArrayOfTenLines ? (NB: there is a comma before the $ sign.) – Steve Aug 14 '21 at 21:48
  • 3
    @Steve it is a valid question. You can see for yourself that `X` lines at a time are passed through the pipeline by using `{ $_.Count }`, which also explains why `-match` behaves the way it does in the `-ReadCount` example. – Santiago Squarzon Aug 14 '21 at 22:04

1 Answer

Olaf has provided the crucial pointer in a comment; let me flesh it out:

  • Get-Content's -ReadCount parameter sends arrays (batches) of lines read from the input file through the pipeline.

  • Therefore, the automatic $_ variable in the receiving ForEach-Object call then refers to an array of lines rather than a single line, as would be the case without -ReadCount.

  • With an array (collection) as the LHS, PowerShell's comparison operators such as -match act as filters and return the sub-array of matching items rather than a Boolean.

    • In the case of -match, specifically, using an array-valued LHS additionally means that the automatic $Matches variable, which reports what text the -match operation captured, is not populated.
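
To make the scalar-vs-array distinction concrete, here is a minimal sketch. The sample strings and variable names are made up for illustration; only the `-match` operator and `$Matches` come from the discussion above.

```powershell
# Scalar LHS: -match returns a Boolean and populates $Matches.
$scalarResult = 'has my_string here' -match 'my_(\w+)'
$captured = $Matches[1]   # text captured by the first group

# Array LHS: -match acts as a filter and returns the matching elements.
$batch = 'has my_string here', 'nothing to see', 'another my_string'
$arrayResult = $batch -match 'my_string'

$scalarResult        # True
$captured            # string
$arrayResult.Count   # 2
```

The array-valued case is what happens inside `ForEach-Object` when `$_` holds a whole batch of lines.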

To put it differently:

While using Get-Content -ReadCount can speed up processing of a text file in combination with ForEach-Object, you need to iterate over the elements of the array reported in $_ to get line-by-line processing:

Get-Content c:\temp\myfile.txt -ReadCount 10 | 
  ForEach-Object { foreach ($line in $_) { $line -match 'my_string' } }
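
If you want to verify the batching yourself, the following sketch may help. It assumes a throwaway 25-line temporary file (not your actual file) and uses the `{ $_.Count }` check suggested in the comments; the variable names are illustrative.

```powershell
# Hypothetical sample file: 25 lines written to a temporary path.
$tempFile = [System.IO.Path]::GetTempFileName()
1..25 | ForEach-Object { "line $_" } | Set-Content -Path $tempFile

# Each $_ is an array of up to 10 lines, so .Count reports the batch size.
$batchSizes = Get-Content -Path $tempFile -ReadCount 10 |
  ForEach-Object { $_.Count }

# Iterating over each batch restores one Boolean per line.
$perLine = Get-Content -Path $tempFile -ReadCount 10 |
  ForEach-Object { foreach ($line in $_) { $line -match 'line 2\d' } }

Remove-Item -Path $tempFile

$batchSizes -join ', '       # 10, 10, 5
$perLine.Count               # 25
($perLine -eq $true).Count   # 6 (lines 20 through 25)
```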

Note that it is the common -OutBuffer parameter that would have behaved as you expected: -OutBuffer $n means that the cmdlet at hand collects 1 + $n objects before outputting to the pipeline, at which point, however, they are output one by one, as usual.

That said, unless the value of $n is large, the rarely used -OutBuffer parameter provides no performance benefit, and may even slow things down a bit.
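
As a rough illustration of the -OutBuffer point, the sketch below passes five integers through a buffering `ForEach-Object` call and inspects what the next pipeline stage receives. The input values and variable names are arbitrary; it only shows that the receiving stage still sees individual objects, not how the buffering is timed.

```powershell
# -OutBuffer makes the first ForEach-Object buffer its output internally,
# but the downstream stage still receives the objects one at a time.
$received = 1..5 |
  ForEach-Object { $_ } -OutBuffer 2 |
  ForEach-Object { $_.GetType().Name }

$received.Count                         # one entry per input object
($received -eq 'Object[]').Count -eq 0  # no arrays were passed downstream
```

Compare this with -ReadCount, where the same type check on `$_` would report an array for each batch.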

mklement0
  • Does that mean -ReadCount performs 2 distinct functions here: 1. read 'n' lines at a time from the file (for faster I/O), and 2. pass those 'n' lines as an array into ForEach-Object, instead of iterating over that array first and then passing the items individually into ForEach-Object? – Steve Aug 15 '21 at 07:51
  • 1
    @Steve, yes, an _array_ of 10 lines is passed to each `ForEach-Object` call, so that `$_` then refers to the entire array. Is that what you meant? – mklement0 Aug 15 '21 at 14:55
  • How would this streaming behaviour change if Where-Object is used in place of ForEach-Object? E.g. Get-Content c:\temp\myfile.txt -ReadCount 10 | Where-Object { $_ -match 'my_string' } I am getting not-so-convincing output. My input file is 22 GB, by the way. – Steve Aug 16 '21 at 07:57
  • 1
    @Steve, it wouldn't change: the behavior is built into `Get-Content`, so which cmdlet receives its output doesn't matter. Unless you can really process lines in batches with `-ReadCount`, without needing to look at lines individually, it's better to avoid `Get-Content` for processing large files and use [`switch`](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_Switch) `-File` instead, possibly combined with a `System.IO.StreamWriter` instance for writing to a file. See [this answer](https://stackoverflow.com/a/64938276/45375) for more information. – mklement0 Aug 16 '21 at 12:27
  • I did try the switch -File option on the same 22 GB file. In terms of performance, it wasn't much different. Get-Content makes for more readable code than the not-so-obvious switch -File syntax, hence for now I am sticking to the Get-Content method. – Steve Aug 18 '21 at 09:20