
Suppose I have a process that generates a collection of objects. For a very simple example, consider $(1 | get-member). I can get the number of objects generated:

PS C:\WINDOWS\system32> $(1 | get-member).count
21

or I can do something with those objects.

PS C:\WINDOWS\system32> $(1 | get-member) | ForEach-object {write-host $_.name}
CompareTo
Equals
...

With only 21 objects, doing the above is no problem. But what if the process generates hundreds of thousands of objects? Then I don't want to run the process once just to count the objects and then run it again to execute what I want to do with them. So how can I get a count of objects in a collection sent down the pipeline?

A similar question was asked before, and the accepted answer was to use a counter variable inside the script block that works on the collection. The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct. So I don't want to just count inside the script block. I want a separate, independent measure of the size of the collection that I sent down the pipeline. How can I do that?

NewSites

1 Answer


If processing and counting are both needed:

Doing your own counting inside a ForEach-Object script block is your best bet to avoid processing in two passes.

"The problem is that I already have that counter and what I want is to check that the outcome of that counter is correct."

ForEach-Object is reliably invoked for each and every input object, including $null values, so there should be no need to double-check.

If you want a cleaner separation of processing and counting, you can pass multiple -Process script blocks to ForEach-Object (in this example, { $_ + 1 } is the input-processing script block and { ++$count } is the input-counting one):

PS> 1..5 | ForEach-Object -Begin { $count = 0 } `
                          -Process { $_ + 1 }, { ++$count } `
                          -End { "--- count: $count" }

2
3
4
5
6
--- count: 5

Note that, due to a quirk in ForEach-Object's parameter binding, passing -Begin and -End script blocks is actually required in order to pass multiple -Process (per-input-object) blocks; pass $null if you don't actually need -Begin and/or -End - see GitHub issue #4513.
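
As a minimal sketch of the $null workaround just described: the counter is initialized before the pipeline, so neither -Begin nor -End has real work to do and both receive $null. The variable names $seen and $doubled are hypothetical, chosen only for this illustration.

```powershell
# $seen and $doubled are illustrative names, not part of the original answer.
$seen = 0

# -Begin and -End are passed as $null so that both script blocks given to
# -Process are treated as per-input-object blocks.
$doubled = 1..3 | ForEach-Object -Begin $null `
                                 -Process { $_ * 2 }, { ++$seen } `
                                 -End $null

$doubled   # output of the first -Process block, one value per input object
$seen      # number of input objects counted by the second -Process block
```

Initializing the counter outside the call keeps the ForEach-Object invocation focused on per-object work, at the cost of one more line in the caller.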

Also note that the $count variable lives in the caller's scope, and is not scoped to the ForEach-Object call; that is, $count = 0 potentially updates a preexisting $count variable, and, if it didn't previously exist, lives on after the ForEach-Object call.
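
If that side effect on the caller's $count is unwanted, one possible way to contain it is to run the whole pipeline inside a child scope with & { ... }. This is a sketch of that idea rather than something the answer above prescribes; $result and the placeholder value 'original' are hypothetical.

```powershell
# A preexisting variable in the caller's scope (value chosen for illustration).
$count = 'original'

# & { ... } runs the pipeline in a child scope, so the $count assigned in
# -Begin is local to that scope and is discarded when the block exits.
$result = & {
    1..5 | ForEach-Object -Begin { $count = 0 } `
                          -Process { ++$count } `
                          -End { $count }
}

$result   # the count emitted by the -End block
$count    # the caller's variable, untouched by the pipeline
```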


If only counting is needed:

Measure-Object is the cmdlet to use with large, streaming input collections in the pipeline[1]:

The following example generates 100,000 integers one by one and has Measure-Object count them one by one, without collecting the entire input in memory.

PS> (& { $i=0; while ($i -lt 1e5) { (++$i) } } | Measure-Object).Count
100000

Caveat: Measure-Object ignores $null values in the input collection - see GitHub issue #10905.
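
A small sketch of how that caveat can make the two counting approaches disagree, using a three-element array that contains one $null. The names $data, $measured and $seen are hypothetical.

```powershell
# Three input objects, one of which is $null.
$data = 1, $null, 2

# Measure-Object skips the $null element.
$measured = ($data | Measure-Object).Count

# ForEach-Object is invoked for every element, including $null.
$seen = 0
$data | ForEach-Object { ++$seen }

$measured   # count as seen by Measure-Object
$seen       # count as seen by ForEach-Object
```

If the input can legitimately contain $null values, a mismatch between the two numbers does not by itself indicate an error in the processing script block.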

Note that while counting input objects is Measure-Object's default behavior, it supports a variety of other operations as well, such as summing (-Sum) and averaging (-Average), which can be combined in a single invocation.
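
For instance, a single call can return the count, sum and average together; $stats is a hypothetical variable name used for illustration.

```powershell
# One pass over the input yields all three measurements.
$stats = 1..5 | Measure-Object -Sum -Average

$stats.Count     # number of input objects
$stats.Sum       # sum of the input values
$stats.Average   # arithmetic mean of the input values
```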


[1] Measure-Object, as a cmdlet, is capable of processing input in a streaming fashion, meaning it counts objects it receives one by one, as they're being received, which means that even very large streaming input sets (those also created one by one, such as enumerating the rows of a large CSV file with Import-Csv) can be processed without the risk of running out of memory - there is no need to load the input collection as a whole into memory. However, if (a) the input collection already is in memory, or (b) it can fit into memory and performance is important, then use (...).Count.
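
To illustrate case (a), here is a sketch with a collection that is already in memory, where both approaches return the same number but .Count avoids sending every element through the pipeline. The names $items, $direct and $streamed are hypothetical.

```powershell
# The whole collection is already held in memory as an array.
$items = 1..100000

# Reading the array's Count property does not enumerate the elements.
$direct = $items.Count

# Measure-Object receives and counts the elements one by one.
$streamed = ($items | Measure-Object).Count

$direct
$streamed
```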

mklement0
  • 1) You say no reason to check because `ForEach-object` is reliably invoked. But the need for checking is because of potential error in my script block. So I think I see now that because of how piping works, if I need to run process B on output of process A and check process B for such errors, then I do need to (a) save output of process A to an array and get size of array, (b) separately pipe output of process A to process B with counter in process B, and (c) compare size from (a) with counter from (b). This requires running A twice, but that's the cost of error checking. Does that make sense? – NewSites Nov 03 '19 at 14:43
  • 2) You say to use `measure-object`. But why do that instead of just `count`? In other words, why do `($array | measure-object).count` when you can just do `$array.count`? – NewSites Nov 03 '19 at 14:48
  • @NewSites: Re 2) `Measure-Object`, as a _cmdlet_, is capable of processing input in a _streaming_ fashion, meaning it counts objects _one by one_, as they're being received, which means that even very large input sets - also created one by one, such as enumerating the rows of a large CSV file with `Import-Csv` - can be processed without the risk of running out of memory - there is no need to load the input collection _as a whole_ into memory. If (a) the input collection already _is_ in memory, or (b) it _can fit_ into memory and performance is important, then do use `.Count`. – mklement0 Nov 03 '19 at 15:02
  • Re 1) I'm not sure I fully understand, but note that you can perform error handling inside of `ForEach-Object` script blocks. That is, you can detect / catch / ignore errors as needed and perform conditional counting. – mklement0 Nov 03 '19 at 15:09
  • I found that try-catch does not always catch all errors! See https://stackoverflow.com/questions/58558585/powershell-try-catch-loses-repeated-access-errors . So I'm treading carefully and checking my results. But now I wonder if I can take what your answer said about passing multiple `process` blocks and what your comment said about `measure-object` working in streaming fashion, and combining that to do a count of the output of process A in parallel with but independent from the count taking place in process B. Should that work as an alternative to saving the array from A and measuring it? – NewSites Nov 03 '19 at 16:00
  • No, I don't think what I asked about in my previous comment is sufficient. To do the check I need, it needs to be done on the result of process A as a whole, outside the pipeline. – NewSites Nov 03 '19 at 16:17