
I'm just curious if I'm missing any documentation, or if there is a different/better way to do this that negates the need for documentation. Maybe I'm the only one trying to use Select-Object to select the -First X unique instances from a set of data.

Based on the testing below, it looks like using Select-Object with the -Unique switch and some type of limiter (First, Last, Skip, Index, etc.) inherently causes the limiter to be applied BEFORE removing duplicates. This doesn't make sense to me conceptually, but also doesn't appear to be documented.

I apologize for the poor example, but consider an array of 20 items with each item appearing twice:

PS > $array = @() ; 1..10 | % { $array += $_ ; $array += $_ }
PS > $array -Join ','
1,1,2,2,3,3,4,4,5,5,6,6,7,7,8,8,9,9,10,10  ##Displaying the array on a single comma separated line

Let's say that someone gives you $array, but you can only handle a maximum input of 5 objects. Filtering down what you're given, you might be tempted to use Select-Object. At first you end up with 5 objects, but there are duplicates, so quick thinking you simply add the -Unique switch and then you realize that the output still isn't quite right.

PS > ($array | Select-Object -First 5) -Join ','
1,1,2,2,3  ##5 objects as expected, but with duplicates
PS > ($array | Select-Object -Unique -First 5) -Join ','
1,2,3  ##No duplicates, but less than the expected 5 objects...

To get the outcome I was expecting, I'd need Select-Object to remove the duplicates prior to returning the final set of objects. While there is nothing wrong in knowing this, it seems strange to me that Select-Object uses the order of operations that it does, and also that there isn't any documentation of the fact that the -Unique switch is applied last, after the other parameters.

PS > ($array | Select-Object -Unique | Select-Object -First 5) -Join ','
1,2,3,4,5  ##This is my expected outcome, 5 objects returned without any duplicates
immobile2
  • I think this has to do with _"When you include a Select-Object command with the First or Index parameters in a command pipeline, PowerShell stops the command that generates the objects as soon as the selected number of objects is generated, even when the command that generates the objects appears before the Select-Object command in the pipeline. To turn off this optimizing behavior, use the Wait parameter."_. I cannot test this now (on mobile), but you could try with the `-Wait` switch if that indeed changes the behavior. – Theo Oct 14 '21 at 21:11
  • @Theo, the behavior isn't related to `-Wait`. Instead, `-Unique` modifies the _output_ of `Select-Object`: It is applied to whatever output results from applying the _other_ parameters, such as `-First`. – mklement0 Oct 14 '21 at 22:06
  • Good thought @Theo on `-Wait`, but I actually gave that a shot as well before posting and as @mklement0 says, it isn't related. I just found this to be strange behavior and wanted to make sure it was working the way it seemed, as well as try to document it for others if they run into it. – immobile2 Oct 15 '21 at 01:08
  • The [`Select-Object help topic`](https://learn.microsoft.com/powershell/module/microsoft.powershell.utility/select-object) has now been updated with the statement "`Unique` selects values _after_ other filtering parameters are applied." and now also provides a relevant example. – mklement0 Oct 16 '21 at 14:54

1 Answer


Indeed, the -First / -Last / -Skip / -Index / -SkipIndex / -SkipLast parameters apply to the original input first, and -Unique is applied to the resulting output.
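
As a minimal sketch of that ordering with a parameter other than -First (the sample data and variable names are made up for illustration): -Skip is applied to the raw input, and -Unique only sees what is left over.

```powershell
# Illustrative sample data (an assumption, not taken from the question).
$data = 1, 1, 2, 2, 3, 3

# -Skip 2 is applied to the raw input first, leaving 2, 2, 3, 3;
# -Unique then removes duplicates from that remainder only.
$skipThenUnique = $data | Select-Object -Skip 2 -Unique

# Deduplicating first (1, 2, 3) and skipping afterwards gives a different result.
$uniqueThenSkip = $data | Select-Object -Unique | Select-Object -Skip 2

$skipThenUnique -join ','   # 2,3
$uniqueThenSkip -join ','   # 3
```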

The simple workaround is to use two Select-Object calls: one that finds the unique objects, and another that selects the desired number from among the unique ones:

PS> 1, 1, 2, 3 | Select-Object -Unique | Select-Object -First 2
1
2

Given that Select-Object -Unique is excessively slow as of PowerShell 7.2 (see bottom section), here is a faster workaround, as you've discovered yourself: Use an aux. System.Collections.Generic.HashSet`1 instance combined with ForEach-Object; the example also shows support for case-insensitivity, which Select-Object -Unique currently lacks (see bottom section):

# Create an aux. hash set that keeps track of which objects have
# already been seen, using case-*insensitive* comparisons.
$auxHashSet = [Collections.Generic.HashSet[string]]::new(
                [StringComparer]::InvariantCultureIgnoreCase
              )

# Stream to ForEach-Object, where the aux. hash set is used
# to only pass out objects that haven't previously been seen.
'a', 'A', 'B', 'c' |
  ForEach-Object { if ($auxHashSet.Add($_)) { $_ } } |
    Select-Object -First 2

This outputs 'a', 'B', as desired. Note that you may want to remove the $auxHashSet variable afterwards so as to (eventually) free its memory - see next.

Using a -Begin block with ForEach-Object, you can make the pipeline more self-contained. Note, however, that all script blocks passed to ForEach-Object run directly in the caller's scope, so $auxHashSet is still created there and lives on after the command; you still have to remove it manually to (eventually) release its memory.

  • Note: While in principle you could do that in an -End block, this does not work with Select-Object -First, because the premature stopping of the pipeline does not give upstream cmdlets a chance to run their end blocks - see GitHub issue #7930 for a discussion of this surprising behavior.
'a', 'A', 'B', 'c' |
  ForEach-Object -Begin { 
    $auxHashSet = [Collections.Generic.HashSet[string]]::new([StringComparer]::InvariantCultureIgnoreCase) 
  } -Process {
    if ($auxHashSet.Add($_)) { $_ } 
  } |
    Select-Object -First 2
# Remove the aux. variable and (eventually) free its memory.
Remove-Variable auxHashSet 
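
If the manual cleanup is a concern, one possible variation is to wrap the same hash-set logic in a small pipeline function, so that the hash set lives in the function's own scope and does not linger in the caller's. This is only a sketch: the function name Select-DistinctString and the variable $seen are hypothetical, and it handles string input only.

```powershell
# Hypothetical helper (not a built-in cmdlet): streams distinct strings,
# compared case-insensitively, in input order.
function Select-DistinctString {
  [CmdletBinding()]
  param(
    [Parameter(Mandatory, ValueFromPipeline)]
    [AllowEmptyString()]
    [string] $InputObject
  )
  begin {
    # Local to the function's scope, so nothing is left behind in the caller's.
    $seen = [System.Collections.Generic.HashSet[string]]::new(
      [System.StringComparer]::InvariantCultureIgnoreCase
    )
  }
  process {
    # .Add() returns $true only if the string hasn't been seen before.
    if ($seen.Add($InputObject)) { $InputObject }
  }
}

$firstTwoDistinct = 'a', 'A', 'B', 'c' | Select-DistinctString | Select-Object -First 2
$firstTwoDistinct -join ','   # a,B
```

Because the function emits each new string as soon as it arrives, Select-Object -First 2 can still stop the pipeline early.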

Note that there's also a LINQ-based alternative, via [System.Linq.Enumerable]::Distinct(), but it has important constraints:

  • The output is unordered, i.e. the input order is not guaranteed to be preserved.

  • You cannot stream the method's input collection from a PowerShell command (to pass a PowerShell command's output to a method, it must be collected in full in an array, up front) - however, the output from LINQ methods such as Distinct() is effectively streaming, due to returning a lazy enumerable.[1]

  • Additionally, the input array must be strongly typed, if it isn't already. PowerShell makes this easy with a cast such as [int[]], but do note that with an [object[]]-based array as input (which is what regular PowerShell arrays are, such as those used for collecting command output), the cast involves creating a copy of the array, which with large input collections can by itself take a while.

[Linq.Enumerable]::Distinct(
  [string[]] ('a', 'A', 'B', 'c'), 
  [StringComparer]::InvariantCultureIgnoreCase
) | Select-Object -First 2

This too outputs 'a', 'B' (though the order of the output elements isn't guaranteed).

If the constraints aren't a concern and you need to find the unique elements in the whole input collection (or a large part of it), this solution is considerably faster than the hash-set-assisted ForEach-Object solution, especially if your input collection is already strongly typed.
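
As a rough sketch of how the lazy output can be consumed without the pipeline, a foreach loop with break also stops the enumeration on demand. The sample array and the variable names ($numbers, $wanted, $firstDistinct) are assumptions made for illustration.

```powershell
# Illustrative, strongly typed input array.
[int[]] $numbers = 1, 1, 2, 2, 3, 3, 4, 4

# Distinct() returns a lazy enumerable; nothing is enumerated yet.
$distinct = [System.Linq.Enumerable]::Distinct($numbers)

# Enumerate on demand and stop once enough distinct values were collected.
$wanted = 2
$firstDistinct = [System.Collections.Generic.List[int]]::new()
foreach ($n in $distinct) {
  $firstDistinct.Add($n)
  if ($firstDistinct.Count -ge $wanted) { break }
}

$firstDistinct   # two distinct values; which two is not guaranteed
```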

If, within the same constraints, you don't care about the lazy output behavior and just want to get an in-memory collection of all distinct objects - again, unordered - you can use a System.Collections.Generic.HashSet`1 instance directly:

[Collections.Generic.HashSet[string]]::new(
  [string[]] ('a', 'A', 'B', 'c'), 
  [System.StringComparer]::InvariantCultureIgnoreCase
)

This outputs 'a', 'B', 'c', notably as a hash-set object rather than an array; however, because a hash set is enumerable, it behaves like an array in PowerShell's enumeration contexts, notably in the pipeline.


Select-Object -Unique pitfalls, contrast with Sort-Object:

  • While the extra Select-Object call does add processing overhead, the command overall has the potential to process only as many input objects as needed, i.e. to stop processing once the desired number of unique objects has been found.

  • However, as of PowerShell 7.2, it seems that Select-Object -Unique is implemented inefficiently and unexpectedly collects all input first before producing output, even though there's no conceptual reason to do so: it should be able to produce streaming output, i.e. to - conditionally - output input objects as they're being received, because it only needs to consider what input objects have been received so far.

    • In practice, as of PowerShell 7.2, Select-Object -Unique is excessively slow with larger input collections; the current, problematic implementation is discussed in GitHub issues #11221 and #7707.

    • This conceptual ability to only consider input received so far contrasts with Sort-Object, which also offers a -Unique switch, but of necessity must collect all input first before producing output, because all input objects must be considered for proper sorting.

      • As of PowerShell 7.2, Sort-Object -Unique is much faster in practice than Select-Object -Unique.
    • As for how Select-Object -Unique could be implemented in a more efficient, streaming manner: The objects seen so far could be stored in a System.Collections.Generic.HashSet`1 instance to facilitate an efficient test for whether an input object is considered equal to one that has already been output; see this answer for a PowerShell example.

  • If and when Select-Object -Unique is fixed, the tradeoff is as follows:

    • The smaller the proportion of output objects of interest in relation to all input objects, the better off you are using Select-Object -Unique (even if you have to sort the resulting objects afterwards).

    • If you need to output / consider all input objects anyway, and assuming that outputting the objects of interest in sort order is desired / acceptable, Sort-Object is the better choice.

  • As of PowerShell 7.2, Select-Object -Unique is unexpectedly case-sensitive for string input, even though PowerShell is normally case-insensitive by default - see GitHub issue #12059.
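
The case-sensitivity difference between the two cmdlets can be seen with a small comparison; this is a sketch with made-up sample strings, relying on the default (case-insensitive) behavior of Sort-Object.

```powershell
# Illustrative sample data.
$letters = 'a', 'A', 'b', 'B'

# Select-Object -Unique compares strings case-sensitively,
# so all four strings are considered distinct.
$viaSelect = @($letters | Select-Object -Unique)

# Sort-Object -Unique compares case-insensitively by default,
# so only one string per letter survives.
$viaSort = @($letters | Sort-Object -Unique)

$viaSelect.Count   # 4
$viaSort.Count     # 2
```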


Testing whether a cmdlet produces streaming output or collects all input first:

Short of examining a cmdlet's source code, here's a way to test - the middle pipeline segment is the command to test:

# Test Sort-Object -Unique
# Because the command cannot stream, for conceptual reasons, 
# it takes a while for the one and only output object to appear.
1..1e5 | Sort-Object -Unique | Select-Object -First 1
# Test Select-Object -Unique
# The command *could* stream, conceptually speaking, in which case
# the output object would appear right away.
# However, as of PowerShell 7.2, the command isn't implemented
# in a streaming fashion, so it takes a - surprisingly long - while
# for the output object to appear.
1..1e5 | Select-Object -Unique | Select-Object -First 1

If the given pipeline above produces its one and only output object near instantly, the command of interest is streaming; if it takes a while before the output object appears, it collects all input first.
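
Where timing by eye is too imprecise, a possible alternative is to count how many input objects the upstream command had to produce before the first output object appeared. The sketch below assumes a hypothetical $counter hashtable and uses Sort-Object and Where-Object as known collecting and streaming examples; substitute the command of interest in the middle segment.

```powershell
# Hypothetical counter, incremented for every input object that is produced.
$counter = @{ Produced = 0 }

# Collecting command: all 1000 inputs must be produced before any output appears.
$null = 1..1000 |
  ForEach-Object { $counter.Produced++; $_ } |
  Sort-Object -Unique |
  Select-Object -First 1
$producedForSort = $counter.Produced     # 1000

# Streaming command: the pipeline is stopped after the first input object.
$counter.Produced = 0
$null = 1..1000 |
  ForEach-Object { $counter.Produced++; $_ } |
  Where-Object { $true } |
  Select-Object -First 1
$producedForWhere = $counter.Produced    # 1
```

A count equal to the input size indicates that the command collects all input first; a count far below it indicates streaming.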

mklement0
  • Glad to hear it, @immobile2. If you pass a _.NET_ lazy enumerable - notably as returned by LINQ methods - _and_ the target method is typed as an enumerable, then method calls from PowerShell can perform on-demand enumeration too. However, while the PowerShell pipeline embodies the same concept, it cannot be integrated with such .NET method calls: PowerShell invariably collects a PowerShell command's output in a static array, in full, before passing that to the method. – mklement0 Oct 22 '21 at 15:34
  • The above applies as of PowerShell 7.2. While obtaining a lazy enumerable for a PowerShell command call would be a nice feature, I'm not sure if it would be considered an important enough feature to warrant implementation and I suspect that implementing it would present nontrivial technical challenges. – mklement0 Oct 22 '21 at 15:37
  • @immobile2, indeed, `String.Split()` only accepts static string _arrays_ as input (just execute `'foo'.Split` to see the overloads). – mklement0 Oct 22 '21 at 17:15
  • Yes, `.ForEach()` and `.Where()` are provided by PowerShell, as so-called [intrinsic members](https://docs.microsoft.com/en-us/powershell/module/microsoft.powershell.core/about/about_Intrinsic_Members), but given that they're _methods_, the usual .NET-method rules apply. Fundamentally, they do enumerate a lazy enumerable as such (though they themselves do not produce lazy enumerables). With lazy LINQ enumerables, however, `.Where()` is preempted by a type-native method of the same name. You can call `.ForEach()` on them, however. Use a `foreach` _loop_ if you want to stop enumeration on demand. – mklement0 Oct 22 '21 at 17:16