This is a very simple example:

$Test = @('ae','æ')
$Test | Select-Object -Unique

The output:

ae

What is going on here, and how can I avoid it? Obviously, I do not want "ae" to be treated as equal to "æ".

Anders
  • Which culture / locale are you using? – vonPryz Jun 15 '22 at 05:55
  • I assume this concerns *Windows* PowerShell (you might explicitly **tag** this). This is indeed related to the (default) culture: `'ae' -eq 'æ'` --> `True`. See: [PowerShell: How to set culture?](https://stackoverflow.com/questions/60266401/powershell-how-to-set-culture) – iRon Jun 15 '22 at 08:14
  • It makes sense that it is the culture. Thanks for your input. – Anders Jun 15 '22 at 16:17

2 Answers

As mentioned in the comments, your current culture settings treat ae and æ as equal, so Select-Object -Unique returns only the first of the two that it encounters in the input array.

If you reverse the order you'll get æ instead:

$Test = @('æ','ae')
$Test | Select-Object -Unique
# æ

You can check which culture PowerShell is using with this:

PS> Get-Culture

LCID             Name             DisplayName
----             ----             -----------
2057             en-GB            English (United Kingdom)

Although note that per @mklement0's comment, PowerShell doesn't use this culture consistently for everything...

Turns out that the current culture indeed applies to Select-Object -Unique (which is currently unexpectedly also (invariably) case-sensitive). It seems that PowerShell has a split personality with respect to culture invariance: [string] casts, string interpolation and string-relevant operators (except >) use the invariant culture, whereas cmdlets use the current one.

In any case, rather than a culture-aware comparison, it sounds like what you're after is an "ordinal" comparison - for more details see Ordinal String Operations:

Ordinal comparisons are string comparisons in which each byte of each string is compared without linguistic interpretation; for example, "windows" does not match "Windows".

(And by extension, ae does not equal æ.)

I can't find an idiomatic way to do that in PowerShell (you can change the culture with Set-Culture, but all the ones I tried still treat ae as equal to æ). If you want more control over how values are compared, you could drop down into LINQ like this:

PS> $data = @( "ae", "æ" )
PS> [System.Linq.Enumerable]::Distinct([string[]]$data, [System.StringComparer]::Ordinal )
ae
æ
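If you would rather stay in the pipeline, the same ordinal idea can be sketched with a HashSet. This is only a minimal sketch: Get-OrdinalUnique is a hypothetical helper name I made up for illustration, not a built-in cmdlet, and it assumes PowerShell 5 or later (for ::new()) and string input.

```powershell
# Hypothetical helper (not a built-in cmdlet): emits each string the first
# time it is seen, comparing ordinally (code unit by code unit).
function Get-OrdinalUnique {
    [CmdletBinding()]
    param(
        [Parameter(ValueFromPipeline)]
        [string] $InputString
    )
    begin {
        # The set remembers every string already emitted, using ordinal equality.
        $seen = [System.Collections.Generic.HashSet[string]]::new([System.StringComparer]::Ordinal)
    }
    process {
        # Add() returns $true only when the value was not in the set yet.
        if ($seen.Add($InputString)) { $InputString }
    }
}

# Input order is preserved; only the exact duplicate 'ae' is dropped.
$unique = @('ae', 'æ', 'ae', 'AE') | Get-OrdinalUnique
$unique
```

Because the comparison is ordinal, it is also case-sensitive, so 'AE' survives alongside 'ae'.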

You've then got a whole bunch of different ways to compare strings:

https://learn.microsoft.com/en-us/dotnet/api/system.stringcomparer?view=net-6.0#properties

  • CurrentCulture - Gets a StringComparer object that performs a case-sensitive string comparison using the word comparison rules of the current culture.

  • CurrentCultureIgnoreCase - Gets a StringComparer object that performs case-insensitive string comparisons using the word comparison rules of the current culture.

  • InvariantCulture - Gets a StringComparer object that performs a case-sensitive string comparison using the word comparison rules of the invariant culture.

  • InvariantCultureIgnoreCase - Gets a StringComparer object that performs a case-insensitive string comparison using the word comparison rules of the invariant culture.

  • Ordinal - Gets a StringComparer object that performs a case-sensitive ordinal string comparison.

  • OrdinalIgnoreCase - Gets a StringComparer object that performs a case-insensitive ordinal string comparison.

and you can even implement your own:

class FirstLetterComparer : System.Collections.Generic.IEqualityComparer[string] {
  [bool]Equals([string]$x, [string]$y) { return $x[0] -eq $y[0]; }
  [int]GetHashCode([string] $x) { return $x[0].GetHashCode(); }
}

# returns the first item in the list that starts with each distinct character.
# note that "abb" is omitted because it starts with the same first letter as "aaa"
# so it's not "first letter distinct".
$data = @( "aaa", "abb", "bbb" )
[System.Linq.Enumerable]::Distinct([string[]]$data, [FirstLetterComparer]::new() )
# aaa
# bbb
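
To see how the choice among the built-in comparers changes the result, here is a small sketch that runs the same Distinct call with two of the ordinal comparers listed above. The sample data and variable names are mine, chosen only for illustration.

```powershell
# Sample data (an assumption for illustration): two casings of "ae" plus the ligature.
$data = [string[]] @('ae', 'AE', 'æ')

# Case-sensitive ordinal: all three strings are distinct.
$ordinal = [string[]] [System.Linq.Enumerable]::Distinct($data, [System.StringComparer]::Ordinal)

# Case-insensitive ordinal: 'AE' collapses into 'ae', but 'æ' stays separate.
$ordinalIgnoreCase = [string[]] [System.Linq.Enumerable]::Distinct($data, [System.StringComparer]::OrdinalIgnoreCase)

"Ordinal:           $($ordinal.Count) items"
"OrdinalIgnoreCase: $($ordinalIgnoreCase.Count) items"
```

The culture-based comparers are left out on purpose, since their result for ae vs. æ depends on the culture and PowerShell edition in use.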
mclayton

To add to mclayton's excellent answer, with background information:

  • While with cmdlets such as Select-Object PowerShell does indeed use the current culture, there are contexts in which it uses the invariant culture, notably the -eq / -ne operators - see this answer.

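  • PowerShell has two distinct editions, Windows PowerShell and PowerShell (Core) 7+, and they differ with respect to the behavior at hand because of the .NET edition each is built on and the globalization library it uses (NLS vs. ICU, respectively).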

Read on for details.


æ is a ligature that is formed from the letters a and e.

  • Windows PowerShell / NLS:

    • The ligature æ is considered equivalent to the sequence of its constituent letters in most cultures, except in those:

      • where æ is in use as a character in its own right ...
      • and is not considered equivalent to the sequence of its constituent letters.
    • These exceptions are (only the so-called neutral (non-nation-specific) cultures are listed, not also their national varieties):

      • da (Danish)
      • is (Icelandic)
      • kl (Kalaallisut)
      • nb (Norwegian Bokmål)
      • nn (Norwegian Nynorsk)
      • no (Norwegian)
      • se (Northern Sami)
      • sma (Sami (Southern))
      • smj (Sami (Lule))
      • smn (Sami (Inari))
      • sms (Sami (Skolt))
    • Other ligatures have multi-letter equivalents in all cultures, such as œ vs. oe; there are also ligatures whose multi-letter equivalent is not the sequence of their constituent letters, but a modern equivalent, e.g., German ß (which originated from sz) is considered equivalent to ss.

  • PowerShell (Core) 7+ / ICU:

    • At least as of the ICU version that underlies PowerShell 7.2.4, ligatures in general are seemingly never considered equivalent to their constituent letters in string comparisons.
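
If you want to check what your own session does, a rough sketch along these lines may help. The variable names are mine, and the culture-sensitive results are deliberately not annotated with expected values, because they depend on the edition and the underlying NLS / ICU version as described above.

```powershell
# 'Desktop' = Windows PowerShell, 'Core' = PowerShell (Core) 7+.
$edition = $PSVersionTable.PSEdition

# Build the ligature from its code point so the script file's encoding doesn't matter.
$ligature = [string][char]0x00E6   # æ

# Culture-sensitive comparisons: 0 means the two strings are considered equal.
$invariantResult = [cultureinfo]::InvariantCulture.CompareInfo.Compare('ae', $ligature)
$danishResult    = [cultureinfo]::GetCultureInfo('da-DK').CompareInfo.Compare('ae', $ligature)

# An ordinal comparison ignores culture entirely and never equates the two.
$ordinalEqual = [string]::Equals('ae', $ligature, [System.StringComparison]::Ordinal)

[pscustomobject]@{
    Edition          = $edition
    InvariantCompare = $invariantResult
    DanishCompare    = $danishResult
    OrdinalEqual     = $ordinalEqual
}
```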
mklement0