
I'm trying to create a PowerShell hash table to convert non-ASCII (UTF-8) characters to their ASCII look-alikes.

Here are two hash table entries as examples: 'ñ'='n' and 'Ñ'='N'.

Editor's note: Using both these entries in the same hash table literal (@{ 'ñ'='n'; 'Ñ'='N' }) wouldn't work, because PowerShell uses hash tables with case-insensitive key lookups and therefore considers 'ñ' and 'Ñ' duplicate keys and complains. However, this is incidental to the problem at hand.

The first one works: 'ñ' is 0xc3b1. The second one does not work: 'Ñ' is 0xc391, which PowerShell won't accept. (The problem seems to be that 0x91 is outside the range of an acceptable PowerShell char.)

A simpler example of the problem is:

$c = [convert]::toChar(0x91)

which results in $c getting a value of 0x3f instead of 0x91. So what can I do to get 'Ñ'='N' into the hash table, or a char with a value of 0x91? I've already spent hours reading web pages and experimenting.

js2010
TomEggers
  • According to PowerShell, `[char]0xc391` is Korean character sselt `쎑`, try `[char]0x00D1` which is the utf-16 character. https://www.compart.com/en/unicode/U+00F1 – Nico Nekoru Jun 27 '20 at 23:56
  • Does this answer your question? [PowerShell Hash Tables Double Key Error: "a" and "A"](https://stackoverflow.com/questions/24054147/powershell-hash-tables-double-key-error-a-and-a) – 7cc Jun 28 '20 at 01:21
  • @7cc isn't this about Unicode not duplicate elements in a hashtable? – Nico Nekoru Jun 28 '20 at 01:24
  • Powershell 7 gives a better error message: `Duplicate keys 'Ñ' are not allowed in hash literals.` – js2010 Jun 28 '20 at 13:22
  • @NekoMusume utf16 and unicode are 2 different things, but microsoft makes it confusing – js2010 Jun 28 '20 at 13:36
  • @js2010 Windows PowerShell emits the same error message. The primary problem here is unrelated to (case-insensitively) duplicate hash-table keys; it is a _character encoding_ problem, stemming from the fact that Windows PowerShell (mis)reads BOM-less UTF-8 files as ANSI-encoded. – mklement0 Jun 29 '20 at 02:51
  • @js2010 I've edited the question to make it clearer, but my previous comment still applies: the question isn't about duplicate keys, and both PS editions use the same error message. – mklement0 Jul 01 '20 at 15:23
  • You're right about powershell versions. I guess my colors in one were harder to read. I would guess that he's reading the hashtables from a file? But there's no info about that in the question. – js2010 Jul 01 '20 at 15:29
  • Tom, I've added an explanation for the `[convert]::toChar(0x91)` behavior to the answer. – mklement0 Jul 01 '20 at 15:31
  • @js2010 There's no information about reading from a file (meaning: PowerShell reading a `*.ps1` file containing source code with a hash-table literal for execution) in the _question_, because Tom didn't realize that's where the problem was. My _answer_ clarifies that. – mklement0 Jul 01 '20 at 15:43

1 Answer


Note: By default, PowerShell hash tables, due to using case-insensitive lookups, do not support keys that are mere case variations of one another; therefore, ñ and Ñ - the former being the lowercase version of the latter - cannot both be used as keys - see bottom section.


In memory, all PowerShell strings are UTF-16 .NET strings, which are capable of representing all Unicode characters, so using characters such as Ñ as keys in hash tables is not a problem.
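For example, entered directly at an interactive prompt - so that no file encoding is involved - such a key works as expected; a quick sketch:

# Typed at the prompt, Ñ is simply the in-memory UTF-16 code unit U+00D1:
$ht = @{ 'Ñ' = 'N' }
$ht['Ñ']   # -> 'N'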

The problem you describe only arises when PowerShell misinterprets source code read from a file, due to assuming the wrong character encoding.

Your symptom suggests that your source code is UTF-8-encoded, but the file doesn't have a BOM, which causes Windows PowerShell (but, fortunately, no longer PowerShell [Core] v6+) to misinterpret the file as encoded based on the system's active legacy ANSI code page (e.g., Windows-1252 on US-English systems), a single-byte encoding.

Make sure that your source-code file is saved as UTF-8 with a BOM[1], and your problem will go away.
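If you'd rather fix the file from PowerShell itself than from your editor, a sketch along the following lines works in both PowerShell editions (the path is a placeholder): it explicitly reads the file as UTF-8 and rewrites it with a BOM.

# NOTE: '.\YourScript.ps1' is a placeholder; substitute your script's path.
$file = Convert-Path .\YourScript.ps1
$text = [System.IO.File]::ReadAllText($file, [System.Text.UTF8Encoding]::new($false))
[System.IO.File]::WriteAllText($file, $text, [System.Text.UTF8Encoding]::new($true))  # $true = emit a BOM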

What you think are Unicode code points, 0xc3b1 and 0xc391, are in reality the 2-byte UTF-8 encodings (0xc3 0xb1 and 0xc3 0x91) of the true code points corresponding to ñ and Ñ: 0xf1 and 0xd1, respectively.
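You can verify this from an interactive PowerShell session; a small sketch:

[int] [char] 'ñ'   # -> 241 (0xf1), the true code point
[int] [char] 'Ñ'   # -> 209 (0xd1)
# Each code point's UTF-8 encoding is the 2-byte sequence you saw:
[System.Text.Encoding]::UTF8.GetBytes('Ñ') | ForEach-Object { '0x{0:x2}' -f $_ }   # -> 0xc3, 0x91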


As for:

[convert]::toChar(0x91)

seemingly not producing a [char] instance with the given code point, 0x91 (decimal 145):

  • It does, namely in memory, which you can easily verify:

      [int] [convert]::toChar(0x91) # -> 145 (0x91)
    
  • You'll only get 0x3f - which is a literal ? character (try [char] 0x3f) - if you mistakenly save the in-memory representation with ASCII encoding: since 0x91 is outside the ASCII sub-range of Unicode (which goes from 0x00 to 0x7f), it cannot be represented in the output file, and the substitute character ? is used.
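You can observe the same substitution in memory, without writing a file, by asking the .NET ASCII encoder to encode the character; a quick sketch:

$c = [convert]::ToChar(0x91)
[System.Text.Encoding]::ASCII.GetBytes([string] $c)   # -> 63 (0x3f), i.e. '?'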


Note that PowerShell's hash tables are case-insensitive, so you cannot have keys that are merely case variations of one another:

# !! FAILS
PS> @{ Ñ = 'LATIN CAPITAL LETTER N WITH TILDE'; ñ = 'LATIN SMALL LETTER N WITH TILDE' }
...  Duplicate keys 'ñ' are not allowed in hash literals.

You must use the .NET [hashtable] type (System.Collections.Hashtable) directly to create case-sensitive hash tables:

# Create case-SENSITIVE hash table:
$ht = [hashtable]::new()
$ht['ñ'] = 'LATIN SMALL LETTER N WITH TILDE' 
$ht['Ñ'] = 'LATIN CAPITAL LETTER N WITH TILDE'
  • $ht now has 2 entries and $ht['ñ'] and $ht['Ñ'] retrieve the values case-sensitively.

  • By contrast, if you had used $ht = @{}, i.e. initialized the hash table as a regular, case-insensitive hash table, you'd only get 1 entry with value 'LATIN CAPITAL LETTER N WITH TILDE', because the 2nd assignment, $ht['Ñ'] =, simply updated the case-insensitively looked-up key created by the 1st statement.
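A sketch that contrasts the two behaviors ($htDefault is an illustrative variable name):

# Case-SENSITIVE hash table created above: both keys coexist.
$ht.Count   # -> 2

# Default, case-INSENSITIVE hash table: the 2nd assignment overwrites the 1st.
$htDefault = @{}
$htDefault['ñ'] = 'LATIN SMALL LETTER N WITH TILDE'
$htDefault['Ñ'] = 'LATIN CAPITAL LETTER N WITH TILDE'
$htDefault.Count   # -> 1
$htDefault['ñ']    # -> 'LATIN CAPITAL LETTER N WITH TILDE'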


[1] Alternatively, use a UTF-16 encoding, which invariably uses a BOM; the UTF-16LE form is (erroneously) referred to as Unicode in PowerShell.
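For instance (a sketch; test.txt is just a throwaway file name), Set-Content -Encoding Unicode produces a file that starts with the UTF-16LE BOM, 0xff 0xfe:

'Ñ' | Set-Content test.txt -Encoding Unicode
[System.IO.File]::ReadAllBytes((Convert-Path test.txt))[0..1] | ForEach-Object { '0x{0:x2}' -f $_ }   # -> 0xff, 0xfe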

mklement0