
This Stack Overflow question deals with 16-bit Unicode characters. I would like a similar solution that supports 32-bit characters. See this link for a listing of the various Unicode charts. For example, one range of characters that lies beyond the 16-bit limit is the Musical Symbols block.

The answer in the question linked above doesn't work because it casts the System.Int32 value as a System.Char, which is a 16-bit type.

Edit: Let me clarify that I don't particularly care about displaying the 32-bit Unicode character, I just want to store the character in a string variable.

Edit #2: I wrote a PowerShell snippet that uses the info in the marked answer and its comments. I would have put this in a comment, but comments can't be multi-line.

$inputValue = '1D11E'                                              # code point U+1D11E as hex text
$hexValue = [int]"0x$inputValue" - 0x10000                         # offset into the supplementary planes
$highSurrogate = [int][math]::Floor($hexValue / 0x400) + 0xD800    # top 10 bits (Floor, because a plain [int] cast rounds)
$lowSurrogate = $hexValue % 0x400 + 0xDC00                         # bottom 10 bits
$stringValue = [char]$highSurrogate + [char]$lowSurrogate

Dour High Arch still deserves credit for the answer for helping me finally understand surrogate pairs.
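
For anyone who wants to double-check a pair, here is a minimal sketch (using the .NET helper [char]::ConvertToUtf32, which reverses the mapping) that confirms the two surrogates round-trip to the original code point:

$roundTrip = [char]::ConvertToUtf32([char]$highSurrogate, [char]$lowSurrogate)   # recombine the pair
'{0:X}' -f $roundTrip                                                            # prints 1D11E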

Chuck Heatherly
  • Technically, there are no 32-bit Unicode code points, as Unicode is only a 21-bit code. – Joey Jan 29 '11 at 20:10
  • Seems nitpicky. Obviously U+1D11E doesn't use ALL 32 bits, but it is greater than 16 bits, thus why the question needed to be asked (since the linked question's answer only works for 16 bits). PowerShell and .NET have Int16 and Int32 types, is there one named Int21? Thus 32 is the next logical increment. – Chuck Heatherly Jan 29 '11 at 20:52
  • That ain't nitpicking. You're using Unicode terminology incorrectly, and being corrected. Unicode doesn't define **characters** as having the **bitness** property. It's incorrect to talk about "32-bit characters" or "16-bit characters", since Unicode defines neither concept. *Character* is an abstract writing symbol with various properties (like is it upper or lower case, is it RTL or LTR, &c). With how many bits it is **encoded** depends on the particular **encoding** used to **encode** the character into bytes. E.g. `â` is encoded to `C3 A2` in UTF-8, and to `E2` in ISO-8859-1 (aka Latin-1). – ulidtko Jan 05 '15 at 16:47
  • the [other question](https://stackoverflow.com/q/1056692/995714) already has updated answers for the full Unicode range. Try `echo "\`u{1F44D}"` or `echo [char]::ConvertFromUtf32(0x1F44D)` – phuclv Mar 25 '21 at 07:31
  • Does this answer your question? [How do I encode Unicode character codes in a PowerShell string literal?](https://stackoverflow.com/questions/1056692/how-do-i-encode-unicode-character-codes-in-a-powershell-string-literal) – phuclv Mar 25 '21 at 07:36
  • @phuclv Are you kidding me? I quoted that link in the first four words of my question. Ten years ago. – Chuck Heatherly Jul 30 '21 at 17:40
  • @ChuckHeatherly did you read that question again? It has answers for UTF-32 – phuclv Jul 30 '21 at 17:41
  • Yeah it didn't have those comments 10 years ago. – Chuck Heatherly Jul 30 '21 at 18:36

3 Answers


IMHO, the most elegant way to use Unicode literals in PowerShell is

[char]::ConvertFromUtf32(0x1D11E)

See my blog post for more details.
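
As a quick illustration (a sketch; the variable name is just for demonstration), the returned value is an ordinary .NET string that stores the code point as a surrogate pair:

$gClef = [char]::ConvertFromUtf32(0x1D11E)             # MUSICAL SYMBOL G CLEF
$gClef.Length                                          # 2 -- one UTF-16 surrogate pair
'{0:X4} {1:X4}' -f [int]$gClef[0], [int]$gClef[1]      # D834 DD1E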

mnaoumov
  • `U+1F4A9` is a much more relevant, satisfying, and fitting example, especially for testing PowerShell. – ulidtko Jan 05 '15 at 16:52
  • So there's no way to turn a string of emojis into a character array? `''` 1f600-1f606 – js2010 Jun 15 '20 at 23:24
  • @js2010, I'm not strong in PowerShell frankly... [There is literally `.ToCharArray()`](https://tio.run/##Vc6/TgJBEMfxnqf4iRR3iTGKNl5iDCGhs5PKWOytC7sIN5uZWZDOP@gz2PmKvsE5jYXFp5ni@5tMu8ASw3rd96OwoVUSXGM4qu59dPzQNFPqtoF1xrSZ6@JiXJ09n88uJ1d1fYOf768X82rezLs5mA/zORwMgo@Ev@gxdsRPcoK2KNT2BEXQkcbULRtoDGj3GgSrIorsROzGVJbxf@j0jqb22YTZ7avasp5dVieaPOZd8vQYICVnYj1Cy25LuE2eSWihff8L) method on strings — but it outputs garbage. You'd have to somehow work around those *surrogate pairs* shenanigans which Microsoft loves so much. Maybe try UTF-8 and a less broken Unicode library?.. – ulidtko Jun 16 '20 at 13:14
  • @ulidtko I worked out a way with utf32 but it seemed like a lot of work https://stackoverflow.com/questions/62391665/spliting-an-emoji-sequence-in-powershell/62391840#62391840 – js2010 Jun 16 '20 at 13:29

Assuming PowerShell uses UTF-16, 32-bit code points are represented as surrogates. For example, U+10000 is represented as:

0xD800 0xDC00

That is, two 16-bit chars; hex D800 and DC00.

Good luck finding a font with surrogate chars.
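
If you want to compute the pair yourself, here is a rough sketch of the standard UTF-16 arithmetic (variable names are arbitrary):

$codePoint = 0x10000
$offset = $codePoint - 0x10000                         # offset into the supplementary planes
$high = 0xD800 + [int][math]::Floor($offset / 0x400)   # top 10 bits
$low  = 0xDC00 + ($offset % 0x400)                     # bottom 10 bits
'{0:X4} {1:X4}' -f $high, $low                         # D800 DC00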

Dour High Arch
  • How did you translate U+1D100 to those two surrogates? I see where the first one (D100) came from, but how did you come up with DC00 for the second? – Chuck Heatherly Jan 29 '11 at 01:57
  • Apologies, it was a typo. Fixed. The formula is given in the Wikipedia link – Dour High Arch Jan 29 '11 at 01:58
  • OK, I found a link to http://www.i18nguy.com/unicode/surrogatetable.html, which shows how to look up the high and low surrogates, given the 32-bit value you want to encode. So the value 1D11E is given by the surrogate pair D834 DD1E. And so I would cast each of those 16-bit values to a System.Char and then put both into a string variable. Thanks for helping me understand surrogate pairs finally! – Chuck Heatherly Jan 29 '11 at 02:31

FYI: If anyone wants to store surrogate pairs in a case-sensitive hashtable, this seems to work:

$NCRs = New-Object System.Collections.Hashtable                  # case-sensitive keys, unlike a @{ } literal
$NCRs['Yopf'] = [string]::new(([char]0xD835, [char]0xDD50))      # surrogate pair for U+1D550 (double-struck capital Y)
$NCRs['yopf'] = [string]::new(([char]0xD835, [char]0xDD6A))      # surrogate pair for U+1D56A (double-struck small y)
$NCRs['Yopf']
$NCRs['yopf']

Outputs:

𝕐
𝕪
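The reason for New-Object System.Collections.Hashtable rather than a plain @{ } literal is that PowerShell's hashtable literal compares string keys case-insensitively, so 'Yopf' and 'yopf' would end up as the same key. A small sketch of the difference, reusing the same two code points:

$insensitive = @{}                                         # PowerShell literal: case-insensitive keys
$insensitive['Yopf'] = [char]::ConvertFromUtf32(0x1D550)
$insensitive['yopf'] = [char]::ConvertFromUtf32(0x1D56A)   # overwrites the previous entry
$insensitive.Count                                         # 1

$sensitive = New-Object System.Collections.Hashtable       # default comparer: case-sensitive
$sensitive['Yopf'] = [char]::ConvertFromUtf32(0x1D550)
$sensitive['yopf'] = [char]::ConvertFromUtf32(0x1D56A)
$sensitive.Count                                           # 2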
Darin