
When I execute the following simple script in PowerShell 7.1, I get the (correct) value of 3, regardless of whether the script's encoding is Latin1 or UTF8.

'Bär'.length

This surprises me because I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF16-LE and in PowerShell 7.1 UTF-8.

Because both scripts evaluate the expression to 3, I am forced to conclude that PowerShell 7.1 applies some heuristic method to infer a script's encoding when executing it.

Is my conclusion correct and is this documented somewhere?

René Nyffenegger
  • Related: [What is the correct encoding for PS1 files](https://stackoverflow.com/questions/41939799/what-is-the-correct-encoding-for-ps1-files) – Mitch Dec 12 '20 at 17:42

3 Answers


I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF16-LE and in PowerShell 7.1 UTF-8.

There are two distinct default character encodings to consider:

  • The default output encoding used by various cmdlets (Out-File, Set-Content) and the redirection operators (>, >>) when writing a file.

    • This encoding varies wildly across cmdlets in Windows PowerShell (PowerShell versions up to 5.1) but now - fortunately - consistently defaults to BOM-less UTF-8 in PowerShell [Core] v6+ - see this answer for more information.

    • Note: This encoding is always unrelated to the encoding of a file that data may have been read from originally, because PowerShell does not preserve this information and never passes text as raw bytes through - text is always converted to .NET ([string], System.String) instances by PowerShell before the data is processed further.

  • The default input encoding when reading a file - both source code read by the engine and files read by cmdlets such as Get-Content - which applies only to files without a BOM (files with a BOM are always recognized properly).

    • In the absence of a BOM:

      • Windows PowerShell assumes the system's active ANSI code page, such as Windows-1252 on US-English systems. Note that this means that systems with different active system locales (settings for non-Unicode applications) can interpret a given file differently.

      • PowerShell [Core] v6+ more sensibly assumes UTF-8, which is capable of representing all Unicode characters and whose interpretation doesn't depend on system settings.

    • Note that these are fixed, deterministic assumptions - no heuristic is employed.

    • The upshot is that for cross-edition source code the best encoding to use is UTF-8 with BOM, which both editions recognize properly.
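
To make that concrete, here is a minimal sketch of saving a script as UTF-8 with a BOM by calling .NET directly, so the result does not depend on either edition's -Encoding defaults (the file name Bar.ps1 is just a placeholder):

$scriptText = "'Bär'.Length"
$utf8WithBom = [System.Text.UTF8Encoding]::new($true)    # $true = emit a BOM
[System.IO.File]::WriteAllText("$PWD\Bar.ps1", $scriptText, $utf8WithBom)
# Both Windows PowerShell 5.1 and PowerShell 7+ now decode Bar.ps1 identically.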


As for a source-code file containing 'Bär'.length:

If the source-code file's encoding is properly recognized, the result is always 3, given that a .NET string instance ([string], System.String) is constructed, which in memory is always composed of UTF-16 code units ([char], System.Char), and given that .Length counts the number of these code units.[1]

Leaving broken files out of the picture (such as a UTF-16 file without a BOM, or a file with a BOM that doesn't match the actual encoding):

The only scenario in which .Length does not return 3 is:

  • In Windows PowerShell, if the file was saved as a UTF-8 file without a BOM.

    • Since ANSI code pages use a fixed-width single-byte encoding, each byte that is part of a UTF-8 byte sequence is individually (mis-)interpreted as a character, and since ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is encoded as 2 bytes in UTF-8, 0xc3 and 0xa4, the resulting string has 4 characters.
    • Thus, the string renders as BÃ¤r
  • By contrast, in PowerShell [Core] v6+, a BOM-less file that was saved based on the active ANSI (or OEM) code page (e.g., with Set-Content in Windows PowerShell) causes all non-ASCII characters (in the 8-bit range) to be considered invalid characters - because they cannot be interpreted as UTF-8.

    • All such invalid characters are simply replaced with � (REPLACEMENT CHARACTER, U+FFFD) - in other words: information is lost.
    • Thus, the string renders as B�r - and its .Length is still 3.
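
Both scenarios can be reproduced with a short sketch like the one below; it assumes a Windows machine with an active ANSI code page such as Windows-1252 and both editions installed, and Bar.ps1 is a placeholder name:

# Write the script as BOM-less UTF-8 ('ä' becomes the two bytes C3 A4).
$utf8NoBom = [System.Text.UTF8Encoding]::new($false)     # $false = no BOM
[System.IO.File]::WriteAllText("$PWD\Bar.ps1", "'Bär'.Length", $utf8NoBom)

powershell.exe -NoProfile -File .\Bar.ps1   # Windows PowerShell: prints 4 (bytes read via the ANSI code page)
pwsh -NoProfile -File .\Bar.ps1             # PowerShell 7+:      prints 3 (file correctly decoded as UTF-8)

# Conversely, a BOM-less ANSI-encoded file read by PowerShell 7+ yields B�r,
# whose .Length is still 3, as described above.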

[1] A single UTF-16 code unit is capable of directly encoding all 65K characters in the so-called BMP (Basic Multi-Lingual Plane) of Unicode, but for characters outside this plane pairs of code units encode a single Unicode character. The upshot: .Length doesn't always return the count of characters, notably not with emoji; e.g., a single emoji such as 😀 has a .length of 2
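
A quick way to see the difference between code units and what the footnote calls characters, using an arbitrary non-BMP emoji as an example:

$s = [char]::ConvertFromUtf32(0x1F600)   # 😀 (U+1F600), outside the BMP
$s.Length                                # 2 - a surrogate pair of UTF-16 code units
[System.Globalization.StringInfo]::new($s).LengthInTextElements   # 1 - a single text element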

mklement0
  • *All such invalid characters are simply replaced with � (REPLACEMENT CHARACTER, U+FFFD)* is what I didn't know. So, the length of 3 makes perfect sense. Thank you very much for your answer and please excuse my accepting it late. – René Nyffenegger Dec 14 '20 at 19:59
  • My pleasure, @RenéNyffenegger; I'm glad it was helpful. – mklement0 Dec 14 '20 at 20:19

The encoding is unrelated in this case: you are calling string.Length, which is documented to return the number of UTF-16 code units. This roughly corresponds to the number of letters (when you ignore combining characters and high code points such as emoji).

Encoding only comes into play when converting, implicitly or explicitly, to or from a byte array, a file, or P/Invoke. It doesn't affect how .NET stores the data backing a string.
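
A small illustration of that boundary - the same in-memory string yields different byte counts only once an encoding is chosen for the conversion (a sketch, nothing specific to PS1 files):

$s = 'Bär'
$s.Length                                            # 3 - UTF-16 code units in memory
[System.Text.Encoding]::UTF8.GetBytes($s).Length     # 4 - 'ä' becomes the two bytes C3 A4
[System.Text.Encoding]::Unicode.GetBytes($s).Length  # 6 - UTF-16LE, two bytes per code unit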

Speaking to the encoding for PS1 files, that is dependent upon version. Older versions fall back to the system's active ANSI code page (e.g., Windows-1252), but will respect a BOM for UTF-16 or UTF-8. Newer versions use UTF-8 as the fallback.

In at least 5.1.19041.1, loading a file containing 'Bär'.Length saved as BOM-less UTF-8 (bytes 27 42 C3 A4 72 27 2E 4C 65 6E 67 74 68) and running it with . .\Bar.ps1 results in 4 being printed.

If the same file is saved as Windows-1252 (27 42 E4 72 27 2E 4C 65 6E 67 74 68), then it will print 3.
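
To reproduce those two files exactly, the raw bytes can be written directly; a sketch (the file names are placeholders):

[System.IO.File]::WriteAllBytes("$PWD\Bar-utf8.ps1", [byte[]](0x27,0x42,0xC3,0xA4,0x72,0x27,0x2E,0x4C,0x65,0x6E,0x67,0x74,0x68))
[System.IO.File]::WriteAllBytes("$PWD\Bar-1252.ps1", [byte[]](0x27,0x42,0xE4,0x72,0x27,0x2E,0x4C,0x65,0x6E,0x67,0x74,0x68))
# In Windows PowerShell 5.1 (no BOM in either file, so both are read as ANSI):
# . .\Bar-utf8.ps1    # prints 4
# . .\Bar-1252.ps1    # prints 3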

tl;dr: string.Length always returns the number of UTF-16 code units. PS1 files should be in UTF-8 with BOM for cross-version compatibility.

Mitch
  • The string *is* from a file, and nowhere do I indicate the file's encoding. So, somehow, PowerShell *must* infer the file's encoding. – René Nyffenegger Dec 12 '20 at 17:21
  • Right - which could cause the wrong character to be displayed if Powershell loads the script file in the wrong encoding (or if your console is set to the wrong encoding), but it would not cause the length to be different. There are still the same number of characters. – Mitch Dec 12 '20 at 17:27

I think without a BOM, PS 5 assumes ANSI (e.g. Windows-1252), while PS 7 assumes UTF-8 without a BOM. This file saved as ANSI in Notepad works in PS 5 but not perfectly in PS 7, just like a UTF-8 no-BOM file with special characters wouldn't work perfectly in PS 5. A UTF-16 .ps1 file would always have a BOM or encoding signature. A PowerShell string in memory is always UTF-16, but a character is considered to have a length of 1 except for emojis. If you have Emacs, esc-x hexl-mode is a nice way to look at it.

'¿Cómo estás?'
 format-hex file.ps1

   Label: C:\Users\js\foo\file.ps1

          Offset Bytes                                           Ascii
                 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
          ------ ----------------------------------------------- -----
0000000000000000 27 BF 43 F3 6D 6F 20 65 73 74 E1 73 3F 27 0D 0A '¿Cómo estás?'��
js2010