I was under the (apparently wrong) impression that the default encoding in PowerShell 5.1 is UTF-16LE and in PowerShell 7.1 UTF-8.
There are two distinct default character encodings to consider:
The default output encoding used by various cmdlets (Out-File, Set-Content) and the redirection operators (>, >>) when writing a file.
This encoding varies wildly across cmdlets in Windows PowerShell (PowerShell versions up to 5.1) but now - fortunately - consistently defaults to BOM-less UTF-8 in PowerShell [Core] v6+ - see this answer for more information.
Note: This encoding is always unrelated to the encoding of the file that the data may originally have been read from, because PowerShell does not preserve this information and never passes text through as raw bytes - text is always converted to .NET string instances ([string], System.String) before the data is processed further.
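To see the output-encoding defaults in action, here is a minimal sketch (the file names are arbitrary placeholders) that writes the same string via different mechanisms and then inspects the raw bytes that ended up in each file:

```powershell
# Windows PowerShell 5.1: Set-Content writes ANSI, Out-File / '>' write UTF-16LE with a BOM.
# PowerShell [Core] v6+:  both default to BOM-less UTF-8.
'Bär' | Set-Content .\sc.txt
'Bär' | Out-File    .\of.txt

# Inspect the first few raw bytes to see the encoding (and any BOM) actually written.
[System.IO.File]::ReadAllBytes((Convert-Path .\sc.txt))[0..3] | ForEach-Object { '{0:x2}' -f $_ }
[System.IO.File]::ReadAllBytes((Convert-Path .\of.txt))[0..3] | ForEach-Object { '{0:x2}' -f $_ }
```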
The default input encoding, when reading a file - both source code read by the engine and files read by Get-Content, for instance - which applies only to files without a BOM (because files with BOMs are always properly recognized).
In the absence of a BOM:
Windows PowerShell assumes the system's active ANSI code page, such as Windows-1252 on US-English systems. Note that this means that systems with different active system locales (settings for non-Unicode applications) can interpret a given file differently.
PowerShell [Core] v6+ more sensibly assumes UTF-8, which is capable of representing all Unicode characters and whose interpretation doesn't depend on system settings.
Note that these are fixed, deterministic assumptions - no heuristic is employed.
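If you cannot control how a BOM-less file was encoded, you can take these default assumptions out of the picture by stating the encoding explicitly when reading; a minimal sketch (the file path is a placeholder):

```powershell
# Explicitly request UTF-8 interpretation of a BOM-less file (works in both editions):
$text = Get-Content -Raw -Encoding UTF8 .\data.txt

# Windows PowerShell only: -Encoding Default means the system's active ANSI code page.
# $ansiText = Get-Content -Raw -Encoding Default .\data.txt
```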
The upshot is that for cross-edition source code the best encoding to use is UTF-8 with BOM, which both editions recognize properly.
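As a sketch of one way to (re-)save a script as UTF-8 with BOM from PowerShell itself - assuming the file's current content is already valid UTF-8, and using a placeholder path:

```powershell
$scriptPath = '.\MyScript.ps1'                             # placeholder path
$content    = Get-Content -Raw -Encoding UTF8 $scriptPath  # read the existing text as UTF-8
$utf8Bom    = [System.Text.UTF8Encoding]::new($true)       # $true -> emit a BOM when writing
[System.IO.File]::WriteAllText((Convert-Path $scriptPath), $content, $utf8Bom)
```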
As for a source-code file containing 'Bär'.length:
If the source-code file's encoding is properly recognized, the result is always 3, given that a .NET string instance ([string], System.String) is constructed, which in memory is always composed of UTF-16 code units ([char], System.Char), and given that .Length counts the number of these code units.[1]
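A quick way to see this is to look at the string's .Length alongside its individual UTF-16 code-unit values:

```powershell
'Bär'.Length                                                  # -> 3
[int[]] [char[]] 'Bär' | ForEach-Object { '0x{0:x4}' -f $_ }  # -> 0x0042, 0x00e4, 0x0072
```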
Leaving broken files out of the picture (such as a UTF-16 file without a BOM, or a file with a BOM that doesn't match the actual encoding):
The only scenario in which .Length does not return 3 is:
In Windows PowerShell, if the file was saved as a UTF-8 file without a BOM.
- Since ANSI code pages use a fixed-width single-byte encoding, each byte that is part of a UTF-8 byte sequence is individually (mis-)interpreted as a character, and since ä (LATIN SMALL LETTER A WITH DIAERESIS, U+00E4) is encoded as 2 bytes in UTF-8, 0xc3 and 0xa4, the resulting string has 4 characters.
- Thus, the string renders as BÃ¤r (0xc3 is Ã and 0xa4 is ¤ in Windows-1252).
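This misread can be simulated in memory; a sketch that assumes Windows-1252 as the active ANSI code page:

```powershell
# Take the UTF-8 bytes of 'Bär' and decode them as Windows-1252 - effectively what
# Windows PowerShell does with a BOM-less UTF-8 file on a US-English system:
$utf8Bytes = [System.Text.Encoding]::UTF8.GetBytes('Bär')                    # 0x42 0xc3 0xa4 0x72
$misread   = [System.Text.Encoding]::GetEncoding(1252).GetString($utf8Bytes)
$misread          # -> BÃ¤r
$misread.Length   # -> 4
```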
By contrast, in PowerShell [Core] v6+, a BOM-less file that was saved based on the active ANSI (or OEM) code page (e.g., with Set-Content in Windows PowerShell) causes all non-ASCII characters (in the 8-bit range) to be considered invalid characters - because they cannot be interpreted as UTF-8.
- All such invalid characters are simply replaced with � (REPLACEMENT CHARACTER, U+FFFD) - in other words: information is lost.
- Thus, the string renders as B�r - and its .Length is still 3.
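The converse misread can be sketched the same way (again assuming Windows-1252 as the ANSI code page):

```powershell
# Take the ANSI (Windows-1252) bytes of 'Bär' and decode them as UTF-8 - effectively
# what PowerShell [Core] v6+ does with such a BOM-less file:
$ansiBytes = [System.Text.Encoding]::GetEncoding(1252).GetBytes('Bär')    # 0x42 0xe4 0x72
$misread   = [System.Text.Encoding]::UTF8.GetString($ansiBytes)
$misread          # -> B�r  (0xe4 is not valid UTF-8 in this position -> U+FFFD)
$misread.Length   # -> 3
```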
[1] A single UTF-16 code unit is capable of directly encoding all 65K characters in the so-called BMP (Basic Multilingual Plane) of Unicode, but for characters outside this plane pairs of code units (surrogate pairs) encode a single Unicode character. The upshot: .Length doesn't always return the count of characters, notably not with emoji; e.g., the .Length of a string containing a single emoji such as 😀 is 2.
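As a sketch of the difference between counting code units and counting text elements (StringInfo is the standard .NET way to count the latter):

```powershell
$s = [char]::ConvertFromUtf32(0x1F600)                            # 😀 U+1F600, outside the BMP
$s.Length                                                         # -> 2 (UTF-16 code units)
[System.Globalization.StringInfo]::new($s).LengthInTextElements   # -> 1 (text element)
```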