```powershell
[char[]] (32..255) | Set-Content outfile.txt
```
In Windows PowerShell this will create an "ANSI"-encoded file. "ANSI" encoding is an umbrella term for the set of fixed-width, single-byte, 8-bit encodings on Windows that are supersets of ASCII encoding. The specific "ANSI" encoding that is used is implied by the code page associated with the legacy system locale in effect on your system[1]; e.g., Windows-1252 on US-English systems.
See the bottom section for why "ANSI" encoding should be avoided.
If you were to do the same thing in PowerShell Core, you'd get a UTF-8-encoded file without a BOM, which is the best encoding to use for cross-platform and cross-locale compatibility.
In Windows PowerShell, adding `-Encoding utf8` would give you a UTF-8 file too, but with a BOM. If you used `-Encoding Unicode` or simply used the redirection operator `>` or `Out-File`, you'd get a UTF-16LE-encoded file.
(In PowerShell Core, by contrast, `>` produces BOM-less UTF-8 by default, because the latter is the consistently applied default encoding.)
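The BOM difference is easy to see at the byte level; here is a sketch in Python, purely for illustration (its `utf-8-sig` codec mimics Windows PowerShell's BOM-prefixed UTF-8, plain `utf-8` the BOM-less PowerShell Core behavior):

```python
# UTF-8 with BOM (what Windows PowerShell's -Encoding utf8 writes):
with_bom = "A".encode("utf-8-sig")
assert with_bom == b"\xef\xbb\xbfA"   # the 3-byte BOM precedes the content

# BOM-less UTF-8 (PowerShell Core's default):
assert "A".encode("utf-8") == b"A"    # no prefix - just the content bytes
```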
Note: With strings and numbers, `Set-Content` and `>` / `Out-File` can be used interchangeably (encoding differences in Windows PowerShell aside); for other types, only `>` / `Out-File` produces meaningful representations, albeit suitable only for human eyeballs, not programmatic processing - see this answer for more.
ASCII code points are limited to 7-bit values, i.e., the range `0x0` - `0x7f` (`127`). Therefore, your input values `128` - `255` cannot be represented as ASCII characters, and using `-Encoding ASCII` results in the invalid input characters getting replaced with literal `?` characters (code point `0x3f` / `63`), resulting in loss of information.
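The substitution is easy to reproduce outside PowerShell; a sketch in Python (whose `errors="replace"` mode mimics the .NET ASCII encoder's default substitution behavior):

```python
# Code points 128-255 lie outside ASCII's 7-bit range, so encoding them
# as ASCII with substitution turns each one into a literal '?':
text = "é€ÿ"                                   # all outside 0x00-0x7F
assert text.encode("ascii", errors="replace") == b"???"
assert b"?"[0] == 0x3F                         # '?' is code point 0x3F / 63

# In-range characters pass through unchanged:
assert "abc".encode("ascii", errors="replace") == b"abc"
```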
**Important:**

In memory, casting numbers such as `32` (`0x20`) or `255` (`0xFF`) to `[char]` (`System.Char`) instances causes the numbers to be interpreted as UTF-16 code units, representing Unicode characters[2] such as `U+0020` and `U+00FF` as 2-byte sequences using the native byte order, because that's what characters are in .NET.
Similarly, instances of the .NET `[string]` type (`System.String`) are sequences of one or more `[char]` instances.
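To illustrate the 2-bytes-per-code-unit representation, here is a sketch in Python (its `utf-16-le` codec spells out the same little-endian code units that .NET characters use in memory on typical hardware):

```python
# Casting 32 (0x20) and 255 (0xFF) to characters yields the Unicode
# characters U+0020 and U+00FF; as UTF-16LE code units, each occupies
# 2 bytes, low-order byte first:
assert chr(32) == " " and chr(255) == "ÿ"
assert " ".encode("utf-16-le") == b"\x20\x00"   # U+0020
assert "ÿ".encode("utf-16-le") == b"\xff\x00"   # U+00FF

# A string is simply a sequence of such code units:
assert " ÿ".encode("utf-16-le") == b"\x20\x00\xff\x00"
```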
On output to a file or during serialization, re-encoding of these UTF-16 strings may occur, depending on the implied or specified output encoding.
If the output encoding is a fixed single-byte encoding, such as `ASCII`, `Default` ("ANSI"), or `OEM`, loss of information may occur, namely if the string to output contains characters that cannot be represented in the target encoding.
Choose one of the Unicode-based encoding formats to guarantee that:
- no information is lost, and
- the resulting file is interpreted the same on all systems, irrespective of their system locale.
- UTF-8 is the most widely recognized encoding, but note that Windows PowerShell (unlike PowerShell Core) invariably prepends a BOM to such files, which can cause problems on Unix-like platforms and with utilities of Unix heritage; it is a format focused on and optimized for backward compatibility with ASCII encoding that uses between 1 and 4 bytes to encode a single character.
- UTF-16LE (which PowerShell calls `Unicode`) is a direct representation of the in-memory code units, but note that each character is encoded with (at least) 2 bytes, which results in files up to twice the size of UTF-8 files for strings that primarily contain characters in the ASCII range.
- UTF-16BE (which PowerShell calls `bigendianunicode`) reverses the byte order in each code unit.
- UTF-32LE (which PowerShell calls `UTF32`) represents each Unicode character as a fixed 4-byte sequence; even more so than with UTF-16, this typically results in unnecessarily large files.
- UTF-7 should be avoided altogether, as it is not part of the Unicode standard.
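The size trade-offs among these formats can be sketched by encoding the same characters with each; again in Python, for illustration (the explicit-endianness codec names avoid BOMs, so the counts reflect content bytes only):

```python
def sizes(s: str) -> dict:
    """Byte count of s in each Unicode encoding form (no BOM)."""
    return {enc: len(s.encode(enc))
            for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# ASCII-range text: UTF-8 is the most compact, UTF-16 twice the size.
assert sizes("A") == {"utf-8": 1, "utf-16-le": 2, "utf-32-le": 4}

# Non-ASCII BMP character (U+20AC EURO SIGN): UTF-8 needs 3 bytes while
# UTF-16 still needs only 2 - UTF-8's 1-4 bytes per character in action.
assert sizes("€") == {"utf-8": 3, "utf-16-le": 2, "utf-32-le": 4}
```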
[1] Among the legacy code pages supported on Windows, there are also fixed double-byte as well as variable-width encodings, but only for East Asian locales; sometimes they're (incorrectly) collectively referred to as DBCS (Double-Byte Character Set), as opposed to SBCS (Single-Byte Character Set); see the list of all Windows code pages.
[2] Strictly speaking, a UTF-16 code unit identifies a Unicode code point, but not every code point by itself is a complete Unicode character, because some (rare) Unicode characters have a code point value that falls outside the range that can be represented with a 16-bit integer; such code points must instead be represented by a sequence of 2 other code points, known as a surrogate pair.
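The surrogate-pair mechanism can be sketched in Python (the emoji chosen here is just an arbitrary example of a code point above `U+FFFF`):

```python
s = "😀"                                   # U+1F600, outside the 16-bit range
assert ord(s) == 0x1F600 and len(s) == 1   # a single code point...
assert len(s.encode("utf-32-le")) == 4     # ...one UTF-32 unit, but
assert len(s.encode("utf-16-le")) == 4     # TWO 16-bit UTF-16 code units

# Those two code units are the high and low surrogates:
units = s.encode("utf-16-le")
hi = int.from_bytes(units[0:2], "little")
lo = int.from_bytes(units[2:4], "little")
assert (hi, lo) == (0xD83D, 0xDE00)
```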