
I am trying to generate strings containing one of every ASCII character. I started with

32..255 | %{[char]$_ | Out-File -FilePath .\outfile.txt -Encoding ASCII -Append}

I expected the list of printable characters, but I got different characters.

Can anyone point me to either a better way to get my expected result or an explanation as to why I'm getting these results?

Maximilian Burszley
mberger
  • Take your `Out-File` and put it outside of the `ForEach-Object` call. `I got different characters.` *What* did you get? – Maximilian Burszley Oct 19 '18 at 13:40
  • ASCII is not ANSI. ANSI is a weird, confusing name for whatever character set is the default. There are no ASCII code points above 127, and 127 (DEL) is not printable. If you really want "ANSI" (for some values of "ANSI"), try `-Encoding Default` or `-Encoding OEM`. Figuring out the results and how to re-interpret them correctly is left as an exercise to the reader, though. Best advice: don't use any of these and stick to UTF-8. – Jeroen Mostert Oct 19 '18 at 13:44
  • Oh. Thanks! They're not exactly the same but it will do for my purposes. I'm fuzzy as to why that worked though. – mberger Oct 19 '18 at 13:46
  • The table you link to is some interesting encoding, possibly used by Juniper products, but it's not ASCII (and not Windows-1252 either, the most popular alternative). There is no built-in encoding class in .NET (and hence none in PowerShell either) that will map code points as specified in that table. – Jeroen Mostert Oct 19 '18 at 13:47
  • @TheIncorrigible1 I was doing this: `PS C:\Users\mberger> [char]1` (which printed `☺`) and was confused as to why, given that Unicode and ASCII are supposed to overlap – mberger Oct 19 '18 at 14:00
  • Mapping the code point `1` to a smiley face is neither ASCII nor Unicode; that's a curiosity of the IBM code pages used originally by DOS and sometimes still elsewhere. .NET (and by extension PowerShell) has no encodings that will make `1` print as a smiley face (I just get a fallback character when I try to display the byte `1` as a character; ditto for all other characters that are control codes in ASCII). – Jeroen Mostert Oct 19 '18 at 14:16
  • @JeroenMostert: You're right, the linked table cannot be ASCII - simply due to including code points above `0x7f` (`127`); while the characters there that are in the 8-bit range don't render exactly as they would in [OEM code page 437](https://en.wikipedia.org/wiki/Code_page_437), they look similar enough to suggest that the differences are incidental rendering artifacts. Either way, the table should not be labeled "ASCII". – mklement0 Oct 19 '18 at 23:42
  • Although printing all non-control characters is a fine exercise, you have two very strange concepts in your question: ASCII and 255. In the PC world, ASCII itself was not supported until well into the Windows era, and then only for completeness. 255 went out with the introduction of [Unicode](http://www.unicode.org/charts/nameslist/index.html) into Windows NT 4, NTFS, Visual Basic 4 and Java in the early 1990s. – Tom Blodget Oct 20 '18 at 14:08

1 Answer

[char[]] (32..255) | Set-Content outfile.txt

In Windows PowerShell this will create an "ANSI"-encoded file. "ANSI" is an umbrella term for the set of fixed-width, single-byte, 8-bit encodings on Windows that are supersets of ASCII. The specific "ANSI" encoding used is implied by the code page associated with the legacy system locale in effect on your system[1]; e.g., Windows-1252 on US-English systems.
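
If you want to check which "ANSI" code page is in effect on your machine, one way is via .NET (a sketch; the commented output is what a US-English system would show):

# In Windows PowerShell (.NET Framework), [Text.Encoding]::Default reflects the
# system's active ANSI code page; in PowerShell Core it reports UTF-8 instead.
[System.Text.Encoding]::Default.CodePage   # -> 1252 on a US-English system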

See the bottom section for why "ANSI" encoding should be avoided.

If you were to do the same thing in PowerShell Core, you'd get a UTF-8-encoded file without a BOM, which is the best encoding to use for cross-platform and cross-locale compatibility.

In Windows PowerShell, adding -Encoding utf8 would give you a UTF-8 file too, but with a BOM.
If you used -Encoding Unicode, or simply used the redirection operator > or Out-File, you'd get a UTF-16LE-encoded file.
(In PowerShell Core, by contrast, > produces BOM-less UTF-8 by default, because the latter is the consistently applied default encoding.)
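
To see which encoding you actually got, you can inspect the file's first bytes for a BOM; a minimal sketch (the file name is just an example):

# BOM signatures: 0xEF 0xBB 0xBF = UTF-8; 0xFF 0xFE = UTF-16LE; 0xFE 0xFF = UTF-16BE;
# anything else means the file starts without a BOM.
[System.IO.File]::ReadAllBytes("$PWD\outfile.txt")[0..2] | ForEach-Object { '0x{0:X2}' -f $_ }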

Note: With strings and numbers, Set-Content and > / Out-File can be used interchangeably (encoding differences in Windows PowerShell aside); for other types, only > / Out-File produces meaningful representations, albeit suitable only for human eyeballs, not programmatic processing - see this answer for more.
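
As a quick illustration of that difference (a sketch; the file names are arbitrary, and the exact date text depends on your culture settings):

# For strings, both cmdlets write the same text (encoding differences aside):
'hello' | Set-Content t1.txt
'hello' | Out-File t2.txt
# For complex objects, Set-Content writes each object's .ToString() value,
# whereas Out-File writes the for-display (formatted) representation:
Get-Date | Set-Content t3.txt   # e.g., 10/19/2018 13:40:00
Get-Date | Out-File t4.txt      # e.g., Friday, October 19, 2018 1:40:00 PM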

ASCII code points are limited to 7-bit values, i.e., the range 0x0 - 0x7f (127).

Therefore, your input values 128 - 255 cannot be represented as ASCII characters, and using -Encoding ASCII results in invalid input characters getting replaced with literal ? characters (code point 0x3f / 63), resulting in loss of information.
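
You can observe that substitution directly; a minimal sketch:

# Encoding U+00FF as ASCII degrades it to '?' (code point 0x3f / 63):
[System.Text.Encoding]::ASCII.GetBytes([string] [char] 0xFF)   # -> 63
[char] 0xFF | Out-File test.txt -Encoding ASCII
Get-Content test.txt                                           # -> ?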


Important:

In memory, casting numbers such as 32 (0x20) or 255 (0xFF) to [char] (System.Char) instances causes the numbers to be interpreted as UTF-16 code units, representing Unicode characters[2] such as U+0020 and U+00FF as 2-byte sequences using the native byte order, because that's what characters are in .NET.
Similarly, instances of the .NET [string] type System.String are sequences of one or more [char] instances.
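
A small sketch that makes the in-memory representation visible:

# A [char] is a UTF-16 code unit; [Text.Encoding]::Unicode (UTF-16LE) exposes
# the underlying 2 bytes per code unit in little-endian byte order:
[int] [char] 'ÿ'                                                # -> 255 (U+00FF)
[System.Text.Encoding]::Unicode.GetBytes([string] [char] 0xFF)  # -> 255, 0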

On output to a file or during serialization, re-encoding of these UTF-16 strings may occur, depending on the implied or specified output encoding.

  • If the output encoding is a fixed single-byte encoding, such as ASCII, Default ("ANSI"), or OEM, loss of information may occur, namely if the string to output contains characters that cannot be represented in the target encoding.

  • Choose one of the Unicode-based encoding formats to guarantee that:

    • no information is lost,
    • the resulting file is interpreted the same on all systems, irrespective of their system locale.
    • UTF-8 is the most widely recognized encoding, but note that Windows PowerShell (unlike PowerShell Core) invariably prepends a BOM to such files, which can cause problems on Unix-like platforms and with utilities of Unix heritage. UTF-8 is focused on and optimized for backward compatibility with ASCII, and uses between 1 and 4 bytes to encode a single character.
    • UTF-16LE (which PowerShell calls Unicode) is a direct representation of the in-memory code units, but note that each character is encoded with (at least) 2 bytes, which can result in files up to twice the size of their UTF-8 equivalents for strings that primarily contain characters in the ASCII range (see the size-comparison sketch after this list).
    • UTF-16BE (which PowerShell calls bigendianunicode) reverses the byte order in each code unit.
    • UTF-32LE (which PowerShell calls UTF32) represents each Unicode character as a fixed 4-byte sequence; even more so than with UTF-16, this typically results in unnecessarily large files.
    • UTF-7 should be avoided altogether, as it is not part of the Unicode standard.
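
The size differences are easy to demonstrate; a sketch for Windows PowerShell (the file names are arbitrary):

# The same 224-character string, saved with three different encodings:
$s = -join [char[]] (32..255)
$s | Set-Content ansi.txt                     # "ANSI": 1 byte per character
$s | Set-Content utf8.txt -Encoding UTF8      # UTF-8 (with BOM): 1-2 bytes per character here
$s | Set-Content utf16.txt -Encoding Unicode  # UTF-16LE (with BOM): 2 bytes per character
Get-ChildItem ansi.txt, utf8.txt, utf16.txt | Select-Object Name, Length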

[1] Among the legacy code pages supported on Windows, there are also fixed double-byte as well as variable-width encodings, but only for East Asian locales; sometimes they're (incorrectly) collectively referred to as DBCS (Double-Byte Character Set), as opposed to SBCS (Single-Byte Character Set); see the list of all Windows code pages.

[2] Strictly speaking, a UTF-16 code unit identifies a Unicode code point, but not every code point by itself is a complete Unicode character, because some (rare) Unicode characters have a code point value that falls outside the range that can be represented with a 16-bit integer; such code points can alternatively be represented by a sequence of 2 code units, known as a surrogate pair.
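
A small sketch demonstrating a surrogate pair (the emoji code point is just an example):

# U+1F600 lies outside the 16-bit range, so .NET stores it as two [char]
# code units, a so-called surrogate pair:
$s = [char]::ConvertFromUtf32(0x1F600)
$s.Length                        # -> 2
[int] $s[0]; [int] $s[1]         # -> 55357 (0xD83D), 56832 (0xDE00)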

mklement0