In Windows PowerShell I have used `chcp 65001` and chosen a font that includes all of the characters I want.

If I display a UTF-8 file with `type file.u8`, it works fine and I get the desired characters.

If I run `myprogram.exe`, I get no output after the first non-ASCII character (if run before `chcp 65001`, it produces mojibake instead).

If I run `myprogram.exe > test.u8` followed by `type test.u8`, that works and I get the desired output.

So I reasoned I could bypass the file (using my limited PowerShell knowledge!) with `myprogram.exe | % {echo "$_"}`, and that works. It seems the C++ runtime does something special when talking directly to a console, and that something breaks UTF-8 output.

(I can get the desired output if I use wide characters, but I don't actually want UTF-16 output in the end; I want UTF-8. I just want the convenience of printing debug information without extra character transformations.)
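
The program source isn't reproduced here, but per the comment exchange below, the minimal reproduction boils down to something like this (a sketch, assuming MSVC and a pre-C++20 language mode where `u8"..."` is a narrow `char` array):

```cpp
// Assumed minimal reproduction; the question does not include the real source.
// Build with MSVC (pre-C++20) and run in a console after `chcp 65001`.
#include <cstdio>

int main() {
    // Writing straight to the console, output stops at the first
    // non-ASCII character; redirected to a file or pipe, it survives intact.
    std::printf(u8"HELLO 你好\n");
    return 0;
}
```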

Ben Jackson
  • Setting an output codepage of 65001 via `SetConsoleOutputCP` (or chcp.com at the command line) works in Windows 8 and 10. It's buggy in older versions. Even in Windows 10, setting the input codepage to 65001 via `SetConsoleCP` (or via chcp.com) is limited to 7-bit ASCII text due to a bad design in conhost.exe, which can't handle variable-sized encodings. That doesn't explain getting no output at all, however. Without redirection, this should have nothing to do with PowerShell; myprogram.exe writes directly to the console. Provide a minimal example to give people something to work with. – Eryk Sun Aug 20 '18 at 02:12
  • Also, I don't follow what you're using `setlocale` for regarding UTF-8. The CRT only supports UTF-8 as the locale encoding in the latest release of Windows 10. On the other hand, the low I/O layer (i.e. `_wopen`, `_setmode`) supports an `_O_U8TEXT` mode that UTF-8 encodes wide-character strings when written to a non-console file, but uses the wide-character API for console I/O. – Eryk Sun Aug 20 '18 at 02:14
  • @eryksun I have used `SetConsoleOutputCP` and it has the same effect as typing `chcp` in the console. I think the minimum repro is just `printf(u8"你好\n")`. – Ben Jackson Aug 20 '18 at 02:40
  • chcp.com just calls `SetConsoleCP` and `SetConsoleOutputCP`. The input and output codepages are global in conhost.exe, not per application, so it doesn't matter whether you call the functions directly or run `chcp.com 65001`. Are you leaving the CRT locale in the default "C" mode in this case? – Eryk Sun Aug 20 '18 at 03:03
  • @eryksun I have left it alone and also tried setting it to `"chinese"` and `"chinese-simplified"` and verified that a valid locale is returned. My conclusion was that sort of advice was relevant for wide-char output, but it didn't make UTF-8 output work. – Ben Jackson Aug 20 '18 at 03:46
  • It's a serious problem that `myprogram.exe` produces no output (not even empty boxes or mojibake) when writing UTF-8 directly to the console. That's worked for me without problems in Windows 8+. In non-Unicode mode, setting the locale forces the CRT to decode from the Chinese codepage to UTF-16 and back to the console codepage, which is pointless when using UTF-8. If you don't mind switching to wide characters, try `_setmode(_fileno(stdout), _O_U8TEXT)` and then `wprintf(L"你好\n")` (sketched after these comments). That should use UTF-16 for the console and UTF-8 for a disk file or pipe, plus setting the locale won't interfere. – Eryk Sun Aug 20 '18 at 04:20
  • @eryksun It produces mojibake in a brand new command prompt. If I type `chcp 65001` in that window and run it again I get nothing (except some non-UTF8 pure ASCII timing info on `cerr`). It matters that the *very first* character is not ASCII. If I add "HELLO" to the front of the output I get that; the output stops at the first non-ASCII character. – Ben Jackson Aug 20 '18 at 04:31
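
For reference, here is a compilable sketch of the `_setmode`/`wprintf` workaround suggested in the comments above (an illustration assuming an MSVC toolchain, not code from the original exchange):

```cpp
// Switch stdout's low-level CRT mode to _O_U8TEXT and write wide strings.
// The CRT then uses the wide-character API when stdout is a console, and
// UTF-8-encodes the text when stdout is a disk file or pipe.
#include <cstdio>
#include <fcntl.h>  // _O_U8TEXT
#include <io.h>     // _setmode, _fileno

int main() {
    _setmode(_fileno(stdout), _O_U8TEXT);
    std::wprintf(L"你好\n");
    // Note: narrow output functions (printf, fputs) must not be used on
    // this stream while it is in a Unicode text mode.
    return 0;
}
```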

1 Answer

In a comment exchange with @eryksun I realized I had overlooked an experiment: all of my attempts to use wide characters had been successful. So what if `type` and `echo` are actually capable of reading UTF-8 and outputting wide characters? To check, I redirected to a file:

```powershell
myprogram.exe | % {echo "$_"} > test.txt
```

Inspecting that text file, Notepad++ detects it as "UCS-2 LE BOM". In fact, every case that worked (`type`, redirection into files, etc.) produced two-byte (UTF-16) characters. Even `type foo.u8 > foo.txt` shows the expected increase in file size.

So the real issue is not my program (which is successfully outputting UTF-8); it's that several things along the way will silently transform that UTF-8 into something Windows prefers.
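
(As a diagnostic aside, not part of the original investigation: the behavior hinges on whether stdout is a real console or a redirected handle, and that is easy to check with documented CRT/Win32 calls, as in this sketch.)

```cpp
// Hypothetical diagnostic: report whether stdout is a genuine console
// or has been redirected to a pipe/file (e.g. via `| % {echo "$_"}`).
#include <cstdio>
#include <io.h>       // _isatty, _fileno
#include <windows.h>  // GetStdHandle, GetConsoleMode

int main() {
    // _isatty: the CRT's view of whether stdout is a character device.
    bool crtConsole = _isatty(_fileno(stdout)) != 0;

    // GetConsoleMode succeeds only on real console handles; it fails for
    // pipes and disk files.
    DWORD mode = 0;
    bool win32Console =
        GetConsoleMode(GetStdHandle(STD_OUTPUT_HANDLE), &mode) != 0;

    std::fprintf(stderr, "CRT console: %d, Win32 console: %d\n",
                 crtConsole, win32Console);
    return 0;
}
```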

Ben Jackson
  • When myprogram.exe writes UTF-8 to a pipe, PowerShell converts it to a native UTF-16 string. It writes to the console using the wide-character API. PowerShell is particularly annoying when you want a direct, binary pipeline between programs. It sets itself up as a man in the middle, converting text encodings and CRLF line endings. I think the simplest option in that case is to use `cmd /c`. – Eryk Sun Aug 20 '18 at 04:12
  • I suggest reading the two comment links [I left on this question](https://stackoverflow.com/questions/51933189/character-encoding-utf-8-in-powershell-session). They should give you insight into PowerShell and encoding. – Maximilian Burszley Aug 21 '18 at 00:11