
It is my understanding that, by default, `Character` is Latin-1, `Wide_Character` is UCS-2, and `Wide_Wide_Character` is UCS-4, but that GNAT can be given `pragma Wide_Character_Encoding (UTF8);` or the `-gnatW8` switch, in which case those characters and their strings are UTF-8 encoded instead.

At least on Linux and FreeBSD, the results fit with my expectations. But on Windows the results are odd.

For either the Wide or Wide_Wide variants, once a character moves beyond the ASCII set, I get a garbled mess. I believe this is called mojibake. So I figured it was a codepage issue. After all, the default codepage in Windows, and therefore what the Console Host loads with, is 437, which isn't the UTF-8 codepage. After `chcp 65001`, instead of the mess of extra characters, an exception is raised immediately: `ADA.IO_EXCEPTIONS.DEVICE_ERROR : a-ztexio.adb:1295`. Looking at where the exception occurred, it seems to be in the `putc` binding of `fputc()`. But this is `Standard_Output`; shouldn't an EOF never happen there?

Is there some kind of special consideration Windows needs? How can I get UTF-8 output?

edit:
I tried piping the output into a text file. The supposedly UTF-8-encoded program still generates mojibake in the file. I'm not sure why this would immediately throw an exception in the console, though.

So then I tried directly opening and writing to a file instead of the console/pipe. Oddly, this works exactly as it should. The text is completely correct.

I've never seen this kind of behavior with any other language, so it should still be possible to get proper UTF-8 at the console, right?
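
For reference, here is a minimal C sketch of what the failing path effectively does (hypothetical; `put_utf8_bytes` is an illustrative name, not GNAT's actual source). With `-gnatW8`, characters are encoded as UTF-8 and the raw bytes go out one at a time through `fputc`:

```c
/* Hypothetical sketch, not GNAT's actual source.  A UTF-8 terminal on
 * Linux or FreeBSD renders these raw bytes correctly; under the
 * Windows console with chcp 65001, this byte-wise fputc path is where
 * EOF comes back, surfacing as ADA.IO_EXCEPTIONS.DEVICE_ERROR. */
#include <stdio.h>

/* Writes a UTF-8 byte sequence with fputc; returns 0 on success,
 * -1 if fputc reports EOF (the condition behind the exception). */
int put_utf8_bytes(const char *utf8, FILE *out)
{
    for (const char *p = utf8; *p != '\0'; p++)
        if (fputc((unsigned char)*p, out) == EOF)
            return -1;
    return 0;
}
```

On a POSIX system this succeeds and the terminal shows the character; the question is why the same byte stream trips up the Windows console.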

Patrick Kelly
  • Possibly related? https://stackoverflow.com/questions/28486505/constraint-error-on-reading-a-file-containing/28487600#28487600 –  Feb 16 '18 at 15:53
  • Considering I've already got it writing correctly encoded text to a file, no, this isn't related at all. The issue has to do with the console, or Standard_Output, which has no form parameter. – Patrick Kelly Feb 16 '18 at 15:58
  • The console's support for UTF-8 is horrible. It only allows reading 7-bit ASCII if the input codepage is set to 65001, and in Windows 7 and earlier it reports the wrong number of bytes written when writing, which confuses buffered writers. The only reliable option is to use the console's native UTF-16 encoding (well, native UCS-2) -- e.g. the `ReadConsoleW` and `WriteConsoleW` [w]ide-character functions -- and possibly transcode this to UTF-8 if needed for interacting with the rest of your language/library. – Eryk Sun Feb 16 '18 at 16:02
  • The Windows operating system uses several code pages depending on the country. Java has a related problem when using the console: it converts String (Unicode) to the OS encoding, losing text to placeholder replacements (`?`). In your case you are probably seeing skewed UTF-8 sequences. – Joop Eggen Feb 16 '18 at 16:04

2 Answers


The deficiency that so many others, not just here, describe in the Windows Console Host has either been fixed or never existed in the first place. Based on this document, I suspect it was always widely misunderstood. Windows doesn't treat the console like ordinary files, and it's easy to fall into that trap.

Using this very straightforward code, along with what Windows needs and expects behind the scenes...

[screenshot of the test program's source]

It correctly produces the following, as long as either `pragma Wide_Character_Encoding (UTF8);` or `-gnatW8` is used.

[screenshot of the correct UTF-8 console output]

Piping the output of this test program into a file works as it should. So does piping it into another program, and piping that file's contents into yet another program.

Full UTF-8 behavior as one would expect under Linux, on Windows.

What needs to be done is twofold. In the package initializer, the Console Host needs to be told what it's working with, which can be done like this.

[screenshot of the package-initializer code]
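
A sketch of that initializer step in C terms (assumed, based on the `_setmode`/`_O_U8TEXT` approach discussed in the comments; `init_console_unicode` is an illustrative name): before any output, switch the CRT's `stdout` descriptor into a Unicode text mode, so console writes go through `WriteConsoleW` instead of the byte-wise path.

```c
/* Hypothetical sketch of the initializer, not the screenshot's exact
 * code.  On Windows, _setmode(_fileno(stdout), _O_U8TEXT) puts the
 * stream in a Unicode text mode: wide-character writes are encoded as
 * UTF-8 when redirected to a file or pipe, and go through
 * WriteConsoleW when attached to the console. */
#include <stdio.h>
#ifdef _WIN32
#include <fcntl.h>   /* _O_U8TEXT */
#include <io.h>      /* _setmode, _fileno */
#endif

/* Returns the previous mode on Windows, 0 elsewhere (no-op). */
int init_console_unicode(void)
{
#ifdef _WIN32
    return _setmode(_fileno(stdout), _O_U8TEXT);
#else
    return 0;        /* POSIX terminals take UTF-8 bytes directly */
#endif
}
```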

Character output is then done through `fputwc`. According to the MS docs, `fputc` should never be used for Unicode on Windows, which is part of the problem GNAT has. String output and character/string input are all similar.

[screenshot of the `fputwc`-based output code]
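
In outline, the output routine looks something like this (a hypothetical sketch; `put_wide_string` is an illustrative name, not the screenshot's exact code): once `stdout` is in a Unicode text mode, each character goes out through `fputwc` rather than `fputc`.

```c
/* Hypothetical sketch: wide-character output as the MS docs require.
 * Assumes init code (e.g. _setmode on Windows) has already put the
 * stream in the right mode. */
#include <stdio.h>
#include <wchar.h>

/* Returns 0 on success, -1 if fputwc reports WEOF. */
int put_wide_string(const wchar_t *ws, FILE *out)
{
    for (; *ws != L'\0'; ws++)
        if (fputwc(*ws, out) == WEOF)
            return -1;
    return 0;
}
```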

Patrick Kelly
  • You're attributing some behavior to Windows that's really the C runtime. Windows I/O doesn't have a distinction between text mode and binary mode. However, quirks of the C runtime notwithstanding, the console is a special kind of file in that you can't write UTF-16 to it via `WriteFile`, or read UTF-16 via `ReadFile`. As a regular file, it's restricted to text that's encoded with a legacy code page, which the console host decodes or encodes via `MultiByteToWideChar` and `WideCharToMultiByte`. There are so many bugs with how it handles this for codepage 65001, that it's simply not viable. – Eryk Sun Feb 20 '18 at 12:07
  • Setting `u8text` mode on the file isn't telling the console host anything. It's an internal matter of the runtime library. Ultimately it needs to call `WriteConsoleW` and `ReadConsoleW`. These functions are implemented via special IOCTLs (i.e. `DeviceIoControl`), rather than directly via `WriteFile` and `ReadFile`. In this case the console can handle the buffers as wide-character strings, rather than having to decode/encode them using its active output or input codepage. This could have been handled cleaner, IMO, by providing a wide-character mode enabled via `SetConsoleMode`. – Eryk Sun Feb 20 '18 at 12:19

Based on others' comments and some further research to confirm, I'm pretty sure this is a deficiency of the Windows Console Host.

edit: don't listen to this

Patrick Kelly
  • The issue with encoding text when stdout is redirected to a pipe or file has nothing to do with the console. I don't know Ada, but if it works with Unicode text by default (e.g. like Python 3 or PowerShell), then it will also have a default encoding that it uses when writing to files, which is typically either ANSI or OEM on Windows. If so, then you have to figure out how to override this for stdout. – Eryk Sun Feb 17 '18 at 00:09
  • It's ANSI for Character, UCS-2 for Wide_Character, and UCS-4 for Wide_Wide_Character, basically. Except with `pragma Wide_Character_Encoding(UTF8);` where all three are supposed to be UTF-8 encoded. That pragma is what overrides the encoding (there are other options like Shift-JIS or brackets). – Patrick Kelly Feb 17 '18 at 02:47
  • You pass `fputc` a string? Is this explicitly a wide-character string? If so, is it from a string literal in a source file, and if so, what's the source file encoding (e.g. like handling `L""` literals in C)? If a wide-character string is printed to stdout, what's the default encoding used, or does it print the wide character directly? e.g., if stdout is redirected to a file (e.g. `program.exe > stdout.txt`), does printing a wide-character string result in encoded text (e.g. ANSI or OEM) or does it write UTF-16, possibly with a BOM? – Eryk Sun Feb 17 '18 at 03:20
  • There is no `fputc` in Ada. I assume you mean `Put`. This is explicitly a wide-character string if the `Put(Content : Wide_String)` version is called. Yes it's from a string literal. And yes the file is written in UTF-8. The three remaining questions have already been answered in my original question. – Patrick Kelly Feb 19 '18 at 15:01
  • I simply grabbed `fputc` from your question, but apparently that's part of a C binding or something. I was highlighting some common moving parts in this problem. If you're running something like `program.exe > stdout.txt` and using `Put` to print a wide-character string from a string literal in a UTF-8 file, in which the compiler is configured to handle the source as UTF-8, and it still "generates emojibake in the file", then you at least know the problem is somewhere in the implementation of `Put` when writing to `stdout` redirected to a file (not a pipe). – Eryk Sun Feb 19 '18 at 15:47
  • Well the specific exception is raised because `fputc` (the c function) returns EOF when trying to print to standard output. As I understand it, in that instance that means an output error occurred, but I don't have the slightest clue why. I'll try using the `fputc` binding directly and see what happens from there. – Patrick Kelly Feb 19 '18 at 16:15
  • Directly calling `fputc` yields the same behavior. Furthermore, the equivalent code in C, whether compiled with gcc or msvc, yields the same behavior. It seems, based on the GNAT source code and Windows documentation, that the correct function isn't being called on Windows (the MS docs specifically say `fputc` doesn't support Unicode, yet GNAT calls this for Unicode), but also that the Windows Console Host is deficient. – Patrick Kelly Feb 19 '18 at 17:47
  • `fputwc` (not `fputc`) should be used for Unicode, but that's not enough. The file descriptor `_fileno(stdout)` needs to be switched via `_setmode` to either Unicode text mode (`_O_U16TEXT`, `_O_U8TEXT`) or binary mode (`_O_BINARY`). When writing to the console in Unicode text mode, the C runtime uses the wide-character function `WriteConsoleW`. – Eryk Sun Feb 19 '18 at 18:33
  • If you don't switch to Unicode text or binary mode, the file defaults to ANSI mode, which will try to convert via `wctomb_s`. If conversion fails it returns `WEOF`. If you're using the C locale, this is basically a cast from `wchar_t` to `char`, and it will fail beyond ordinal 255, so it's effectively Latin-1. Of course, writing Latin-1 to the console will be mojibake, since it will interpret the byte values according to its legacy codepage, which defaults to the OEM codepage. – Eryk Sun Feb 19 '18 at 18:48
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/165438/discussion-between-patrick-kelly-and-eryksun). – Patrick Kelly Feb 19 '18 at 20:22
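
The ANSI-mode cutoff described in the comments above can be illustrated with standard C (a hedged sketch; `narrows_ok` is an illustrative name, and glibc's C locale is shown rather than the MSVC CRT's `wctomb_s`, which behaves analogously): narrowing a `wchar_t` in the C locale is essentially a cast to `char`, so it succeeds up to ordinal 255 and fails beyond that.

```c
/* Illustrative only: in the default "C" locale, wctomb can narrow
 * ordinals that fit in one byte, but fails for anything beyond --
 * the same cutoff that makes ANSI-mode console writes return WEOF. */
#include <stdlib.h>

/* 1 if the wide character survives narrowing, 0 if conversion fails */
int narrows_ok(wchar_t wc)
{
    char buf[8];
    return wctomb(buf, wc) != -1;
}
```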