3

It is possible to write Unicode characters to the Windows console using the WriteConsoleW function. On my Windows 7 machine, it looks like the console does not support characters outside the Basic Multilingual Plane. Also, combining characters are displayed after the base character, not actually combined.

Are these limitations also present in later versions of Windows? Are there other limitations on Unicode in the Windows console?
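For reference, here is a minimal sketch of the kind of call I mean (untested as written; the sample string, with a combining accent and a non-BMP emoji spelled as an explicit surrogate pair, is just for illustration):

#include <windows.h>

int main(void)
{
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

    // "e" + combining acute accent (U+0301), then U+1F600 as a surrogate pair.
    // On my Windows 7 machine the accent is drawn in its own cell and the
    // emoji shows as two placeholder cells instead of one glyph.
    const wchar_t text[] = L"e\u0301 \xD83D\xDE00\r\n";

    DWORD written;
    // WriteConsoleW takes the count in wchar_t units, not bytes.
    WriteConsoleW(hOut, text, (DWORD)(sizeof(text) / sizeof(text[0]) - 1),
                  &written, NULL);
    return 0;
}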

user200783
  • Windows knows what encoding a window uses by how the window was created - CreateWindowExA (ANSI) or CreateWindowExW (Unicode). Windows converts from whatever the sending window uses to whatever the receiving window uses automatically. So when a Unicode character is sent to an ANSI window, the ANSI window receives an ANSI character converted from Unicode. Console windows use DOS and the same conversions occur. So `WriteConsoleW` output will be converted to a DOS character if possible. –  Jun 17 '16 at 07:25
  • However, the console can convert to any code page with the `CHCP` command, and the internal commands can output Unicode if `cmd` is started with `cmd /u`. See `chcp /?` and `cmd /?` (a sketch after this comment thread shows the corresponding API calls). –  Jun 17 '16 at 07:27
  • The Windows console does not support complete `UTF-16`. "[It's limited to UCS-2, i.e. limited to characters in the basic multilingual plane (BMP)](http://stackoverflow.com/questions/1259084/what-encoding-code-page-is-cmd-exe-using#comment45630162_17177904)". Quotation: @eryksun Feb 23 '15 at 6:22 – JosefZ Jun 17 '16 at 09:24
  • And how the characters are displayed in the console window depends on the font used for the console window and on which character sets from the entire Unicode table that font supports. – Mofi Jun 17 '16 at 09:25
  • @Noodles, there is no DOS in NT; conhost.exe is the console, not cmd.exe; and chcp.com calls `SetConsoleCP` and `SetConsoleOutputCP` to set the codepage used by `WriteConsoleA`, etc. Internally the console uses Unicode, available with the native wide-character API such as `WriteConsoleW`. One can certainly write and read non-BMP characters to and from a console buffer (e.g. for copy/paste operations). You just can't properly *display* multi-word characters such as non-BMP characters and NFD decomposed characters (but usually for the latter transforming to NFC is possible). – Eryk Sun Jun 18 '16 at 04:29
  • If you start Character Map you will see console characters with the encoding called DOS (known elsewhere as OEM); however, the user interface uses the term DOS, and that is the term you should use. –  Jun 18 '16 at 04:37
  • @Noodles, I have no problem with calling the OEM codepage the "DOS" codepage or referring to "DOS devices" (e.g. `NUL`) and conventions such as drive letters and path processing as "DOS" behaviors. NT's runtime library also refers to these as "DOS" in the RTL emulation APIs used by the Windows subsystem. I just wanted to clarify that there is no DOS code in modern Windows (unlike Windows 9x) because you said that "Console windows use DOS". As to cmd.exe (again, it's the shell, not the console), it comes from OS/2, not DOS, and the Windows port has always been Unicode on NT. – Eryk Sun Jun 18 '16 at 05:32
  • @Noodles, the "/u" switch affects how cmd.exe writes to files/pipes, not the console. For compatibility it's always defaulted to either the current console output codepage (defaults to OEM) or, if not attached to a console, the ANSI codepage when writing to a file/pipe because, especially when reading from a pipe on `stdin`, most command line programs don't detect UTF-16LE to switch to wide-character mode. But when reading and writing to the console, cmd.exe uses the wide-character APIs such as `ReadConsoleW`, `WriteConsoleW`, `FillConsoleOutputCharacterW`, and `ScrollConsoleScreenBufferW`. – Eryk Sun Jun 18 '16 at 05:43
  • Not sure why this question received -1; it seems perfectly valid to me. I wondered the same thing and came here for clarification. – Malcolm Aug 14 '16 at 09:53
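To make the code-page discussion in the comments above concrete, here is a small sketch (my own, not from any of the commenters) of querying and changing the code page used by the narrow-character console APIs, which is what `chcp` does via `SetConsoleCP`/`SetConsoleOutputCP`:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    // These code pages govern WriteConsoleA/ReadConsoleA and narrow output;
    // the `chcp` command changes the same settings.
    printf("input code page:  %u\n", GetConsoleCP());
    printf("output code page: %u\n", GetConsoleOutputCP());

    // Switching to UTF-8 (65001) only affects the narrow-character APIs;
    // WriteConsoleW is unaffected because the buffer is UTF-16 internally.
    if (SetConsoleOutputCP(CP_UTF8))
        printf("output code page is now %u\n", GetConsoleOutputCP());
    return 0;
}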

2 Answers

6

I wrote a partial answer in my answer to a different question; here is a good place for a full disclosure. My background: I maintain what is in all probability the most extensive console font which fully supports Windows (it is a very deep rewrite of Unifont with elements of DejaVu added).

I start with the limitations already mentioned in other answers:

  • Every cell contains 16 bits of character data. In other words: only UCS-2 codepoints are shown. (In particular, for a character outside the BMP, its “decomposition into UCS-2” is shown instead, using surrogate characters. The sketch after this list shows how to observe this.)

  • Only simple text rendering is supported. Even if one uses TTF fonts, no advanced “features” of the font are considered by the console. Neither advanced typography (ligatures etc.), nor even glyph composition for combining characters or right-to-left scripts¹⁾ (in an LtR environment), works as expected.

        ¹⁾ It is the application which should rearrange the characters for a correct bidi-rendering.
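Here is a minimal sketch (error handling omitted) that makes the first point visible: write a non-BMP character with WriteConsoleW and read the cells back; the buffer holds the two surrogates in two cells, not one code point.

#include <windows.h>
#include <stdio.h>

int main(void)
{
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

    CONSOLE_SCREEN_BUFFER_INFO info;
    GetConsoleScreenBufferInfo(hOut, &info);
    COORD start = info.dwCursorPosition;   // where the character will land

    // U+1F600 written as a UTF-16 surrogate pair.
    DWORD written;
    WriteConsoleW(hOut, L"\xD83D\xDE00", 2, &written, NULL);

    // Read the two cells back: they contain the surrogates.
    wchar_t cells[2];
    DWORD read;
    ReadConsoleOutputCharacterW(hOut, cells, 2, start, &read);
    printf("\ncells: 0x%04X 0x%04X\n", cells[0], cells[1]);
    return 0;
}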

Font filtering

Other limitations are due to font filtering by the console. A font must be quite special to be accepted by the console (i.e. to be shown in the font selection dialogue, and for this selection “to work”¹⁾).

    ¹⁾ I do not recall whether a font may be shown, but won’t be selectable (I have vague memory of this happening, but cannot trust this memory).

  • The font must be marked as monospaced. Due to expectations of applications,²⁾ such fonts must have all the glyphs of the same width.

        ²⁾ The latter condition is relevant only if one wants to use the font outside of the console. In principle, the console does not check the widths of the glyphs. However, every glyph is shown as if it had the “default width”. In many (all?) situations only the part of the glyph inside the “default bounding box” is going to be shown. I could not find any trick to circumvent this limitation.

  • On non-East-Asian releases of Windows, the font cannot claim that it supports any of the 4 East Asian codepages.³⁾ (The sketch after this list shows how to inspect these flags.)

        ³⁾ Note that this is only a limitation on what the font header claims — it is just 4 bits present in the header. The font may have glyphs for these languages present, and they will show fine — as long as the font does not claim the support. The codepages in question (in the Charsets section of the header’s OS/2 table) are 932, 936, 949, 950 (JIS, Simplified Chinese, Korean Wansung, Traditional Chinese).
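Both requirements can be checked programmatically. The sketch below (my own; the face name “Lucida Console” is only an example) enumerates an installed font with GDI and inspects the pitch flag and the four code-page bits in question from the font’s OS/2 table:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>

static int CALLBACK FontProc(const LOGFONTW *lf, const TEXTMETRICW *tm,
                             DWORD fontType, LPARAM lParam)
{
    (void)lParam;
    if (fontType & TRUETYPE_FONTTYPE) {
        const NEWTEXTMETRICEXW *ntm = (const NEWTEXTMETRICEXW *)tm;
        DWORD cp = ntm->ntmFontSig.fsCsb[0];

        // TMPF_FIXED_PITCH is confusingly inverted: the bit is CLEAR
        // for fixed-pitch (monospaced) fonts.
        int monospaced = !(tm->tmPitchAndFamily & TMPF_FIXED_PITCH);

        // OS/2 ulCodePageRange1 bits 17..20 correspond to 932, 936, 949, 950.
        int eastAsian = (int)((cp >> 17) & 0xF);

        wprintf(L"%ls: monospaced=%d east-asian-codepage-bits=0x%X\n",
                lf->lfFaceName, monospaced, eastAsian);
    }
    return 1;  // continue enumeration
}

int main(void)
{
    LOGFONTW lf = {0};
    lf.lfCharSet = DEFAULT_CHARSET;
    lstrcpynW(lf.lfFaceName, L"Lucida Console", LF_FACESIZE);  // example face name

    HDC hdc = GetDC(NULL);
    EnumFontFamiliesExW(hdc, &lf, (FONTENUMPROCW)FontProc, 0, 0);
    ReleaseDC(NULL, hdc);
    return 0;
}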

Bugs in font rendering

  • Although Windows’ console does not support the Underline attribute (except for DBCS codepages), the “Underline position” field of the font header is taken into account when the size of the on-screen character bbox is calculated. This may lead to an unexpected aspect ratio of the font, and/or to interruptions between glyphs which are expected to “join together”.

  • The console is very picky about the replacement glyph for “unsupported characters”. I could not find a way to make such a glyph coexist with glyphs for U+0000 and/or U+0001. (If the console finds one of the latter two glyphs in a font, it ignores the replacement glyph.)

  • (This is a very obscure bug; it requires a very technical discussion.) Another problem with the replacement glyph is the character U+30FB ・ (WHY?!). If this character is present in the font, its glyph is used as the replacement glyph — but only for missing characters in the PUA!

Essentially, this is it! I did not find any other limitation.

Ilya Zakharevich
  • U+30FB (Katakana middle dot) is the replacement character for codepage 932 (Japanese) when decoding via `MultiByteToWideChar`. I think for most codepages it's "?" instead, except UTF-8 (65001) uses U+FFFD. This can be queried via [`GetCPInfoExW`](https://docs.microsoft.com/en-us/windows/desktop/api/winnls/nf-winnls-getcpinfoexw). This substitution will occur if an application writes an invalid byte sequence to the screen buffer via `WriteFile` or ANSI APIs such as `WriteConsoleA`. Offhand, I don't know how this default character relates to GDI's rendering of the *default glyph* for the PUA. – Eryk Sun Sep 20 '18 at 14:03
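The per-codepage default character mentioned in this comment can be queried directly. A minimal sketch, assuming the listed code pages are installed on the system:

#include <windows.h>
#include <stdio.h>
#include <wchar.h>

static void show_default_char(UINT codepage)
{
    CPINFOEXW info;
    if (GetCPInfoExW(codepage, 0, &info))
        wprintf(L"%-5u %-40ls U+%04X\n",
                codepage, info.CodePageName, (unsigned)info.UnicodeDefaultChar);
}

int main(void)
{
    // Per the comment above: 932 (Japanese) reports U+30FB, typical Western
    // code pages report U+003F '?', and 65001 (UTF-8) reports U+FFFD.
    show_default_char(932);
    show_default_char(1252);
    show_default_char(CP_UTF8);
    return 0;
}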
3

Windows console is limited to Basic Multilingual Plane

Your link to the WriteConsole function says nothing about which console characters are usable:

  • lpBuffer [in] A pointer to a buffer that contains characters to be written to the console screen buffer.

But what is that buffer? A simple Google search for writeconsole lpbuffer structure gives an (indirect) link to the CHAR_INFO structure:

Syntax (C++)

typedef struct _CHAR_INFO {
  union {
    WCHAR UnicodeChar;
    CHAR  AsciiChar;
  } Char;
  WORD  Attributes;
} CHAR_INFO, *PCHAR_INFO;
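As a usage sketch (note that CHAR_INFO is actually the buffer element of WriteConsoleOutput, as the comments below point out): each cell holds a single 16-bit WCHAR plus attributes, so a non-BMP character cannot fit into one cell.

#include <windows.h>

int main(void)
{
    HANDLE hOut = GetStdHandle(STD_OUTPUT_HANDLE);

    // Two cells, each with one 16-bit WCHAR of character data.
    CHAR_INFO cells[2];
    cells[0].Char.UnicodeChar = L'A';
    cells[0].Attributes = FOREGROUND_GREEN;
    cells[1].Char.UnicodeChar = 0x00E9;   // U+00E9, a BMP character
    cells[1].Attributes = FOREGROUND_RED | FOREGROUND_INTENSITY;

    COORD bufSize  = { 2, 1 };            // 2 columns, 1 row
    COORD bufCoord = { 0, 0 };
    SMALL_RECT region = { 0, 0, 1, 0 };   // top-left 2 cells of the screen buffer
    WriteConsoleOutputW(hOut, cells, bufSize, bufCoord, &region);
    return 0;
}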

But what is WCHAR UnicodeChar? Again, a simple Google search for windows wchar gives a link to Windows Data Types:

  • WCHAR A 16-bit Unicode character. For more information, see Character Sets Used By Fonts. This type is declared in WinNT.h as follows: typedef wchar_t WCHAR;

And finally, the Character Sets Used By Fonts link above leads to the ultimate conclusion: the Windows console is limited to the Basic Multilingual Plane, i.e. to a 16-bit subset of Unicode:

Unicode Character Set

… To address the problem of multiple coding schemes, the Unicode standard for data representation was developed. A 16-bit character coding scheme, Unicode can represent 65,536 (2^16) characters, which is enough to include all languages in computer commerce today, as well as punctuation marks, mathematical symbols, and room for expansion. Unicode establishes a unique code for every character to ensure that character translation is always accurate.
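To see numerically why a single 16-bit WCHAR cannot hold every Unicode character, here is a small sketch of the standard UTF-16 surrogate-pair arithmetic (nothing console-specific):

#include <stdio.h>

int main(void)
{
    unsigned cp = 0x1F600;                 // a code point outside the BMP

    // Standard UTF-16 encoding: subtract 0x10000, split into 10+10 bits.
    unsigned v = cp - 0x10000;
    unsigned high = 0xD800 + (v >> 10);    // high (lead) surrogate
    unsigned low  = 0xDC00 + (v & 0x3FF);  // low (trail) surrogate

    printf("U+%X -> 0x%04X 0x%04X\n", cp, high, low);  // U+1F600 -> 0xD83D 0xDE00
    return 0;
}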

JosefZ
  • You're linking to the wrong function. [`CHAR_INFO`](https://msdn.microsoft.com/en-us/library/ms682013) is used by [`WriteConsoleOutput`](https://msdn.microsoft.com/en-us/library/ms687404), and similarly [`KEY_EVENT_RECORD`](https://msdn.microsoft.com/en-us/library/ms684166) is used by [`WriteConsoleInput`](https://msdn.microsoft.com/en-us/library/ms687403). [Character Sets Used By Fonts](https://msdn.microsoft.com/en-us/library/dd183415) doesn't mention the console, and it's out of date. Other than the console, Windows supports UTF-16 nowadays. – Eryk Sun Jun 18 '16 at 04:14
  • Unicode is **NOT** a 16-bit character encoding scheme, though it was initially envisaged as such. It is a 21-bit character scheme, with multiple encodings: UTF-8 and UTF-16 are the most common, though UTF-7, UTF-32, and others exist. The Windows console supports UCS-2, which is similar to UTF-16 except that it is a *subset* of Unicode. Because UCS-2 is fixed-width 16 bits, it is where the Windows Console's limitation of "only the Basic Multilingual Plane" comes from. Hope this helps clear up any confusion! – Forbin Oct 24 '19 at 12:05