
Problem

The problem arises when I try to input a Unicode character in the Python interpreter (for simplicity, I have used a-umlaut in the example, but I first encountered this with Farsi characters). Whenever I run Python under code page 65001 (chcp 65001) and then try to input even one Unicode character, Python exits without any error.

I have spent days trying to solve this problem, to no avail. Today I found a thread on the Python website, another on MySQL, and another on Lua-users in which this sudden exit was raised as an issue, although without any solution, and with some people saying that chcp 65001 is inherently broken.

It would be good to know once and for all whether this problem is chcp-design-related or there is a possible workaround.

Reproduce Error

chcp 65001

Python 3.X:

Python shell

print('ä')

result: it just exits the shell

However, this works: python.exe -c "print('ä')", and so does this: print('\u00e4')

result: ä

LuaJIT 2.0.4:

print('ä')

Result: it just exits the shell

However this works: print('\xc3\xa4')

I have come up with these observations so far:

  1. Direct output from the command prompt works.
  2. The Unicode-escape or hex-escape equivalent of the character works.

So

This is not a Python bug; rather, we can't input a Unicode character directly in CLI programs in the Windows command prompt or any of its wrappers such as ConEmu or Cmder (I am using Cmder to be able to see and use Unicode characters in the Windows shell, and I have done so without any problem). Is this correct?

Peter Mortensen
psychob
  • I have a number of Python versions installed. I couldn't reproduce on Windows 10 64-bit, with Python 3.3.5 64-bit or Python 3.5.2 64-bit, but could with Python 2.7.12 32-bit. It exits as described, but you said you were using Python 3. Perhaps it is a 32- vs. 64-bit issue? Are you using a Windows cmd.exe console or something else? – Mark Tolonen Sep 28 '16 at 02:32
  • @MarkTolonen, this is reproducible in all Windows versions when using the console (conhost; cmd is just a shell), which wasn't designed for codepage 65001. Inputting a single non-ASCII character causes an empty read, which Python's REPL and `input` handle as EOF. The problem is conhost.exe assumes it's encoding its UTF-16 input buffer to an ANSI codepage with 1 byte per character, so for non-ASCII UTF-8 its `WideCharToMultiByte` encoding buffer is too small. The read fails, but it gets returned to the client as a 'successful' read of 0 bytes, i.e. end of file. – Eryk Sun Sep 28 '16 at 03:09
  • @eryksun, Yes, I know it is broken, but still, on 64-bit Python on 64-bit Windows 10, I can type at cmd.exe `print('ä')` with the international IME and it prints out correctly. So "reproducible in all Windows" version is inaccurate, at least for this specific example. – Mark Tolonen Sep 28 '16 at 03:24
  • @MarkTolonen, do you have pyreadline installed? – Eryk Sun Sep 28 '16 at 03:25
  • @eryksun, Yes, does that hack `WriteFile` too? – Mark Tolonen Sep 28 '16 at 03:29
  • @MarkTolonen Actually I tried on both cmd and Cmder (a wrapper of ConEmu), with the same outcome in both. – psychob Sep 28 '16 at 08:47
  • @eryksun Hmm, that's what I assumed too, because it all boils down to the way the software (here Python 3.4.x) reads and writes characters, right? So, for example, if I create a bare-bones echo program in C and use the wide-character versions of the input/output functions, it should work? – psychob Sep 28 '16 at 08:51
  • BTW, @eryksun your explanations answered my question. Many Thanks. – psychob Sep 28 '16 at 08:53

1 Answer


To use Unicode in the Windows console for Python 2.7 and 3.x (prior to 3.6), install and enable win_unicode_console. This uses the wide-character functions ReadConsoleW and WriteConsoleW, just like other Unicode-aware console programs such as cmd.exe and powershell.exe. For Python 3.6, a new io._WindowsConsoleIO raw I/O class has been added. It reads and writes UTF-8 encoded text (for cross-platform compatibility with Unix -- "get a byte" -- programs), but internally it uses the wide-character API by transcoding to and from UTF-16LE.
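The transcoding step can be illustrated in plain Python. This is only a sketch of the idea, not the actual implementation of io._WindowsConsoleIO; the helper names utf8_to_utf16le and utf16le_to_utf8 are made up for this example.

```python
def utf8_to_utf16le(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16LE, the form a wide-character write takes."""
    return data.decode("utf-8").encode("utf-16-le")


def utf16le_to_utf8(data: bytes) -> bytes:
    """Transcode UTF-16LE bytes (as from a wide-character read) back to UTF-8."""
    return data.decode("utf-16-le").encode("utf-8")


utf8 = "ä".encode("utf-8")
wide = utf8_to_utf16le(utf8)
print(utf8)                   # b'\xc3\xa4' (two bytes in UTF-8)
print(wide)                   # b'\xe4\x00' (one 16-bit code unit in UTF-16LE)
print(utf16le_to_utf8(wide))  # b'\xc3\xa4'
```

The program only ever sees UTF-8 bytes, while the console only ever sees UTF-16, so the broken codepage 65001 conversion inside the console is never involved.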

The problem you're experiencing with non-ASCII input is reproducible in the console for all Windows versions up to and including Windows 10. The console host process, i.e. conhost.exe, wasn't designed for UTF-8 (codepage 65001) and hasn't been updated to support it consistently. In particular, non-ASCII input causes an empty read. This in turn causes Python's REPL to exit and built-in input to raise EOFError.
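Why an empty read ends the session can be shown with a small simulation. The read_line helper below is hypothetical and only mimics the convention that input and the REPL follow (an empty read means end of file); an empty io.StringIO stands in for the console's 0-byte read.

```python
import io


def read_line(stream):
    """Mimic input(): an empty read is interpreted as end of file."""
    line = stream.readline()
    if line == "":
        raise EOFError("empty read treated as end of file")
    return line.rstrip("\n")


# A normal read returns the typed line.
print(read_line(io.StringIO("print('a')\n")))

# An empty stream stands in for the console's 'successful' read of 0 bytes.
try:
    read_line(io.StringIO(""))
except EOFError as exc:
    print("EOFError:", exc)
```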

The problem is that conhost encodes its UTF-16 input buffer assuming a single-byte codepage, such as the OEM and ANSI codepages in Western locales (e.g. 437, 850, 1252). UTF-8 is a multibyte encoding in which non-ASCII characters are encoded as 2 to 4 bytes. To handle UTF-8 it would need to encode in multiple iterations of M / 4 characters, where M is the remaining bytes available from the N-byte buffer. Instead it assumes a request to read N bytes is a request to read N characters. Then if the input has one or more non-ASCII characters, the internal WideCharToMultiByte call fails due to an undersized buffer, and the console returns a 'successful' read of 0 bytes.
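The size mismatch is easy to see by counting bytes. The following is a rough simulation, not conhost's actual code: the hypothetical encode_into helper returns an empty result when the encoded text does not fit, standing in for the failed WideCharToMultiByte call and the resulting 0-byte read.

```python
def encode_into(text: str, encoding: str, buffer_size: int) -> bytes:
    """Encode text; return b'' if it does not fit (simulates the 0-byte read)."""
    encoded = text.encode(encoding)
    if len(encoded) > buffer_size:
        return b""
    return encoded


line = "print('ä')\r\n"
n_chars = len(line)  # the console sizes its buffer as if 1 character == 1 byte

print(n_chars)                                  # 12 characters
print(len(line.encode("cp437")))                # 12 bytes in a single-byte OEM codepage
print(len(line.encode("utf-8")))                # 13 bytes in UTF-8
print(len(encode_into(line, "cp437", n_chars)))  # 12: fits
print(encode_into(line, "utf-8", n_chars))       # b'': too small, looks like EOF
```

A single non-ASCII character is enough to push the UTF-8 length past the character count, which matches the observation that even one such character ends the session.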

You may not observe exactly this problem in Python 3.5 if the pyreadline module is installed. Python 3.5 automatically tries to import readline. In the case of pyreadline, input is read via the wide-character function ReadConsoleInputW. This is a low-level function to read console input records. In principle it should work, but in practice entering print('ä') gets read by the REPL as print(''). For a non-ASCII character, ReadConsoleInputW returns a sequence of Alt+Numpad KEY_EVENT records. The sequence is a lossy OEM encoding, which can be ignored except for the last record, which has the input character in the UnicodeChar field. Apparently pyreadline ignores the entire sequence.

Prior to Windows 8, output using codepage 65001 is also broken. It prints a trail of garbage text in proportion to the number of non-ASCII characters. In this case the problem is that WriteFile and WriteConsoleA incorrectly return the number of UTF-16 code units written to the screen buffer instead of the number of UTF-8 bytes. This confuses Python's buffered writer, leading to repeated writes of what it thinks are the remaining unwritten bytes. This problem was fixed in Windows 8 as part of rewriting the internal console API to use the ConDrv device instead of an LPC port. Older versions of Windows can use ConEmu or ANSICON to work around this bug.
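The garbage trail can be reproduced with a simulation. Both functions below are hypothetical stand-ins: buggy_console_write imitates a write call that reports character counts instead of byte counts, and write_all is a simplified version of what a buffered writer does when a write appears to be partial.

```python
def buggy_console_write(data: bytes, screen: list) -> int:
    """Simulated old console write: displays everything, but returns the
    number of UTF-16 code units instead of the number of bytes consumed."""
    text = data.decode("utf-8", errors="replace")
    screen.append(text)
    return len(text.encode("utf-16-le")) // 2


def write_all(data: bytes, screen: list) -> None:
    """Simplified buffered writer: retry whatever it believes is unwritten."""
    while data:
        written = buggy_console_write(data, screen)
        data = data[written:]


screen = []
write_all("ää\n".encode("utf-8"), screen)
print(screen)  # ['ää\n', '\ufffd\n']: the second entry is the garbage trail
```

The text is 5 bytes but only 3 characters, so the writer believes 2 bytes are still pending and writes them again. Each additional non-ASCII character widens that gap, which is why the amount of garbage grows with the number of such characters.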

Eryk Sun
  • Your description is very helpful, but one part of it is wrong: input of “exotic characters” (those not in the current keyboard layout). I wrote what I know about it in (https://stackoverflow.com/a/47843552/9106292) and (https://stackoverflow.com/a/47852866/9106292) – Ilya Zakharevich Sep 20 '18 at 03:54
  • @IlyaZakharevich, I read your posts, but I'm still not completely certain what I have wrong in this answer, unless it's simply my characterization as "non-ASCII" input as opposed to what's specifically available in the current keyboard mapping. – Eryk Sun Sep 20 '18 at 14:18
  • «For a non-ASCII character, `ReadConsoleInputW` returns a sequence of Alt+Numpad `KEY_EVENT` records.» This is wrong (or at least not completely correct ;). The described behaviour happens only for characters which are “not present” in the “main plane” of the keyboard. If a character can be accessed with (whatever complicated) combination of modifiers (but does not need prefix keypresses!), it would be faked as entered this way. – Ilya Zakharevich Sep 23 '18 at 04:07
  • The application would see a certain keypress with certain modifiers (IIRC, with at most `Shift`-`AltGr` modifiers faked; if *any* extended modifier is needed [including `Ctrl`], `AltGr` is substituted instead). – Ilya Zakharevich Sep 23 '18 at 04:10
  • @IlyaZakharevich, thank you. That's what I thought you meant. I'll look into it to characterize what the console is doing here. Previously I just tried some random non-ASCII characters and observed that they used an Alt+Numpad sequence for the best-fit OEM encoding, with the actual Unicode code point stored in the last record. – Eryk Sun Sep 23 '18 at 14:52