Running `chcp 65001` and setting `PYTHONIOENCODING=utf-8` seems to break input in the console:

>>> import sys;sys.stdout.write(u'æøåÆØÅ')

  File "<stdin>", line 1
    import sys;sys.stdout.write(u'
                                 ^
SyntaxError: EOL while scanning string literal
>>> ^Z

Here is the entire session (as an image to prevent unicode issues):

[screenshot of the console session]

Using ConEmu doesn't seem to make a difference (on the input issue):

[screenshot of the same session in ConEmu]

Did I miss a needed incantation?

This is not a duplicate of "Unicode characters in Windows command line - how?", since that question is about:

> ... this is about passing unicode command line arguments, rather than displaying text in the console. Console might not get involved at all

thebjorn
  • Possible duplicate of [Unicode characters in Windows command line - how?](https://stackoverflow.com/questions/388490/unicode-characters-in-windows-command-line-how) – ivan_pozdeev Oct 28 '17 at 02:18
  • If you need Unicode in Windows Console, you're better off using alternative consoles like Console2. – ivan_pozdeev Oct 28 '17 at 02:25
  • @ivan_pozdeev it's definitely not a duplicate of that question. I'm using ConEmu in the second screen shot, are you saying that Console2 will work better..? – thebjorn Oct 28 '17 at 05:25
  • Python 3.6 will work better and you won't need to mess with `chcp` or `PYTHONIOENCODING`. – Mark Tolonen Oct 28 '17 at 05:30
  • @MarkTolonen this is for a sync/build tool that we'll need, amongst other things, to transition our 160+Kloc codebase from Python 2, so 3.6 may be interesting in the future, but not right now.. Python 2.7.x is still supported as far as I can see..? – thebjorn Oct 28 '17 at 05:37
  • `utf8` is broken in the Windows console. I'm guessing, but the characters are probably not received as proper UTF-8, so the input sequence drops them. I had to hit an extra Enter to get the error message. Python 3.6 is the fix: it uses Windows native Unicode calls and bypasses dealing with code pages. Python 2 support ends in 2020. For the specific characters you are using, `chcp 1252` works, and you also won't need `PYTHONIOENCODING`. – Mark Tolonen Oct 28 '17 at 05:42
  • @MarkTolonen you're saying that Py2 support for unicode in the windows console is broken (if both Py3.6 and node can use the same console to output unicode). 2020 is still several years into the future and converting a large code-base to a backwards incompatible version is neither trivial nor quick. – thebjorn Oct 28 '17 at 05:45
  • @MarkTolonen is right, there's a reason the Python devs went to the trouble of using native Unicode I/O for the console in 3.6: UTF-8 in Windows is simply broken. – Mark Ransom Oct 28 '17 at 05:46
  • @MarkRansom I'm not sure how using native Unicode I/O calls to implement Unicode console I/O behavior is indicative of the console being broken for unicode..? – thebjorn Oct 28 '17 at 05:49
  • Python 3.6+ uses Win32 `WriteConsoleW` for example, instead of `WriteConsoleA`, because the UTF-8 code page is broken. – Mark Tolonen Oct 28 '17 at 05:50
  • 3.6 bypasses the standard byte-level I/O interface and replaces it with calls to native Windows functions that take UTF-16. – Mark Ransom Oct 28 '17 at 05:51
  • @MarkTolonen I thought the sentiment that cp65001 was broken had been debunked years ago (e.g. in the bug report for implementing an encoding for cp65001)? – thebjorn Oct 28 '17 at 05:53
  • @MarkRansom I'm not sure what you're trying to say or how it relates to this question..? – thebjorn Oct 28 '17 at 05:54
  • https://bugs.python.org/issue1602 status resolved in Version 3.6. Specifically, read https://bugs.python.org/issue1602#msg88077 about bugs in Win32 ReadFile and WriteFile. The ReadFile bug sounds like what you are seeing. – Mark Tolonen Oct 28 '17 at 05:59
  • It relates to this question because the Python developers had been battling UTF-8 problems in Python for years, and finally threw in the towel by coming up with an entirely different approach. That new approach will *never* be back-ported to 2.7. – Mark Ransom Oct 28 '17 at 06:04
  • @MarkTolonen I was thinking more https://bugs.python.org/issue6058 – thebjorn Oct 28 '17 at 06:06
  • @MarkRansom in that case you should make it an answer. – thebjorn Oct 28 '17 at 06:07
  • If the console's input codepage is 65001, then "æøåÆØÅ" is read as "\x00\x00\x00\x00\x00\x00" in Windows 10. Reading non-ASCII characters as ASCII NUL is actually an improvement over Windows 7 and 8, in which `ReadConsoleA` and `ReadFile` simply return 'success' with 0 characters read (i.e. EOF) if the input has non-ASCII characters. The problem is an assumption in conhost.exe that the input codepage has a constant 1 byte per character, which causes `WideCharToMultiByte` to fail in Windows 7/8. In Windows 10, the console substitutes NUL for every non-ASCII character. – Eryk Sun Oct 28 '17 at 09:08
  • In Python 2 you can get the full Unicode range if you use the `win_unicode_console` module. This calls the wide-character `ReadConsoleW` and `WriteConsoleW` functions instead of using the legacy codepage. – Eryk Sun Oct 28 '17 at 09:12
  • @eryksun `win_unicode_console` seems to work (I'm hesitant to say perfectly, but I haven't found any issues after banging on it for a while!). I'll accept it if you write it as an answer; I'm sure it would be very useful for many more than just me.. 🙌 (that's a character that survived a subprocess call to npm, with my code page set to 437 :-) – thebjorn Oct 28 '17 at 10:27
  • Regarding "🙌" (U+1F64C), assuming you're using the standard Python 2 subprocess, it's limited to ASCII for `unicode` command lines. If you used `str` with `win_unicode_console`, then it encoded it as the UTF-8 string `'\xf0\x9f\x99\x8c'`. In Python 2, subprocess then calls `CreateProcessA`, which decodes the command line as an ANSI string. For example, if the ANSI codepage is 1252, it gets decoded as `u'\xf0\u0178\u2122\u0152'` (i.e. `u"ðŸ™Œ"`). This would imply npm reads the command line encoded back to ANSI from `main` or `GetCommandLineA` and for some reason interprets it as UTF-8. – Eryk Sun Oct 28 '17 at 20:17
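
The round trip described in the last comment can be reproduced with plain string encoding and decoding, without a console or a subprocess. This is only a sketch of the byte-level reasoning: the helper names `to_mojibake` and `from_mojibake` are made up for illustration, and code page 1252 is assumed as the ANSI code page, as in the comment's example. It does not call `CreateProcessA` or npm, so it says nothing about what those actually do.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

EMOJI = "\U0001F64C"        # the character from the comments (U+1F64C)
ANSI_CODEPAGE = "cp1252"    # assumption: the ANSI code page from the comment's example


def to_mojibake(text, ansi=ANSI_CODEPAGE):
    """Encode as UTF-8, then (mis)decode the bytes as the ANSI code page."""
    return text.encode("utf-8").decode(ansi)


def from_mojibake(garbled, ansi=ANSI_CODEPAGE):
    """Encode back to the ANSI code page, then decode the bytes as UTF-8."""
    return garbled.encode(ansi).decode("utf-8")


utf8_bytes = EMOJI.encode("utf-8")   # the four UTF-8 bytes of the emoji
garbled = to_mojibake(EMOJI)         # four unrelated cp1252 characters
restored = from_mojibake(garbled)    # the original character again

print(restored == EMOJI)
```

The round trip only survives here because each of the four bytes happens to be defined in cp1252. A byte such as `0x81`, which cp1252 leaves unassigned, would raise `UnicodeDecodeError` in `to_mojibake`.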

0 Answers