
The following program is very simple. It launches a subprocess which runs a Windows port of the Unix utility less.

import subprocess
subprocess.run('less.exe', input='Macarrão é uma delícia.', encoding='utf-8')

The input is:

Macarrão é uma delícia.

The output, though, comes out as:

Macarrão é uma delícia.

What is the explanation for this? I have noticed that running chcp 65001 before running my Python code fixes the problem, but after reading a related post I'm not sure it's the best way to go about it. Quoting from the accepted answer:

chcp 65001 is very dangerous. Unless a program was specially designed to work around defects in the Windows’ API (or uses a C runtime library which has these workarounds), it would not work reliably. Win8 fixes ½ of these problems with cp65001, but the rest is still applicable to Win10.

I'm running Python 3.7.0 on Windows 10 64-bit.

bzrr
  • less.exe apparently uses the legacy console interface via `WriteFile` or `WriteConsoleA`, which depends on the current console codepage. You can temporarily switch the codepage with a try/finally statement. Get the current codepage via `GetConsoleOutputCP()`; set the new codepage via `SetConsoleOutputCP(65001)`; `try` to run less.exe; and `finally` restore the old codepage. Setting the console output codepage to UTF-8 works fine in Windows 8+. It's buggy in Windows 7 when used with buffered writers such as C `FILE` streams and Python file objects. – Eryk Sun Apr 14 '19 at 06:11
  • @eryksun I see. Could you post that as an answer with an example of how to do what you're talking about in Python? – bzrr Apr 14 '19 at 06:29
  • Use ctypes. These are simple calls, so we don't even need to define prototypes. The setup is `import ctypes;` `kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)`. Then you can call the functions I mentioned as attributes, e.g. `prev_codepage = kernel32.GetConsoleOutputCP()`. – Eryk Sun Apr 14 '19 at 06:37
  • @ErykSun Just out of curiosity, since WriteFile is not appropriate for Unicode, which function should have been used? – bzrr Dec 09 '19 at 21:17
  • `WriteFile` is generally fine for Unicode, as long as we're writing a Unicode transport format such as UTF-8 or UTF-16. It's just encoded bytes. – Eryk Sun Dec 09 '19 at 22:59
  • That said, with the console in particular, its input and screen buffers are UTF-16 (actually UCS-2), so UTF-8 has to be transcoded to UTF-16. Prior to Windows 8, when writing UTF-8 to a screen buffer, the console would mistakenly return the number of UTF-16 codes written instead of the number of UTF-8 bytes written, which is dysfunctional for a buffered writer such as a C `FILE` stream. For reading input it's worse, even in Windows 10. The console does not support encoding non-ASCII characters as UTF-8, so the input codepage should never be set to UTF-8 via `SetConsoleCP(CP_UTF8)`. – Eryk Sun Dec 09 '19 at 23:02
  • The alternative for the best possible Unicode support is to use the console's wide-character API, such as `ReadConsoleW` and `WriteConsoleW`. This API works with the native UTF-16 strings of Windows, so it avoids the problems the console has when transcoding UTF-8 to UTF-16. Bear in mind that the console is still primarily limited to UCS-2 -- i.e. the basic multilingual plane, non-complex scripts, and precomposed characters. The new Terminal application in Windows has broader Unicode support, including non-BMP characters such as emoji. – Eryk Sun Dec 09 '19 at 23:07
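The two effects described in the comments above can be reproduced without a console. The following is only a rough sketch: it assumes the console's legacy codepage is cp437 (the asker's machine may well use a different OEM codepage such as cp850), and the helper names `mojibake` and `utf16_code_units` are made up for illustration. It shows how UTF-8 bytes look when a legacy codepage is used to decode them, and why a count of UTF-16 code units differs from a count of UTF-8 bytes.

```python
TEXT = "Macarrão é uma delícia."


def mojibake(text, console_codepage="cp437"):
    """Encode text as UTF-8, then decode the bytes with a legacy codepage.

    cp437 is an assumed codepage; substitute the one your console reports.
    """
    return text.encode("utf-8").decode(console_codepage, errors="replace")


def utf16_code_units(text):
    """Count the 16-bit code units needed to store text as UTF-16."""
    return len(text.encode("utf-16-le")) // 2


if __name__ == "__main__":
    # Each accented letter is two UTF-8 bytes, so it shows up as two characters.
    print(mojibake(TEXT))
    # The byte count and the code-unit count differ for non-ASCII text.
    print(len(TEXT.encode("utf-8")), utf16_code_units(TEXT))
```

A buffered writer that is told "23 units written" after handing over 26 bytes will conclude that the write was incomplete, which is the pre-Windows 8 problem mentioned above.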

2 Answers


As suggested by Eryk Sun, one way is to set the console output codepage to UTF-8, run less.exe, and then restore the previous codepage.

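import subprocess
from ctypes import windll

# Remember the current console output codepage so it can be restored.
prev_codepage = windll.kernel32.GetConsoleOutputCP()
windll.kernel32.SetConsoleOutputCP(65001)  # 65001 = UTF-8
try:
    subprocess.run("less.exe", input='Macarrão é uma delícia', encoding='utf-8')
finally:
    # Restore the previous codepage even if less.exe fails to launch.
    windll.kernel32.SetConsoleOutputCP(prev_codepage)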
bzrr
  • But in general do not use `windll` in libraries. It caches `WinDLL` instances, which cache function pointer instances, which leads to prototype conflicts between packages. It also doesn't allow setting `use_last_error=True` for getting the last error *reliably* via `ctypes.get_last_error()`. – Eryk Sun Apr 14 '19 at 07:09
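Following the advice in the comment above, here is a minimal sketch of the same idea using a private `ctypes.WinDLL` instance with `use_last_error=True`, wrapped in a context manager so the previous codepage is restored even when an exception is raised. The names `console_output_codepage` and `page_with_less` are hypothetical, and treating a return value of 0 from `GetConsoleOutputCP()` as "no console attached" is an assumption of this sketch. On platforms other than Windows the context manager does nothing.

```python
import contextlib
import ctypes
import subprocess
import sys

CP_UTF8 = 65001


@contextlib.contextmanager
def console_output_codepage(codepage):
    """Temporarily switch the console output codepage (Windows only).

    Yields the previous codepage, or None if nothing was changed.
    """
    if sys.platform != "win32":
        yield None
        return
    # A private WinDLL instance avoids the shared cache behind ctypes.windll.
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    previous = kernel32.GetConsoleOutputCP()
    if not previous:
        # Assumption: 0 means the process has no console attached.
        yield None
        return
    if not kernel32.SetConsoleOutputCP(codepage):
        raise ctypes.WinError(ctypes.get_last_error())
    try:
        yield previous
    finally:
        kernel32.SetConsoleOutputCP(previous)


def page_with_less(text, pager="less.exe"):
    """Send text to the pager while the console output codepage is UTF-8."""
    with console_output_codepage(CP_UTF8):
        return subprocess.run(pager, input=text, encoding="utf-8")
```

Calling `page_with_less('Macarrão é uma delícia.')` would then replace the four statements in the answer, assuming less.exe is on the PATH.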

To complement your own answer with a simpler alternative - although I don't understand why it works; tested on Windows 10 with Python 3.8:

import os
os.system('echo Macarrão é uma delícia.| less.exe')

On Windows, os.system() calls cmd.exe (located via the ComSpec environment variable), and even though a cmd.exe instance created this way still reports the system's OEM code page as the active code page, the command works as desired.
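One plausible reading, offered as an assumption rather than a confirmed explanation, is that echo writes the text to the pipe in the console's active code page and less.exe passes those bytes back to the console unchanged, so the encoding and decoding sides agree. If that is right, the same effect should be reachable from `subprocess.run` by encoding the input with the console's output code page instead of UTF-8. The sketch below tries that; `console_output_encoding` and `page_in_console_encoding` are hypothetical names, and characters that the code page cannot represent are replaced rather than raising an error.

```python
import codecs
import ctypes
import subprocess
import sys


def console_output_encoding(default="utf-8"):
    """Return a Python codec name for the console output code page.

    Falls back to `default` off Windows, or when no matching codec exists
    (for example when no console is attached and the API returns 0).
    """
    if sys.platform != "win32":
        return default
    kernel32 = ctypes.WinDLL("kernel32", use_last_error=True)
    name = f"cp{kernel32.GetConsoleOutputCP()}"
    try:
        codecs.lookup(name)
    except LookupError:
        return default
    return name


def page_in_console_encoding(text, pager="less.exe"):
    """Pipe text to the pager, encoded in the console's own code page."""
    return subprocess.run(
        pager,
        input=text,
        encoding=console_output_encoding(),
        errors="replace",  # unrepresentable characters become '?'
    )
```

Unlike switching the console to code page 65001, this leaves the console state untouched, at the cost of losing any characters outside the active code page.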

mklement0