7

I'm trying to redirect the output of a Python script to a file. When the output contains non-ASCII characters, this works on macOS and Linux but not on Windows.

I've reduced the problem to a simple test consisting of a single print call. The following is what the Windows command prompt window shows.

Microsoft Windows [Version 10.0.17134.472]
(c) 2018 Microsoft Corporation. All rights reserved.

D:\>set PY
PYTHONIOENCODING=utf-8

D:\>type pipetest.py
print('\u0422\u0435\u0441\u0442')

D:\>python pipetest.py
Тест

D:\>python pipetest.py > test.txt

D:\>type test.txt
Тест

D:\>type test.txt | iconv -f utf-8 -t utf-8
Тест

D:\>set PYTHONIOENCODING=

D:\>python pipetest.py
Тест

D:\>python pipetest.py > test.txt
Traceback (most recent call last):
  File "pipetest.py", line 1, in <module>
    print('\u0422\u0435\u0441\u0442')
  File "C:\Python\Python37\lib\encodings\cp1252.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_table)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 0-3: character maps to <undefined>

D:\>python -V
Python 3.7.2

As one can see, setting the PYTHONIOENCODING environment variable helps, but I don't understand why it needs to be set. When the output goes to the terminal it works, but when the output goes to a file it fails. Why is cp1252 used when stdout is not a console?

Maybe it is a bug that can be fixed in the Windows version of Python?

sancoder
  • Windows Python defaults to the system ANSI encoding for text files. Except if the file is a console, 3.6+ uses the console's Unicode (UTF-16) API and pretends that it's UTF-8 for the `buffer` and `raw` interfaces. – Eryk Sun Dec 31 '18 at 05:17
  • In Windows 10, you can configure the system ANSI/OEM codepages as UTF-8 (65001). This wasn't possible in previous versions. – Eryk Sun Dec 31 '18 at 05:18
  • I don't see how `iconv -f utf-8 -t utf-8` could produce the correct output. What are the bytes in the file and which encoding produces the output you see from `type`? (We can deduce one given the other.) – tripleee Dec 31 '18 at 05:20
  • Windows Python should not need to know whether it is writing to a console or the output is redirected or piped. Using the Unicode API (functions ending in W, e.g. CreateFileW), it is possible to write the full range of Unicode characters. Changing the system ANSI/OEM codepage to UTF-8 (cp65001) does indeed help, but it is marked beta and not really for production. – sancoder Jan 02 '19 at 16:00
  • Does this answer your question? [UnicodeEncodeError in python3 when redirection is used](https://stackoverflow.com/questions/59779618/unicodeencodeerror-in-python3-when-redirection-is-used) – K3---rnc Jan 20 '21 at 13:08
  • @K3---rnc No, the referenced answer is just a workaround that uses an environment variable; it does not explain the root cause of the problem. That's why I raised the question: why is there a need for the PYTHONIOENCODING variable? Why does Python on Windows need to know the encoding? Isn't there a Unicode API in Windows somewhere? – sancoder Jan 21 '21 at 17:09

2 Answers

6

According to the Python documentation, the Windows version uses different character encodings for the console device (utf-8) and for non-character devices such as disk files and pipes (the system locale code page). PYTHONIOENCODING can be used to override this.

https://docs.python.org/3/library/sys.html#sys.stdout

Another method is to change the encoding directly in the program (Python 3.7+). I tried it and it works fine.

import sys
sys.stdout.reconfigure(encoding='utf-8')

https://docs.python.org/3/library/io.html#io.TextIOWrapper.reconfigure
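As a slightly more defensive variant of the call above, here is a minimal sketch that only reconfigures the stream when it actually supports `reconfigure()`. The helper name `force_utf8` is hypothetical (it is not part of Python), and the sketch assumes Python 3.7 or later.

```python
import sys


def force_utf8(stream):
    """Switch a text stream to UTF-8 if it offers reconfigure() (Python 3.7+).

    Returns True when the stream was reconfigured, False otherwise.
    `force_utf8` is an illustrative helper, not a standard-library function.
    """
    reconfigure = getattr(stream, "reconfigure", None)
    if reconfigure is None:
        # Not an io.TextIOWrapper (for example, stdout was replaced by
        # some other object), so leave it untouched.
        return False
    reconfigure(encoding="utf-8")
    return True


if __name__ == "__main__":
    force_utf8(sys.stdout)
    print("\u0422\u0435\u0441\u0442")
```

Calling the helper once at the top of the script should make `python pipetest.py > test.txt` produce a UTF-8 file regardless of PYTHONIOENCODING, at the cost of hard-coding the encoding in the program.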

Eric Leung
0

Python has to write bytes to stdout, not strings, hence the need for an encoding.
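To illustrate the point, the sketch below reproduces the failure from the question without any redirection. It uses an in-memory `io.BytesIO` as a stand-in for the redirected file and wraps it in an `io.TextIOWrapper`, which is the same kind of object as `sys.stdout`. The function name `try_write` is made up for this example.

```python
import io

text = "\u0422\u0435\u0441\u0442"  # the Cyrillic word from the question


def try_write(encoding):
    """Write `text` through a text wrapper that uses `encoding`.

    Returns the bytes that reached the underlying binary stream,
    or the UnicodeEncodeError if the text could not be encoded.
    """
    raw = io.BytesIO()  # stands in for the redirected file
    writer = io.TextIOWrapper(raw, encoding=encoding)
    try:
        writer.write(text)
        writer.flush()
    except UnicodeEncodeError as exc:
        return exc
    return raw.getvalue()


if __name__ == "__main__":
    print(repr(try_write("utf-8")))   # eight bytes of UTF-8
    print(repr(try_write("cp1252")))  # a UnicodeEncodeError instance
```

cp1252 has no mapping for Cyrillic letters, so the second call fails in the same way as the traceback in the question, while UTF-8 can encode any string.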

The encoding (used to convert strings into bytes) is determined differently on each platform:

  • on Linux and macOS it comes from the current locale;
  • on Windows, when stdout is not a console, it is the "Current language for non-Unicode programs" setting (the codepage set in the command line window is irrelevant).
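One way to check which encoding applies on a given machine is to print the relevant values from inside the script. The following is a rough diagnostic sketch; `describe_stream` is a hypothetical helper name, and the values it reports depend on the platform, the Python version, and environment variables such as PYTHONIOENCODING.

```python
import locale
import sys


def describe_stream(stream):
    """Collect the facts that decide how text written to `stream` is encoded."""
    return {
        "platform": sys.platform,
        # True when the stream is attached to a terminal/console.
        "is_console": stream.isatty(),
        # Encoding the stream actually uses (None if it has none).
        "encoding": getattr(stream, "encoding", None),
        # Locale encoding that Python falls back to for text files.
        "locale_encoding": locale.getpreferredencoding(False),
    }


if __name__ == "__main__":
    # Report on stderr so the output stays visible when stdout is redirected.
    for key, value in describe_stream(sys.stdout).items():
        print(f"{key}: {value}", file=sys.stderr)
```

Running it once as `python script.py` and once as `python script.py > out.txt` should show whether the stdout encoding changes when the output is redirected.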

(Thanks to @Eric Leung for precise link)

The follow-up question would be why Python on Windows uses the system locale for non-Unicode programs rather than the codepage set by the chcp command, but I will leave that for someone else.

It should also be mentioned that there is a checkbox titled "Beta: Use Unicode UTF-8..." in Region Settings on Windows 10 (to open it, press Win+R and type intl.cpl). With the checkbox checked, the above example works without error. However, this checkbox is off by default and buried deep in the system settings.

sancoder