What is the default console encoding on Windows? It seems like sometimes it is the ANSI encoding (CP-1252), sometimes it is the OEM encoding (CP-850 for Western Europe by default) given by the chcp command.

  • Command-line arguments and environment variables trigger the ANSI encoding (é = 0xe9):

    > chcp 850
    Active code page: 850
    > python -c "print 'é'"
    Ú
    > python -c "print '\x82'"
    é
    > python -c "print '\xe9'"
    Ú
    > $env:foobar="é"; python -c "import os; print os.getenv('foobar')"
    Ú
    
    > chcp 1252
    Active code page: 1252
    > python -c "print 'é'"
    é
    > python -c "print '\x82'"
    ,
    > python -c "print '\xe9'"
    é
    > $env:foobar="é"; python -c "import os; print os.getenv('foobar')"
    é
    
  • Python console and standard input trigger the OEM encoding (é = 0x82 if the OEM encoding is CP-850, é = 0xe9 if the OEM encoding is CP-1252):

    > chcp 850
    Active code page: 850
    > python
    >>> print 'é'
    é
    >>> print '\x82'
    é
    >>> print '\xe9'
    Ú
    > python -c "print raw_input()"
    é
    é
    
    > chcp 1252
    Active code page: 1252
    > python
    >>> print 'é'
    é
    >>> print '\x82'
    ,
    >>> print '\xe9'
    é
    > python -c "print raw_input()"
    é
    é
    

Note: In these examples, I used PowerShell 5.1 and CPython 2.7.14 on Windows 10.

Géry Ogam

1 Answer

First of all, what matters for all your non-ASCII characters is your console encoding and your Windows locale settings. You are using byte strings, and Python just prints out the bytes it received. Your keyboard input is encoded to a specific byte or byte sequence by the console before those bytes are passed on to Python. To Python, this is all just opaque data (numbers in the range 0-255), and print passes those bytes back to the console exactly as Python received them.

In PowerShell, the encoding used for the bytes sent to Python via command-line arguments is determined not by the chcp codepage but by the Language for non-Unicode programs setting in the Control Panel (search for Region, then open the Administrative tab). It is this setting that encodes é to 0xE9 before passing it to Python as a command-line argument. A large number of Windows codepages use 0xE9 for é, but there is no single encoding called ANSI.

The same applies to environment variables. Python refers to the encoding Windows uses here as the MBCS codec; you can decode command-line parameters or environment variables to Unicode using the 'mbcs' codec, which calls the MultiByteToWideChar() and WideCharToMultiByte() Windows API functions with the CP_ACP code page identifier.
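
As a rough sketch of the two paragraphs above, the snippet below asks Windows which codepages are in effect and decodes a byte string with the 'mbcs' codec. It is written for Python 3 so that it runs on current interpreters; the helper names `windows_code_pages` and `decode_acp` are made up for illustration, and the `cp1252` fallback used off Windows is only an assumption so the snippet still runs elsewhere.

```python
import sys


def windows_code_pages():
    """Return the code pages Windows reports, or None when not on Windows."""
    if sys.platform != "win32":
        return None
    import ctypes  # imported lazily: ctypes.windll exists only on Windows

    kernel32 = ctypes.windll.kernel32
    return {
        "acp": kernel32.GetACP(),        # "Language for non-Unicode programs"
        "oemcp": kernel32.GetOEMCP(),    # default OEM code page
        "console_input": kernel32.GetConsoleCP(),         # changed by chcp
        "console_output": kernel32.GetConsoleOutputCP(),  # changed by chcp
    }


def decode_acp(raw, fallback="cp1252"):
    """Decode bytes with the 'mbcs' codec (CP_ACP) on Windows.

    Off Windows the 'mbcs' codec does not exist, so a fallback codec
    (an assumption, not something Windows chose) is used instead.
    """
    codec = "mbcs" if sys.platform == "win32" else fallback
    return raw.decode(codec)


print(windows_code_pages())
print(ascii(decode_acp(b"\xe9")))  # the result depends on the active code page
```

On a machine configured like the one in the question, one would expect `acp` to be 1252 and the console values to follow whatever chcp last set.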

When you use the interactive prompt, Python is passed bytes encoded with the console codepage, which is set with chcp. For you that is codepage 850, so Python receives a byte with the hex value 0x82 when you type é. Because print sends the same 0x82 byte back to the same console, the console translates 0x82 back to an é character on the screen.
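
The byte pass-through described so far can be simulated without a Windows console. This is a minimal sketch in Python 3 (the question uses Python 2); the function name `seen_on_console` is hypothetical, and it only models "encode with one codepage, display with another", not the console itself.

```python
def seen_on_console(text, source_codec, console_codec):
    """Encode text as the sender would, then render the bytes as the console would.

    Python 2's print on a byte string changes nothing in between, so the
    displayed character depends only on the two code pages involved.
    """
    raw = text.encode(source_codec)      # bytes handed to Python
    shown = raw.decode(console_codec)    # what the console draws for those bytes
    return raw, shown


cases = [
    ("cp1252", "cp850"),   # command-line argument, console set to chcp 850
    ("cp850", "cp850"),    # typed at the interactive prompt under chcp 850
    ("cp1252", "cp1252"),  # command-line argument, console set to chcp 1252
]
for source, console in cases:
    raw, shown = seen_on_console("é", source, console)
    print(source, "->", console, raw, "U+%04X" % ord(shown))
```

The first case yields byte 0xE9 displayed as U+00DA (Ú), matching the `python -c "print 'é'"` result under chcp 850; the other two round-trip to U+00E9 (é).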

Only when you use Unicode text (with a unicode string literal like u'é') does Python do any decoding and encoding of the data. print writes to sys.stdout, which is configured to encode Unicode data to the current locale (or to PYTHONIOENCODING if set). So print u'é' writes that Unicode object to sys.stdout, which encodes it to bytes using the configured codec, and those bytes are then written to the console.
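
To make the encoding step visible, the sketch below wraps an in-memory byte buffer in a text stream, which is roughly what sys.stdout does around the console handle. It uses Python 3, where every str is Unicode text and so stands in for Python 2's unicode type; the helper name `bytes_sent_to_console` is an assumption for illustration.

```python
import io


def bytes_sent_to_console(text, encoding):
    """Return the bytes a text stream with this encoding would emit for text."""
    raw = io.BytesIO()
    # newline="" disables newline translation so only the codec is at work.
    stream = io.TextIOWrapper(raw, encoding=encoding, newline="")
    stream.write(text)
    stream.flush()  # push the encoded bytes into the underlying buffer
    return raw.getvalue()


for codec in ("cp850", "cp1252", "utf-8"):
    print(codec, bytes_sent_to_console("é", codec))
# cp850 b'\x82'
# cp1252 b'\xe9'
# utf-8 b'\xc3\xa9'
```

The same Unicode character becomes a different byte sequence depending on the stream's codec, which is why the output looks right only when that codec agrees with the console codepage.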

To produce the unicode object from the u'é' source code text (itself a sequence of bytes), Python does have to decode the source code given. For the -c command line, the bytes that are passed in are decoded as Latin-1. In the interactive console, the locale is used. So python -c "print u'é'" and print u'é' in the interactive session will result in different output.
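
Why the choice of decoder matters can be shown on single bytes. The snippet below does not reproduce Python 2's tokenizer; it is only a sketch (in Python 3, with a hypothetical helper `code_point`) of how one source byte maps to different code points under Latin-1 and under two console codepages.

```python
def code_point(raw_byte, codec):
    """Decode a single byte with codec and return its code point as 'U+XXXX'."""
    return "U+%04X" % ord(raw_byte.decode(codec))


for raw_byte in (b"\xe9", b"\x82"):
    for codec in ("latin-1", "cp850", "cp1252"):
        print(raw_byte, codec, code_point(raw_byte, codec))
# b'\xe9' latin-1 U+00E9
# b'\xe9' cp850 U+00DA
# b'\xe9' cp1252 U+00E9
# b'\x82' latin-1 U+0082
# b'\x82' cp850 U+00E9
# b'\x82' cp1252 U+201A
```

Latin-1 maps every byte to the code point with the same number, so it agrees with a codepage only where that codepage happens to match it.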

It should be noted that Python 3 uses Unicode strings throughout, and command-line parameters and environment variables are loaded into Python with the Windows 'wide' APIs to access the data as UTF-16, then presented as Unicode string objects. You can still access console data and filesystem information as byte strings, but as of Python 3.6, accessing the filesystem and stdin/stdout/stderr streams as binary uses UTF-8 encoded data (again using the 'wide' APIs).
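
A quick way to confirm this on a Python 3 interpreter is to inspect the types it hands you. The helper name `text_types` below is made up for this sketch; it only reports what it finds and makes no claim about how the values were obtained.

```python
import os
import sys


def text_types():
    """Report the Python types used for command-line arguments and environment values."""
    return {
        "argv": {type(arg).__name__ for arg in sys.argv},
        "environ": {type(value).__name__ for value in os.environ.values()},
    }


print(text_types())
print(sys.stdout.encoding)  # the codec used when text is written to stdout
```

On Python 3 both sets should contain only `str`, meaning the values are already Unicode text rather than bytes in some codepage.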

Martijn Pieters
  • Thank you. I have updated my question to include an example with an environment variable and to replace the xxd (hex dump program included in Vim) example with a pure Python example. – Géry Ogam Mar 06 '18 at 22:42
  • @Maggyero: please don't move the goal posts. Windows passes environment variables to a process using the same codec as command-line arguments. – Martijn Pieters Mar 07 '18 at 11:18
  • Great edit, I'll accept your answer but before I need to finish some tests (to make sure I understand everything) and maybe to edit the parts of your post that I didn't get right away. – Géry Ogam Mar 07 '18 at 16:24
  • What is the `chg` codepage that you are talking about in the second paragraph? Is it a typo? – Géry Ogam Mar 09 '18 at 07:10
  • @Maggyero yes, that should be `chcp`. Corrected. – Martijn Pieters Mar 09 '18 at 08:24
  • Windows itself executes a process only with a Unicode command line and environment in the Process Environment Block (PEB). Python 2 relies on the within-process C runtime library to parse/provide the legacy ANSI-encoded command-line and environment from `GetCommandLineA` and `GetEnvironmentStringsA`. You can install [win_unicode_console](https://pypi.python.org/pypi/win_unicode_console) to use the console's Unicode interface and replace `sys.argv` with parsed arguments from the native Unicode command line. – Eryk Sun Mar 25 '18 at 22:13
  • Also, the console that Python may inherit from its parent (CMD, PowerShell), if the parent is attached to a console, has nothing otherwise to do with the parent process. It's an OS resource provided by files on the ConDrv device (or fake files and an LPC port prior to Windows 8) and an instance of the console host process, conhost.exe (or csrss.exe prior to Windows 7). – Eryk Sun Mar 25 '18 at 22:15