13

I'm working on a Python application that prints text in multiple languages to the console on multiple platforms. The program works well on all UNIX platforms, but on Windows there are errors printing Unicode strings on the command line.

There's already a relevant thread regarding this: ( Windows cmd encoding change causes Python crash ) but I couldn't find my specific answer there.

For example, for the following Asian text, in Linux, I can run:

>>> print u"\u5f15\u8d77\u7684\u6216".encode("utf-8")
引起的或

But on Windows I get:

>>> print u"\u5f15\u8d77\u7684\u6216".encode("utf-8")
σ╝ץΦ╡╖τתהµטצ
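
The garbled output is what the UTF-8 bytes look like when the console decodes them one byte at a time with a legacy code page. A minimal sketch reproducing the effect, assuming Python 3 and assuming the console was using code page 862 (an inference from the Hebrew letters in the output, not something stated in the question):

```python
# Sketch (Python 3). cp862 is an assumption inferred from the Hebrew
# letters in the garbled output; other consoles use other code pages.
text = "\u5f15\u8d77\u7684\u6216"

utf8_bytes = text.encode("utf-8")       # the bytes that print sends to the console
mojibake = utf8_bytes.decode("cp862")   # the same bytes read as single-byte cp862

# Each UTF-8 byte becomes one unrelated character on screen.
print(len(text), len(utf8_bytes), len(mojibake))
```

The encoding step is lossless; only the console's interpretation of the bytes is wrong, which is why the same code works on a UTF-8 terminal.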

I succeeded in displaying the correct text with a message box by doing something like this:

>>> file("bla.vbs", "w").write(u'MsgBox "\u5f15\u8d77\u7684\u6216", 4, "MyTitle"'.encode("utf-16"))
>>> os.system("cscript //U //NoLogo bla.vbs")
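
A hedged Python 3 sketch of the same workaround. The helper name `write_msgbox_script` is hypothetical, and doubling embedded quotes follows VBScript's string-escaping rule. The `cscript` call is attempted only on Windows:

```python
import os
import subprocess
import sys
import tempfile

def write_msgbox_script(path, message, title="MyTitle"):
    """Write a VBScript file that shows `message` in a message box.

    Hypothetical helper: the file is saved as UTF-16 with a BOM,
    as in the original snippet.
    """
    def quote(s):
        # VBScript escapes a double quote inside a string by doubling it.
        return '"' + s.replace('"', '""') + '"'

    script = "MsgBox %s, 4, %s" % (quote(message), quote(title))
    with open(path, "w", encoding="utf-16") as f:
        f.write(script)
    return script

script_path = os.path.join(tempfile.mkdtemp(), "bla.vbs")
script_text = write_msgbox_script(script_path, "\u5f15\u8d77\u7684\u6216")

if sys.platform == "win32":
    # Same invocation as the os.system call above.
    subprocess.call(["cscript", "//U", "//NoLogo", script_path])
```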

However, I want to be able to do it in the Windows console, preferably without requiring too much configuration outside my Python code (because my application will be distributed to many hosts).

Is this possible?

Edit: If it's not possible, I would be happy to accept other suggestions for writing a console application on Windows that displays Unicode, e.g. a Python implementation of an alternative Windows console.

yonix
  • 2
    simple answer is no. Python output is byte oriented but windows uses UCS2 and the two don't mix. It's a big problem but Python is not alone in not playing nice with windows console. – David Heffernan Jul 17 '11 at 16:56
  • 4
    Intuitively I'd say that the `encode` to UTF-8 is rubbish on Windows. All Windows API calls are Unicode-oriented and use UTF-16; the UTF-8 conversion sounds like the right thing to do on Linux with a UTF-8 locale but that's just because the output happens to resemble what the system then accepts as text. Interestingly, just printing the Unicode string complains about unconvertible characters, despite the console being perfectly capable of printing those characters (even though it might not have a suitable glyph in Lucida Console or Consolas). – Joey Jul 17 '11 at 17:06
  • After reading that issue posted by eryksun, I must say that the Windows console really needs to just be Unicode. Do away with the code pages and use proper encodings. It would make things so much easier for programmers. Cross-platform incompatibilities are to be expected, but not something as simple as the console... –  Jul 17 '11 at 18:48
  • 3
    @chrono The Windows console is Unicode and has been since NT was released nearly 20 years ago. There are no code pages and locales. It uses proper encodings. The problem is that Python expects a *nix type environment and has not adapted to Windows. The problems and limitations are all with Python. – David Heffernan Jul 17 '11 at 19:18
  • So why is it that I can paste Unicode text into a Windows console and have it be accepted by a Python program even if the text doesn't properly display inside the console? (Haven't tried this on XP, but I know it works on Vista and 7.) – JAB Jul 20 '11 at 14:37
  • 1
    @David Heffernan I'm afraid you're partially incorrect there. There are some major limitations in how programs interact with the console. WriteFile and the CRT have issues with Unicode. The default font on the console window doesn't handle Unicode characters. (http://blogs.msdn.com/b/michkap/archive/2011/06/08/10172411.aspx) – jveazey Aug 13 '11 at 10:25
  • @jveazey regarding the font are you talking about 10 year old xp? – David Heffernan Aug 15 '11 at 15:18
  • @David Heffernan Nope. I tested on Windows 7 x64 SP1 using Visual C++ 2010. – jveazey Aug 15 '11 at 20:43
  • @jveazey I'm surprised by that. Which font is in use in your console? – David Heffernan Aug 16 '11 at 10:08
  • @David Heffernan The default is "Raster Fonts". If you switch to Consolas, it works correctly. You really should read this article... (http://blogs.msdn.com/b/michkap/archive/2011/06/08/10172411.aspx) Here's a relevant quote "The Microsoft Visual C Runtime DLL console functions have been broken for most of that time; they started to get fixed in the VS2005 timeframe and have slowly been getting better and better though even the latest versions (VS2010 and Windows 7) are still not totally working right;" – jveazey Aug 16 '11 at 17:04
  • @jveazey I would point out that MSVCRT != Win32. According to Kaplan Windows Unicode console APIs have been fine since NT4. I always change my console font to Consolas - too bad modern Windows can't default to that. – David Heffernan Aug 16 '11 at 17:16
  • 1
    @David Heffernan That's why my initial statement was "partially incorrect". The console functions _do_ work, but there are still numerous Unicode issues with the console, in general. WriteFile, ReadFile, CRT, Powershell, redirected handles, default font and others. – jveazey Aug 16 '11 at 18:51
  • @jveazey & @DavidHeffernan, Windows 8 fixes many of the console problems. It uses a console device now (NT path `\Device\ConDrv`) instead of only using the LPC connection to conhost.exe, so many LPC-based server calls such as `CloseConsoleHandle` aren't needed anymore. Instead it makes system calls such as `NtWriteFile` (bye bye LPC shared heap). They also stopped making naive codepage assumptions in conhost.exe, so I had none of the problems with codepage 65001 that I have on Windows 7. Font support and the window are still horrible, but use ConEmu to replace the ugly old window. – Eryk Sun Oct 08 '14 at 18:41

5 Answers

3

There's a WriteConsoleW solution that provides a unicode argv and stdout (print) but not stdin: Windows cmd encoding change causes Python crash

The only thing I modified is sys.argv to keep it unicode. The original version utf-8 encoded it for some reason.

#!/usr/bin/env python
# -*- coding: utf-8 -*-

""" https://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash#answer-3259271
"""

import sys

if sys.platform == "win32":
    import codecs
    from ctypes import WINFUNCTYPE, windll, POINTER, byref, c_int
    from ctypes.wintypes import BOOL, HANDLE, DWORD, LPWSTR, LPCWSTR, LPVOID

    original_stderr = sys.stderr

    # If any exception occurs in this code, we'll probably try to print it on stderr,
    # which makes for frustrating debugging if stderr is directed to our wrapper.
    # So be paranoid about catching errors and reporting them to original_stderr,
    # so that we can at least see them.
    def _complain(message):
        print >>original_stderr, message if isinstance(message, str) else repr(message)

    # Work around <http://bugs.python.org/issue6058>.
    codecs.register(lambda name: codecs.lookup('utf-8') if name == 'cp65001' else None)

    # Make Unicode console output work independently of the current code page.
    # This also fixes <http://bugs.python.org/issue1602>.
    # Credit to Michael Kaplan <http://www.siao2.com/2010/04/07/9989346.aspx>
    # and TZOmegaTZIOY
    # <https://stackoverflow.com/questions/878972/windows-cmd-encoding-change-causes-python-crash/1432462#1432462>.
    try:
        # <http://msdn.microsoft.com/en-us/library/ms683231(VS.85).aspx>
        # HANDLE WINAPI GetStdHandle(DWORD nStdHandle);
        # returns INVALID_HANDLE_VALUE, NULL, or a valid handle
        #
        # <http://msdn.microsoft.com/en-us/library/aa364960(VS.85).aspx>
        # DWORD WINAPI GetFileType(HANDLE hFile);
        #
        # <http://msdn.microsoft.com/en-us/library/ms683167(VS.85).aspx>
        # BOOL WINAPI GetConsoleMode(HANDLE hConsole, LPDWORD lpMode);

        GetStdHandle = WINFUNCTYPE(HANDLE, DWORD)(("GetStdHandle", windll.kernel32))
        STD_OUTPUT_HANDLE = DWORD(-11)
        STD_ERROR_HANDLE = DWORD(-12)
        GetFileType = WINFUNCTYPE(DWORD, HANDLE)(("GetFileType", windll.kernel32))
        FILE_TYPE_CHAR = 0x0002
        FILE_TYPE_REMOTE = 0x8000
        GetConsoleMode = WINFUNCTYPE(BOOL, HANDLE, POINTER(DWORD))(("GetConsoleMode", windll.kernel32))
        INVALID_HANDLE_VALUE = DWORD(-1).value

        def not_a_console(handle):
            if handle == INVALID_HANDLE_VALUE or handle is None:
                return True
            return ((GetFileType(handle) & ~FILE_TYPE_REMOTE) != FILE_TYPE_CHAR
                    or GetConsoleMode(handle, byref(DWORD())) == 0)

        old_stdout_fileno = None
        old_stderr_fileno = None
        if hasattr(sys.stdout, 'fileno'):
            old_stdout_fileno = sys.stdout.fileno()
        if hasattr(sys.stderr, 'fileno'):
            old_stderr_fileno = sys.stderr.fileno()

        STDOUT_FILENO = 1
        STDERR_FILENO = 2
        real_stdout = (old_stdout_fileno == STDOUT_FILENO)
        real_stderr = (old_stderr_fileno == STDERR_FILENO)

        if real_stdout:
            hStdout = GetStdHandle(STD_OUTPUT_HANDLE)
            if not_a_console(hStdout):
                real_stdout = False

        if real_stderr:
            hStderr = GetStdHandle(STD_ERROR_HANDLE)
            if not_a_console(hStderr):
                real_stderr = False

        if real_stdout or real_stderr:
            # BOOL WINAPI WriteConsoleW(HANDLE hOutput, LPWSTR lpBuffer, DWORD nChars,
            #                           LPDWORD lpCharsWritten, LPVOID lpReserved);

            WriteConsoleW = WINFUNCTYPE(BOOL, HANDLE, LPWSTR, DWORD, POINTER(DWORD), LPVOID)(("WriteConsoleW", windll.kernel32))

            class UnicodeOutput:
                def __init__(self, hConsole, stream, fileno, name):
                    self._hConsole = hConsole
                    self._stream = stream
                    self._fileno = fileno
                    self.closed = False
                    self.softspace = False
                    self.mode = 'w'
                    self.encoding = 'utf-8'
                    self.name = name
                    self.flush()

                def isatty(self):
                    return False

                def close(self):
                    # don't really close the handle, that would only cause problems
                    self.closed = True

                def fileno(self):
                    return self._fileno

                def flush(self):
                    if self._hConsole is None:
                        try:
                            self._stream.flush()
                        except Exception as e:
                            _complain("%s.flush: %r from %r" % (self.name, e, self._stream))
                            raise

                def write(self, text):
                    try:
                        if self._hConsole is None:
                            if isinstance(text, unicode):
                                text = text.encode('utf-8')
                            self._stream.write(text)
                        else:
                            if not isinstance(text, unicode):
                                text = str(text).decode('utf-8')
                            remaining = len(text)
                            while remaining:
                                n = DWORD(0)
                                # There is a shorter-than-documented limitation on the
                                # length of the string passed to WriteConsoleW (see
                                # <http://tahoe-lafs.org/trac/tahoe-lafs/ticket/1232>).
                                retval = WriteConsoleW(self._hConsole, text, min(remaining, 10000), byref(n), None)
                                if retval == 0 or n.value == 0:
                                    raise IOError("WriteConsoleW returned %r, n.value = %r" % (retval, n.value))
                                remaining -= n.value
                                if not remaining:
                                    break
                                text = text[n.value:]
                    except Exception as e:
                        _complain("%s.write: %r" % (self.name, e))
                        raise

                def writelines(self, lines):
                    try:
                        for line in lines:
                            self.write(line)
                    except Exception as e:
                        _complain("%s.writelines: %r" % (self.name, e))
                        raise

            if real_stdout:
                sys.stdout = UnicodeOutput(hStdout, None, STDOUT_FILENO, '<Unicode console stdout>')
            else:
                sys.stdout = UnicodeOutput(None, sys.stdout, old_stdout_fileno, '<Unicode redirected stdout>')

            if real_stderr:
                sys.stderr = UnicodeOutput(hStderr, None, STDERR_FILENO, '<Unicode console stderr>')
            else:
                sys.stderr = UnicodeOutput(None, sys.stderr, old_stderr_fileno, '<Unicode redirected stderr>')
    except Exception as e:
        _complain("exception %r while fixing up sys.stdout and sys.stderr" % (e,))


    # While we're at it, let's unmangle the command-line arguments:

    # This works around <http://bugs.python.org/issue2128>.
    GetCommandLineW = WINFUNCTYPE(LPWSTR)(("GetCommandLineW", windll.kernel32))
    CommandLineToArgvW = WINFUNCTYPE(POINTER(LPWSTR), LPCWSTR, POINTER(c_int))(("CommandLineToArgvW", windll.shell32))

    argc = c_int(0)
    argv_unicode = CommandLineToArgvW(GetCommandLineW(), byref(argc))

    argv = [argv_unicode[i] for i in xrange(0, argc.value)]

#    argv = [argv_unicode[i].encode('utf-8') for i in xrange(0, argc.value)]

    if not hasattr(sys, 'frozen'):
        # If this is an executable produced by py2exe or bbfreeze, then it will
        # have been invoked directly. Otherwise, argv_unicode[0] is the Python
        # interpreter, so skip that.
        argv = argv[1:]

        # Also skip option arguments to the Python interpreter.
        while len(argv) > 0:
            arg = argv[0]
            if not arg.startswith(u"-") or arg == u"-":
                break
            argv = argv[1:]
            if arg == u'-m':
                # sys.argv[0] should really be the absolute path of the module source,
                # but never mind
                break
            if arg == u'-c':
                argv[0] = u'-c'
                break

    # if you like:
    sys.argv = argv
Kevin Edwards
1

Use a different console program. The following works in mintty, the default terminal emulator in Cygwin.

>>> print u"\u5f15\u8d77\u7684\u6216"
引起的或

There are other console alternatives available for Windows but I have not assessed their Unicode support.

Pete Forman
  • My version of Cygwin runs the `bash` shell directly as a console program, using `cp437`, and doesn't have `mintty` installed at all. – Mark Ransom Oct 08 '14 at 18:14
0

This happens simply because the cmd and PowerShell consoles do not support variable-width fonts, and the fixed-width fonts do not include Chinese script. Cygwin has the same limitation.
PuTTY is more advanced: it supports variable-width fonts with Cyrillic, Vietnamese, and Arabic scripts, but not Chinese so far.

HTH

Mat M
-2

Can you try using the program iconv on Windows, and piping your Python output through it? It'd go something like this:

python foo.py | iconv -f utf-8 -t utf-16

You might have to do a little work to get iconv on Windows--it's part of Cygwin but you may be able to build it separately somehow if needed.
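
For reference, a rough Python 3 equivalent of what that `iconv` filter does to the byte stream; the function name `utf8_to_utf16` is hypothetical. It only re-encodes bytes and says nothing about how the console will then interpret them.

```python
def utf8_to_utf16(data):
    """Re-encode UTF-8 bytes as UTF-16, roughly like `iconv -f utf-8 -t utf-16`."""
    return data.decode("utf-8").encode("utf-16")

utf8_bytes = "\u5f15\u8d77\u7684\u6216".encode("utf-8")
utf16_bytes = utf8_to_utf16(utf8_bytes)

# Python's "utf-16" codec prepends a byte order mark; iconv builds may differ.
print(len(utf8_bytes), len(utf16_bytes))
```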

John Zwinck
  • 2
    I'm fairly sure this will just end up as a series of bytes output to the console. The Windows console is not byte-oriented. – Joey Jul 17 '11 at 17:17
  • If that happens, maybe a bespoke Win32 CLI filter program could do it the right way. Like iconv, but written to deal with these quirks by using whatever the "right" output methods are. – John Zwinck Jul 17 '11 at 17:18
  • 4
    If Python inherently only considers byte-wise output instead of characters, then `from win32 import WriteConsole` might help :-) – Joey Jul 17 '11 at 17:20
  • 1
    That was exactly what I needed to run a python script under Cygwin with ``cmd`` and get unicode output, thank you! I ended up with the following command: ``cmd /c "py -3 myscript.py" | iconv -f cp1251 -t utf8`` – a5kin Dec 06 '16 at 12:57
-2

The question is answered in the PrintFails article.

By default, the console in Microsoft Windows only displays 256 characters (cp437, i.e. code page 437, the original 1981 IBM PC extended ASCII character set).

For Russia this means CP866; other countries use their own code pages too. This means that to read Python output in the Windows console correctly, Windows must be configured with a native code page that can display the printed symbols.
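
A small sketch, assuming Python 3, that checks which text a given code page can represent. The helper name `can_encode` and the Russian sample string are illustrative additions, not part of the answer.

```python
def can_encode(text, codepage):
    """Return True if every character of text exists in the given code page."""
    try:
        text.encode(codepage)
    except UnicodeEncodeError:
        return False
    return True

russian = "\u041f\u0440\u0438\u0432\u0435\u0442"   # "Привет" (illustrative sample)
chinese = "\u5f15\u8d77\u7684\u6216"               # the text from the question

for codepage in ("cp437", "cp866"):
    print(codepage, can_encode(russian, codepage), can_encode(chinese, codepage))
```

Neither single-byte code page contains the Chinese characters, which is why changing between such code pages does not help with the question's example.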

I suggest always printing Unicode text without encoding it yourself, to ensure maximum compatibility across platforms.

If you try to print a character that the console encoding cannot represent, you will get a UnicodeEncodeError or see distorted text.
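
One way to avoid the exception, sketched for Python 3, is to substitute a placeholder for characters the target encoding lacks. The helper name `printable` is hypothetical, and this trades the error for lost characters rather than fixing the display.

```python
import sys

def printable(text, encoding=None):
    """Replace characters that the target encoding lacks with '?'."""
    encoding = encoding or sys.stdout.encoding or "ascii"
    return text.encode(encoding, errors="replace").decode(encoding)

text = "abc \u5f15\u8d77"

try:
    text.encode("cp437")
except UnicodeEncodeError as exc:
    # exc.object[exc.start:exc.end] is the part that could not be encoded.
    print("cp437 cannot encode:", ascii(exc.object[exc.start:exc.end]))

print(printable(text, "cp437"))
```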

In some cases, if Python fails to determine the output encoding correctly, you might try setting the PYTHONIOENCODING environment variable. Note, however, that this probably won't work for your example, as your console is unable to present Asian text in its current configuration.
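
A minimal sketch, assuming Python 3, that shows the variable taking effect by asking a child interpreter for its `sys.stdout.encoding`. The function name `child_stdout_encoding` is hypothetical. Setting the variable changes how Python encodes its output; it does not change what the console can display.

```python
import os
import subprocess
import sys

def child_stdout_encoding(ioencoding):
    """Run a child interpreter with PYTHONIOENCODING set; return its stdout encoding."""
    env = dict(os.environ, PYTHONIOENCODING=ioencoding)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
    )
    return out.decode("ascii").strip()

print(child_stdout_encoding("utf-8"))
```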

To reconfigure the console, use Control Panel -> Language and Regional settings -> Advanced (tab) -> Non-Unicode programs language (section). Note that I translated the menu names from Russian.

See also answers for the very similar question.

Basilevs