3

Possible Duplicate:
Python, Unicode, and the Windows console

I have a folder with a filename "01 - ナナナン塊.txt"

I open Python at the interactive prompt in the same folder as the file and attempt to walk the folder hierarchy:

Python 3.1.2 (r312:79149, Mar 21 2010, 00:41:52) [MSC v.1500 32 bit (Intel)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> for x in os.walk('.'):
...     print(x)
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "C:\dev\Python31\lib\encodings\cp850.py", line 19, in encode
    return codecs.charmap_encode(input,self.errors,encoding_map)[0]
UnicodeEncodeError: 'charmap' codec can't encode characters in position 17-21: character maps to <undefined>

Clearly the encoding I'm using isn't able to deal with Japanese characters. Fine. But Python 3.1 is meant to be Unicode all the way down, as I understand it, so I'm at a loss as to what I'm meant to do with this. Does anyone have any ideas?

Tom Whittock
  • See http://stackoverflow.com/questions/5419/python-unicode-and-the-windows-console - and ultimately, see: http://wiki.python.org/moin/PrintFails - I think that's what you're looking for. – Thanatos Sep 24 '10 at 18:44
  • Thanatos is correct - it's the print that's failing. I'm sad. I thought Python was easy to use :( – Tom Whittock Sep 24 '10 at 19:01
  • It turns out the problem has nothing to do with files. Unicode support in Python 3 on Windows is a bit patchy: print doesn't work in the console, and files are opened in a non-UTF mode by default (this was the other method I tried before posting here), so I was seemingly without options for dumping out what I was walking over. In addition to the accepted answer, I could also have jumped through the codecs.open hoop to create a file that uses Python's default text type and looked at that. How unpythonic. – Tom Whittock Sep 24 '10 at 22:30

2 Answers

7

It seems like all answers so far are from Unix people who assume the Windows console is like a Unix terminal, which it is not.

The problem is that you can't write Unicode output to the Windows console using the normal underlying file I/O functions. The Windows API WriteConsole needs to be used. Python should probably be doing this transparently, but it isn't.
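
If all you need is for print to stop raising, a cruder stopgap than WriteConsole is to re-wrap the stream so that characters the console code page can't represent are escaped instead of rejected. This is only a sketch: `lossy_text_writer` is a name made up for illustration, and the output shows `\uXXXX` escapes rather than the real characters.

```python
import io


def lossy_text_writer(binary_stream, encoding):
    """Wrap a binary stream so unencodable characters become \\uXXXX escapes.

    Hypothetical helper: it does not make the console display Unicode, it
    only prevents UnicodeEncodeError when writing.
    """
    return io.TextIOWrapper(
        binary_stream,
        encoding=encoding,
        errors="backslashreplace",  # escape instead of raising
        newline="\n",               # avoid a second \r\n translation
        line_buffering=True,
    )


# Possible use on real stdout (read the encoding before detaching):
#   enc = sys.stdout.encoding
#   sys.stdout = lossy_text_writer(sys.stdout.detach(), enc)
```

The encoding has to be read before calling detach(), because the old wrapper is unusable afterwards.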

There's a different problem if you redirect the output to a file: Windows text files are historically in the ANSI codepage, not Unicode. You can fairly safely write UTF-8 to text files in Windows these days, but Python doesn't do that by default.
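
If you only want to dump the output somewhere you can read it, you can skip stdout entirely and open the file with an explicit encoding. A minimal sketch, assuming a hypothetical helper name `write_utf8` and an output path of your choosing:

```python
import os


def write_utf8(path, lines):
    """Write each item of `lines` to `path` as UTF-8, one per line.

    Hypothetical helper; returns the number of lines written.
    """
    count = 0
    # Passing encoding explicitly avoids the platform default (the ANSI
    # code page on Windows).
    with open(path, "w", encoding="utf-8", newline="\n") as out:
        for line in lines:
            out.write(str(line) + "\n")
            count += 1
    return count


# Possible use with the walk from the question:
#   write_utf8("listing.txt", os.walk("."))
```

Open the resulting file in an editor that understands UTF-8 to see the file names.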

I think it should do these things, but here's some code to make it happen. You don't have to worry about the details if you don't want to; just call ConsoleFile.wrap_standard_handles(). You do need PyWin (the pywin32 package) installed to get access to the necessary APIs.

# win32api, win32console and pywintypes come from the pywin32 package.
import sys, io, win32api, win32console, pywintypes

def change_file_encoding(f, encoding):
    """
    TextIOWrapper is missing a way to change the file encoding, so we have to
    do it by creating a new one.
    """

    errors = f.errors
    line_buffering = f.line_buffering
    # f.newlines is not the same as the newline parameter to TextIOWrapper.
    # newlines = f.newlines

    buf = f.detach()

    # TextIOWrapper defaults newline to \r\n on Windows, even though the underlying
    # file object is already doing that for us.  We need to explicitly say "\n" to
    # make sure we don't output \r\r\n; this is the same as the internal function
    # create_stdio.
    return io.TextIOWrapper(buf, encoding, errors, "\n", line_buffering)


class ConsoleFile:
    class FileNotConsole(Exception): pass

    def __init__(self, handle):
        handle = win32api.GetStdHandle(handle)
        self.screen = win32console.PyConsoleScreenBufferType(handle)
        try:
            # GetConsoleMode raises if the handle is not a console (e.g. redirected).
            self.screen.GetConsoleMode()
        except pywintypes.error:
            raise ConsoleFile.FileNotConsole

    def write(self, s):
        self.screen.WriteConsole(s)

    def close(self): pass
    def flush(self): pass
    def isatty(self): return True

    @staticmethod
    def wrap_standard_handles():
        sys.stdout.flush()
        try:
            # There seems to be no binding for _get_osfhandle.
            sys.stdout = ConsoleFile(win32api.STD_OUTPUT_HANDLE)
        except ConsoleFile.FileNotConsole:
            sys.stdout = change_file_encoding(sys.stdout, "utf-8")

        sys.stderr.flush()
        try:
            sys.stderr = ConsoleFile(win32api.STD_ERROR_HANDLE)
        except ConsoleFile.FileNotConsole:
            sys.stderr = change_file_encoding(sys.stderr, "utf-8")

ConsoleFile.wrap_standard_handles()

print("English 漢字 Кири́ллица")

This is a little tricky: if stdout or stderr is the console, we need to output with WriteConsole; but if it's not (e.g. foo.py > file), that won't work, and we need to change the file's encoding to UTF-8 instead.

The opposite in either case will not work. You can't output to a regular file with WriteConsole (it's not actually a byte API, but a UTF-16 one; PyWin hides this detail), and you can't write UTF-8 to a Windows console.
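
To see why the encoding matters, you can check which codecs can represent the file name from the question at all. This sketch uses only the standard codecs; `can_encode` is a helper name made up here, and cp850 is the code page that appears in the question's traceback.

```python
def can_encode(text, encoding):
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


NAME = "01 - ナナナン塊.txt"

# cp850 is the console code page from the traceback; utf-16-le is the form
# a UTF-16 API such as WriteConsole works with; utf-8 is the file fallback.
for enc in ("cp850", "utf-8", "utf-16-le"):
    print(enc, can_encode(NAME, enc))
# prints:
#   cp850 False
#   utf-8 True
#   utf-16-le True
```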

Also, it really should be using _get_osfhandle to get the handle to stdout and stderr, rather than assuming they're assigned to the standard handles, but that API doesn't seem to have any PyWin binding.
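
For what it's worth, Python's own standard library does expose this call on Windows as msvcrt.get_osfhandle. The sketch below only shows how to fetch the handle value; `os_handle_for` is a hypothetical name, and I haven't tried feeding the result into the ConsoleFile class above, so treat that part as an open question.

```python
import sys

try:
    import msvcrt  # standard library, available on Windows only
except ImportError:
    msvcrt = None


def os_handle_for(stream):
    """Return the OS file handle behind `stream`, or None if unavailable.

    Hypothetical helper: returns None off Windows, or when the stream has
    no real file descriptor (e.g. an in-memory stream).
    """
    if msvcrt is None:
        return None
    try:
        return msvcrt.get_osfhandle(stream.fileno())
    except (OSError, ValueError):
        return None


# Possible use: handle = os_handle_for(sys.stdout)
```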

Glenn Maynard
  • +1 – you seem to be the first to actually understand the problem. I think the problem with `WriteConsoleW` vs. `WriteFile` is known in the Python community, but actually implementing the distinction seems to be difficult or at least unpopular. – Philipp Sep 24 '10 at 21:44
  • Python is developed largely by Unix people, and spending time on the odd details of other people's platforms is never appealing--but this really is important. Major parts of Python on Windows (e.g. `print`) should *not* be limited to '95-era (actually, these date back to DOS) ANSI codepages. – Glenn Maynard Sep 24 '10 at 21:49
  • Wow. This is what I need to do to display a unicode string in the standard command window in Windows. If it wasn't so sad, it would be funny. Thank you very much for doing all that hard work of implementing the output streams properly. – Tom Whittock Sep 24 '10 at 22:22
  • Fortunately Python seems to be less Linux-centric than many other OSS projects: the developers are actively working towards better Windows support and accept that Windows is an important platform and not the devil himself. If somebody submitted a patch to switch console output to `WriteConsoleW` it would have a high chance of being integrated. – Philipp Sep 24 '10 at 23:30
  • @Tom: consider yourself lucky that Python can even cope with Unicode filenames. Try this with something like PHP or Ruby and you wouldn't even be able to open the file. It's hugely unfortunate that the MS C runtime (on which Python and other languages are built) insists on using the system default codepage for stdio byte interfaces instead of UTF-8. – bobince Sep 25 '10 at 13:47
  • @bobince: It does that for compatibility, which is something--let's be honest--Windows is far better at than Linux, which in general doesn't care about backwards compatibility beyond maybe a year or so at all. (Try building a binary for a Linux system that's five years old.) That said, it'd help a lot if Windows had an API call to change the ACP to UTF-8; one gets the sense that they don't do *that* on purpose, just to make the lives of non-Windows-centric programmers harder... – Glenn Maynard Sep 25 '10 at 20:16
-2

For hard-coded strings, you'll need to specify the encoding at the top of source files. For byte strings that come from some other source, such as os.walk, you need to specify the byte string's encoding (see unutbu's answer).

André Caron
  • There are no byte strings in Windows, only UTF-16 strings. – Philipp Sep 24 '10 at 21:44
  • @Philipp: All Windows NT-based kernels know only UTF-16 strings. You can still invoke the ANSI versions of all Win32 APIs, such as `FindFirstFileA()`, to get a folder listing containing what Python calls bytestrings. I assume this is what Python does because on my Windows machine, `os.walk()` with Python 2.6.5 returns items of class `str`, which are byte strings. – André Caron Sep 24 '10 at 22:11
  • I'm using Python 3, which is entirely utf-8. http://www.python.org/dev/peps/pep-3120/ – Tom Whittock Sep 24 '10 at 22:21
  • Strings in Python 3 are either UTF-16 or UTF-32, but not UTF-8. – Philipp Sep 24 '10 at 23:15
  • @Philipp: sorry, i was responding to the source file encoding thing, should have made that clearer – Tom Whittock Sep 24 '10 at 23:23