I'm on the CMD in Windows 8 and I've set the codepage to 65001 (chcp 65001). I'm using Python 2.7.2 (ActivePython 2.7.2.5) and I've set the PYTHONSTARTUP environment variable to "bootstrap.py".

bootstrap.py:

import codecs
codecs.register(
    lambda name: name == 'cp65001' and codecs.lookup('UTF-8') or None
)
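
Spelled out as a named function, the registration above might look like the sketch below. The name `cp65001_search` is made up for illustration; the behaviour is meant to match the lambda (return the UTF-8 codec for `cp65001`, and `None` for anything else so other search functions get a chance).

```python
import codecs


def cp65001_search(name):
    """Codec search function: map the name 'cp65001' to the UTF-8 codec."""
    if name == 'cp65001':
        return codecs.lookup('utf-8')
    return None  # let other search functions handle every other name


codecs.register(cp65001_search)

# After registration, 'cp65001' can be used like any other encoding name.
encoded = u'\xe6\xf8\xe5'.encode('cp65001')  # the string u'æøå'
decoded = encoded.decode('cp65001')
```

This only teaches Python the codec name. It does not change how the bytes are written to the console.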

This lets me print ASCII:

>>> print 'hello'
hello
>>> print u'hello'
hello

But the errors I get when I try to print a Unicode string with non-ASCII characters make no sense to me. Here I try to print a few strings containing Nordic characters (I added an extra line break between the prints for readability):

>>> print u'æøå'
��øåTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 2] No such file or directory

>>> print u'åndalsnes'
��ndalsnes

>>> print u'åndalsnesæ'
��ndalsnesæTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument

>>> print u'Øst'
��st

>>> print u'uØst'
uØstTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument

>>> print u'ØstÆØÅæøå'
��stÆØÅæøåTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument

>>> print u'_ØstÆØÅæøå'
_ØstÆØÅæøåTraceback (most recent call last):
  File "<stdin>", line 1, in <module>
IOError: [Errno 22] Invalid argument

As you can see, it doesn't always raise an error (and doesn't even raise the same error every time), and the Nordic characters are only occasionally displayed correctly.

Can somebody explain this behavior, or at least help me figure out how to print Unicode to the CMD correctly?

Nilesh
Hubro
  • 2
    This is a nightmare situation. And it's been discussed a gazillion times here on SO and elsewhere. For example: http://www.google.com/search?q=print+unicode+windows+console+python – David Heffernan Nov 19 '12 at 12:17
  • The simplest solution is to use Python 3.3, if you can. It has a [cp65001 codec](http://docs.python.org/3/whatsnew/3.3.html#codecs). – Eryk Sun Nov 19 '12 at 15:31
  • @PiotrDobrogost: Please refer me to another case like this if you can find it (and I **don't** mean Unicode decode errors!) – Hubro Nov 19 '12 at 18:04
  • 2
    @DavidHeffernan: I've had a look through the search results and the closest thing I can find to a canonical answer is what the OP is already doing. It seems to me that either this is a new variant or the question has never really been properly answered? – Harry Johnston Nov 19 '12 at 20:44
  • @HarryJohnston Personally, I've never found a satisfactory solution. – David Heffernan Nov 19 '12 at 20:48
  • 1
    At least there's improved support for Windows code pages in 3.3: [PyUnicode_EncodeCodePage](http://docs.python.org/3/c-api/unicode.html#PyUnicode_EncodeCodePage). The latter is used by `codecs.code_page_encode`, which the new cp65001 codec uses to define `encode = functools.partial(codecs.code_page_encode, 65001)`, and similar for decoding. – Eryk Sun Nov 19 '12 at 22:03
  • @eryksun: Any idea if that's portable to 2.7? – Hubro Nov 20 '12 at 14:26
  • 2
    Currently the `PRINT_ITEM` op calls `PyFile_WriteObject`, which calls `PyObject_Print`, which eventually calls `PyString_Type.tp_print`, which writes to stdout using libc `fwrite`. At issue is a bug that causes the stdout `FILE` stream to have its error flag set, even though no error has occurred (hence the random 'errors' reported) because `write` returns the number of characters written instead of the number of bytes. You can verify this by using `os.write(sys.stdout.fileno(), s)`, where `s` is a non-ASCII UTF-8 string. – Eryk Sun Nov 20 '12 at 21:30
  • 2
    This isn't an issue in Python 3 since it implements its own buffering (`_io.BufferedWriter`), and the underlying `_io.FileIO` does a low-level `write` to the target file descriptor. – Eryk Sun Nov 20 '12 at 21:39
  • See [this answer](http://stackoverflow.com/q/878972/205580) for some workarounds. tzot's answer seems simple enough. – Eryk Sun Nov 20 '12 at 21:43
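
The check suggested in the comments above (calling `os.write` directly and looking at its return value) could be sketched roughly as follows. The helper name `write_and_compare` is hypothetical. The claim that the console returns a character count rather than a byte count comes from the comment, and this sketch only makes the two numbers visible side by side.

```python
import os
import sys


def write_and_compare(fd, text):
    """Write text as UTF-8 to fd; return (chars, bytes, value os.write returned)."""
    data = text.encode('utf-8')
    returned = os.write(fd, data)
    return len(text), len(data), returned


if __name__ == '__main__':
    # File descriptor 1 is stdout, i.e. what sys.stdout.fileno() gives in a
    # normal interactive session.
    chars, nbytes, returned = write_and_compare(1, u'\xe6\xf8\xe5\n')
    sys.stderr.write('chars=%d bytes=%d os.write returned=%d\n'
                     % (chars, nbytes, returned))
```

On a pipe or a regular file the returned value should equal the byte count. According to the comment, the mismatch only shows up when writing to the console under code page 65001.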

1 Answer

Try this:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
print u'æøå'

Using `from __future__ import unicode_literals` is also useful in an interactive Python session, because it turns plain string literals into Unicode strings.

It is certainly possible to write Unicode to the console successfully using WriteConsoleW. This works regardless of the console code page, including 65001. The code here does so (it's for Python 2.x, but you'd be calling WriteConsoleW from C anyway).

WriteConsoleW has one bug that I know of, which is that it fails when writing more than 26608 characters at once. That's easy to work around by limiting the amount of data passed in a single call.
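
A minimal sketch of that workaround, assuming a `ctypes` call into `kernel32`, might look like this. The helper names `iter_chunks` and `write_console` and the chunk size of 10000 are illustrative choices, not part of any library. The sketch also assumes stdout is a real console (not redirected) and that the text contains only BMP characters, so that `len()` matches the number of UTF-16 code units.

```python
import sys

# Assumed chunk size: comfortably below the 26608-character limit mentioned
# above. The exact value is an illustrative choice.
MAX_CHARS_PER_CALL = 10000


def iter_chunks(text, size=MAX_CHARS_PER_CALL):
    """Yield consecutive slices of text, each at most `size` characters long."""
    for start in range(0, len(text), size):
        yield text[start:start + size]


def write_console(text):
    """Write Unicode text to the Windows console in bounded chunks."""
    if sys.platform != 'win32':
        # Not on Windows: fall back to the normal stream.
        sys.stdout.write(text)
        return

    import ctypes
    from ctypes import wintypes

    kernel32 = ctypes.WinDLL('kernel32', use_last_error=True)
    kernel32.GetStdHandle.argtypes = [wintypes.DWORD]
    kernel32.GetStdHandle.restype = wintypes.HANDLE
    kernel32.WriteConsoleW.argtypes = [
        wintypes.HANDLE, wintypes.LPCWSTR, wintypes.DWORD,
        ctypes.POINTER(wintypes.DWORD), ctypes.c_void_p,
    ]
    kernel32.WriteConsoleW.restype = wintypes.BOOL

    STD_OUTPUT_HANDLE = 0xFFFFFFF5  # (DWORD)-11
    handle = kernel32.GetStdHandle(STD_OUTPUT_HANDLE)
    written = wintypes.DWORD(0)
    for chunk in iter_chunks(text):
        # Each call stays below the limit, so the long-write bug is avoided.
        ok = kernel32.WriteConsoleW(handle, chunk, len(chunk),
                                    ctypes.byref(written), None)
        if not ok:
            raise ctypes.WinError(ctypes.get_last_error())
```

Calling `write_console(u'\xe6\xf8\xe5\n')` from a console session would then bypass the C runtime's `write` entirely. A more careful version would also check `written` and retry on partial writes.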

Fonts are not Python's problem, but encoding is. It doesn't make sense to fail to output the right characters just because some users might not have selected fonts that can display those characters. This bug should be reopened.

(For completeness, it is possible to display Unicode on the console using fonts other than Lucida Console and Consolas, but it requires a registry hack.) I hope it helps.

Community
Soheil__K
  • I believe WriteConsoleW is limited to UCS-2, i.e., you can't use characters from the supplementary planes. But in most cases this shouldn't be a problem. – Harry Johnston Jul 09 '14 at 20:09