
Here's a little program:

#!/usr/bin/env python
# -*- encoding: utf-8 -*-

print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')  
print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')

On Ubuntu, Gnome terminal, IPython does what I would expect:

In [6]: run Unicodetest.py
abcd kΩ ☠ °C √Hz µF ü ☃ ♥
abcd kΩ ☠ °C √Hz µF ü ☃ ♥

I get the same output if I enter the commands on trypython.org.

codepad.org, on the other hand, produces an error for the second command:

abcd kΩ ☠ °C √Hz µF ü ☃ ♥
Traceback (most recent call last):
  Line 6, in <module>
    print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
UnicodeEncodeError: 'ascii' codec can't encode character u'\u03a9' in position 6: ordinal not in range(128)

Contrariwise, IDLE on Windows mangles the output of the first command, but doesn't complain about the second:

>>>
abcd kÎ© â˜  Â°C âˆšHz ÂµF Ã¼ â˜ƒ â™¥
abcd kΩ ☠ °C √Hz µF ü ☃ ♥

IPython in a Windows command prompt or through Python(x,y)'s Console2 version both mangle the first output and complain about the second:

In [9]: run Unicodetest.py
abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ
ERROR: An unexpected error occurred while tokenizing input
The following traceback may be corrupted or invalid
The error message is: ('EOF in multi-line statement', (15, 0))

---------------------------------------------------------------------------
UnicodeEncodeError                        Traceback (most recent call last)

Desktop\Unicodetest.py in <module>()
      4 print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
      5
----> 6 print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
      7
      8

C:\Python27\lib\encodings\cp437.pyc in encode(self, input, errors)
     10
     11     def encode(self,input,errors='strict'):
---> 12         return codecs.charmap_encode(input,errors,encoding_map)
     13
     14     def decode(self,input,errors='strict'):

UnicodeEncodeError: 'charmap' codec can't encode character u'\u2620' in position 8: character maps to <undefined>
WARNING: Failure executing file: <Unicodetest.py>

IPython inside Python(x,y)'s Spyder does the same, but differently:

In [8]: run Unicodetest.py
abcd kÎ© â˜  Â°C âˆšHz ÂµF Ã¼ â˜ƒ â™¥
------------------------------------------------------------
Traceback (most recent call last):
  File "Unicodetest.py", line 6, in <module>
    print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')
  File "C:\Python26\lib\encodings\cp1252.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_table)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u03a9' in position 6: character maps to <undefined>

WARNING: Failure executing file: <Unicodetest.py>

(In sitecustomize.py, Spyder sets its own SPYDER_ENCODING based on the locale module's encoding, which is cp1252 for Windows 7.)

What gives? Is one of my commands wrong? Why does one work on some platforms while the other works on other platforms? How do I print Unicode characters consistently without crashing or screwing up?

Is there an alternate terminal for Windows that behaves like the one in Ubuntu? It seems that TCC-LE, Console2, Git Bash, PyCmd, etc. are all just wrappers for cmd.exe rather than replacements. Is there a way to run IPython inside the interface that IDLE uses?

endolith
  • In IPython unicode is unfortunately broken. We should have it fixed for the next version, 0.11, so it behaves like typing at a raw Python interpreter. – Thomas K Apr 18 '11 at 23:00
  • check [this](http://stackoverflow.com/q/39528462/5284370) out. – Soorena Sep 18 '16 at 13:21

5 Answers

12

I/O in Python (and most other languages) is based on bytes. When you write a byte string (str in 2.x, bytes in 3.x) to a file, the bytes are simply written as-is. When you write a Unicode string (unicode in 2.x, str in 3.x) to a file, the data needs to be encoded to a byte sequence.

For a further explanation of this distinction see the Dive into Python 3 chapter on strings.
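The distinction can be seen directly in the interpreter. A minimal sketch (the u'' and b'' literal prefixes are written so the snippet is valid in both Python 2 and Python 3.3+):

```python
# -*- coding: utf-8 -*-
text = u'kΩ ☠'                       # a Unicode string: a sequence of code points
data = text.encode('utf-8')          # a byte string: what actually goes down the pipe
print(repr(data))                    # b'k\xce\xa9 \xe2\x98\xa0' (on Python 3)
print(data.decode('utf-8') == text)  # decoding reverses the encoding: True
```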

print('abcd kΩ ☠ °C √Hz µF ü ☃ ♥')

Here, the string is a byte string. Because the encoding of your source file is UTF-8, the bytes are

'abcd k\xce\xa9 \xe2\x98\xa0 \xc2\xb0C \xe2\x88\x9aHz \xc2\xb5F \xc3\xbc \xe2\x98\x83 \xe2\x99\xa5'

The print statement writes these bytes to the console as-is. But the Windows console interprets byte strings as being encoded in the "OEM" code page, which in the US is 437. So the string you actually see on your screen is

abcd k╬⌐ Γÿá ┬░C ΓêÜHz ┬╡F ├╝ Γÿâ ΓÖÑ

On your Ubuntu system, this doesn't cause a problem because there the default console encoding is UTF-8, so you don't have the discrepancy between source file encoding and console encoding.
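The mangling is reproducible without a Windows console: take the UTF-8 bytes and decode them as cp437, the way the console does (a sketch):

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes misread as cp437 produce exactly the mojibake seen on the console.
utf8_bytes = u'abcd kΩ ☠'.encode('utf-8')
print(utf8_bytes.decode('cp437'))  # abcd k╬⌐ Γÿá
```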

print(u'abcd kΩ ☠ °C √Hz µF ü ☃ ♥')

When printing a Unicode string, the string has to get encoded into bytes. But it only works if you have an encoding that supports those characters. And you don't.

  • The default IBM437 encoding lacks the characters ☠☃♥.
  • The windows-1252 encoding used by Spyder lacks the characters Ω☠√☃♥.

So, in both cases, you get a UnicodeEncodeError trying to print the string.
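The failure, and a common workaround, can be seen in isolation (a sketch; `errors='replace'` substitutes '?' for anything the target code page lacks, which avoids the crash at the cost of losing those characters):

```python
# -*- coding: utf-8 -*-
s = u'kΩ ☠'
try:
    s.encode('cp437')  # raises: ☠ is not in code page 437
except UnicodeEncodeError as e:
    print(e)
# Degrade gracefully instead of crashing:
print(s.encode('cp437', errors='replace'))  # b'k\xea ?' — Ω exists in cp437, ☠ does not
```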

What gives?

Windows and Linux took vastly different approaches to supporting Unicode.

Originally, they both worked pretty much the same way: Each locale has its own language-specific char-based encoding (the "ANSI code page" in Windows). Western languages used ISO-8859-1 or windows-1252, Russian used KOI8-R or windows-1251, etc.
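A consequence of this scheme is that the very same byte names a different character in each locale's code page (a sketch):

```python
# One byte, three locales, three different characters.
raw = b'\xc2'
print(raw.decode('cp1252'))     # Â  (Western)
print(raw.decode('koi8_r'))     # б  (Russian)
print(raw.decode('iso8859_7'))  # Β  (Greek)
```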

When Windows NT added support for Unicode (in the early days, when it was assumed that Unicode would use 16-bit characters), it did so by creating a parallel version of its API that used wchar_t instead of char. For example, the MessageBox function was split into two functions:

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

The "W" functions are the "real" ones. The "A" functions exist for backwards compatibility with DOS-based Windows and mostly just convert their string arguments to UTF-16 and then call the corresponding "W" function.

In the Unix world (specifically, Plan 9), writing a whole new version of the POSIX API was seen as impractical, so Unicode support was approached in a different manner. The existing support for multi-byte encoding in CJK locales was used to implement a new encoding now known as UTF-8.

The preference towards UTF-8 on Unix-like systems and UTF-16 on Windows is a huge pain in the ass when writing cross-platform code that supports Unicode. Python tries to hide this from the programmer, but printing to the console is one of Joel's "leaky abstractions".

dan04
  • That's very helpful, thanks. I still want to know if there's a way to make "print" work in IPython in Windows, whether in the built-in Windows console or in some other third-party console (if such a thing exists). If it's not possible to display the special characters, I'd at least like to print "?" or something without crashing. – endolith Apr 18 '11 at 21:08
  • @christian: Yes, Notepad++ can save in UTF-8, but that doesn't appear to be the issue here. The problem is that the encoding of the file does not match the encoding of stdout. – dan04 Apr 18 '11 at 21:58
  • If a module is outputting a string like `u'G\xc3\xb6teborg, Sweden'`, isn't this incorrect? It should be either `u'G\xf6teborg, Sweden'`, or, after encoding to UTF-8, `'G\xc3\xb6teborg, Sweden'` without the `u`. – endolith Apr 24 '11 at 00:21
  • I believe it is, and the solution is `u'G\xc3\xb6teborg, Sweden'.encode('raw_unicode_escape')` → `'G\xc3\xb6teborg, Sweden'` – endolith Apr 25 '11 at 14:43
2

There are two possible reasons:

  • Encoding of Unicode by print. You cannot output raw Unicode, so print needs to figure out how to convert it to the byte stream expected by the console (it uses sys.stdout.encoding AFAIK), which brings us to
  • Console support. Python does not control your terminal, so if it spits out UTF-8 while your terminal expects something else, you'll get mangled output.
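Both points can be checked from Python itself. A sketch (`safe_print` is a hypothetical helper name, not a standard function):

```python
import sys

print(sys.stdout.encoding)  # the encoding print() will use for Unicode output

def safe_print(s, stream=sys.stdout):
    # Hypothetical helper: encode with errors='replace' so characters the
    # console's code page lacks become '?' instead of raising UnicodeEncodeError.
    enc = getattr(stream, 'encoding', None) or 'ascii'
    stream.write(s.encode(enc, errors='replace').decode(enc) + '\n')

safe_print(u'abcd kΩ ☠')  # prints, degrading gracefully on narrow code pages
```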
Rafał Dowgird
0

Your problem here is that your program expects, and outputs, UTF-8, but the consoles and the various web-based Python runners use other code pages. There is no way to write special characters that display correctly under every encoding; however, if you use UTF-8 everywhere, you should be safe.

I think any terminal in Windows will do - so don't bother switching out the default one (cmd.exe) just because of this. Instead, change the encoding of the terminal to be UTF-8 as well, to match the encoding of your python script.

Unfortunately, I've never been able to find a way to set the code page to UTF-8 as default, so it has to be done every time you open a new command prompt. But it's done via a simple command, so it's only half-bad... You change the encoding by switching codepage:

>chcp 65001
Active code page: 65001

Note that you have to use one of the standard fonts for this to work. Most sources on the web seem to suggest Lucida Console.

Tomas Aschan
  • Now every command I try fails with `LookupError: unknown encoding: cp65001` due to `line = raw_input_original(prompt).decode(self.stdin_encoding)` in `C:\Python27\lib\site-packages\IPython\iplib.pyc` – endolith Apr 17 '11 at 18:32
  • There are, unfortunately, many problems with `chcp 65001`. The Microsoft C runtime and the default Windows console are designed to work with locale-specific code pages; when everyone else is moving to UTF-8-for-everything this is a real shame. – bobince Apr 18 '11 at 20:30
0

Unicode output from Python to the Windows console just doesn't work. The console's native interface expects wide characters (UCS-2/UTF-16), and Python can't be persuaded to emit that.

David Heffernan
  • I'm delighted to be down voted here since it means I am wrong and will finally be able to get good unicode support in a Windows console. Now I'm just waiting for the details of how to do that. – David Heffernan Apr 17 '11 at 19:03
  • Well... you can't even ‘just output UCS-2’ with the standard C runtime, it always uses a locale-specific ASCII-superset codepage (never a UTF of any kind). There is a separate Win32-specific interface which can be used to output Unicode content, `WriteConsoleW`, but then you have to decide whether outputting bytes or characters is what you mean to do, which might depend on the platform, or whether your IO streams are being redirected to file. It's all a bit of a mess, this. – bobince Apr 18 '11 at 20:35
  • @bobince it turns out that's a myth as exposed by Michael Kaplan: http://blogs.msdn.com/b/michkap/archive/2008/03/18/8306597.aspx Sing ho for `_O_U16TEXT`! – David Heffernan Apr 18 '11 at 20:38
  • The key phrase is **standard** C runtime. Kaplan's example uses Windows-specific functions. – dan04 Apr 18 '11 at 21:49
  • @dan04 scroll down to the bottom and you'll see what I mean. Also there is no standard C runtime. You mean the MS C runtime. – David Heffernan Apr 18 '11 at 21:52
0

@dan04: You are right that the problem is that the encoding of the file does not match the encoding of stdout. Nevertheless, one way to solve the problem is to change the encoding of the file: on Windows, Notepad++ can be used to save the code with UTF-8 character encoding.

An alternative is GNU recode.

Christian