32

I'm trying to print UTF-8 card symbols (♠, ♥, ♦, ♣) from a Python module to a Windows console. The console I'm using is Git Bash, with Console2 as a front end. I've tried or read about the approaches below, and nothing has worked so far.

  • Made sure the console can handle utf-8 characters. These two tests make me believe that the console isn't the problem.

    [screenshot of the two console tests]

  • Attempt the same thing from the python module.
    When I execute the .py, this is the result.

     print(u'♠')
     UnicodeEncodeError: 'charmap' codec can't encode character '\u2660' in position 0: character maps to <undefined>
    
  • Attempt to encode ♠. This gives me back the character's UTF-8 bytes, but still no spade symbol.

     text = '♠'
     print(text.encode('utf-8'))
     b'\xe2\x99\xa0'
    

I feel like I'm missing a step or not understanding the whole encode/decode process. I've read this, this, and this. The last of the pages suggests wrapping the sys.stdout into the code but this article says using stdout is unnecessary and points to another page using the codecs module.

jonrsharpe
Austin A
  • How are you running the .py? Have you tried setting the `PYTHONIOENCODING` environment variable? – nneonneo Aug 04 '14 at 21:26
  • https://wiki.python.org/moin/PrintFails – Jason S Aug 04 '14 at 21:27
  • just add `#encoding: utf-8` at top of your `.py` – Mazdak Aug 04 '14 at 21:32
  • This question covers a number of possible solutions: http://stackoverflow.com/questions/4374455/how-to-set-sys-stdout-encoding-in-python-3 – Ross Ridge Aug 04 '14 at 23:18
  • Thanks everyone. @nneonneo, I'm executing my .py's from the console. I run the py then open an interactive shell by using this line 'py -3.4 -i myfile.py'. I would also like to avoid making heavy changes like manipulating the 'PYTHONIOENCODING'. But the more I read, the more I realize that my problem lies in the default encoding of the windows console (cp437). – Austin A Aug 04 '14 at 23:28
  • @Jason S, I've definitely come across this article a number of times but I'm still trying to make sense of it all. It's possible the answer lies within it. – Austin A Aug 04 '14 at 23:28
  • @Kasra, the way I understand [python's source encoding](http://legacy.python.org/dev/peps/pep-0263/) is that it tells the OS executing the file what encoding the file should have. What confuses me is this: in an interactive Python session, when I run `print(sys.stdout.encoding)` it outputs cp437, which makes sense since that's the Windows default encoding. However, if I add your suggestion (or any other source encoding) and place that same line in the .py, it still outputs cp437. I'm not sure if this is normal or if git bash/Windows isn't recognizing my source encoding. – Austin A Aug 05 '14 at 03:11

6 Answers

18

Since Python 3.7, you can reconfigure stdout:

import sys
sys.stdout.reconfigure(encoding='utf-8')
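
A slightly more defensive sketch of the same idea, assuming the script may also run where `sys.stdout` has been replaced by an object that has no `reconfigure()` method (for example, a captured stream). The helper name `force_utf8_stdout` is made up for illustration:

```python
import sys

def force_utf8_stdout():
    """Switch sys.stdout to UTF-8 if the stream supports reconfigure()."""
    reconfigure = getattr(sys.stdout, "reconfigure", None)
    if reconfigure is None:
        # Python < 3.7, or stdout was replaced by a non-TextIOWrapper object.
        return False
    reconfigure(encoding="utf-8")
    return True

# Only print the symbols when the stream is known to be UTF-8.
if force_utf8_stdout():
    print("♠ ♥ ♦ ♣")
```

Whether the symbols then display correctly still depends on the terminal and its font; this only changes how Python encodes the output.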
Bensuperpc
15

What I'm trying to do is print utf-8 card symbols (♠,♥,♦,♣) from a python module to a windows console

UTF-8 is a byte encoding of Unicode characters. ♠♥♦♣ are Unicode characters which can be reproduced in a variety of encodings and UTF-8 is one of those encodings—as a UTF, UTF-8 can reproduce any Unicode character. But there is nothing specifically “UTF-8” about those characters.

Other encodings that can reproduce the characters ♠♥♦♣ are Windows code page 850 and 437, which your console is likely to be using under a Western European install of Windows. You can print ♠ in these encodings but you are not using UTF-8 to do so, and you won't be able to use other Unicode characters that are available in UTF-8 but outside the scope of these code pages.

print(u'♠')
UnicodeEncodeError: 'charmap' codec can't encode character '\u2660'

In Python 3 this is the same as the print('♠') test you did above, so there is something different about how you are invoking the script containing this print, compared to your py -3.4. What does sys.stdout.encoding give you from the script?

To get print working correctly you would have to make sure Python picks up the right encoding. If it is not doing that adequately from the terminal settings you would indeed have to set PYTHONIOENCODING to cp437.

>>> text = '♠'
>>> print(text.encode('utf-8'))
b'\xe2\x99\xa0'

print can only print Unicode strings. For other types including the bytes string that results from the encode() method, it gets the literal representation (repr) of the object. b'\xe2\x99\xa0' is how you would write a Python 3 bytes literal containing a UTF-8 encoded ♠.
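
To make the str/bytes distinction concrete, here is a minimal sketch (not part of the original session) showing that encoding produces a different type and that decoding restores the text:

```python
text = "\u2660"                  # str: a single Unicode code point (the spade, U+2660)
data = text.encode("utf-8")      # bytes: the three-byte UTF-8 encoding of that code point

print(len(text), len(data))      # 1 3
print(repr(data))                # b'\xe2\x99\xa0' (what print() shows for a bytes object)

# Decoding with the same encoding gives back the original string.
assert data.decode("utf-8") == text
```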

If what you want to do is bypass print's implicit encoding (sys.stdout.encoding, which PYTHONIOENCODING can override) and substitute your own, you can do that explicitly:

>>> import sys
>>> sys.stdout.buffer.write('♠'.encode('cp437'))

This will of course generate wrong output for any console not running code page 437 (e.g. non-Western-European installs). Generally, for apps using the C stdio, like Python does, getting non-ASCII characters to the Windows console is just too unreliable to bother with.

bobince
  • Thank you for the final line: "Generally, for apps using the C stdio, like Python does, getting non-ASCII characters to the Windows console is just too unreliable to bother with." – michelek May 16 '17 at 08:40
3

Do not encode to utf-8; print Unicode directly instead:

print(u'♠')

See how to print Unicode to Windows console.

jfs
  • Python 3.6 print('♠') (strings are UTF8 by default ) – JinSnow Jan 10 '17 at 20:25
  • @Guillaume wrong. Do not confuse text represented as Unicode strings and text represented as bytes using utf-8 encoding. PEP 528 and 529 have nothing to do with it. – jfs Jan 10 '17 at 21:17
  • thanks for your correction! Do you mean that the default mode of python 3.6 is unicode (and not UTF8) right? – JinSnow Jan 11 '17 at 09:21
  • @Guillaume u'♠' == '♠' in Python 3.3+ or if `from __future__ import unicode_literals` is used. – jfs Jan 11 '17 at 11:11
  • for the (other) rookies: PEP 528 -- Change Windows console encoding to UTF-8 // PEP 529 -- Change Windows filesystem encoding to UTF-8 – JinSnow Jan 11 '17 at 11:36
  • @Guillaume: to be crystal clear: utf-8 is NOT synonym for Unicode. It is just one of *many* character encodings. The fact that utf-8 is mentioned in both the question and the peps is a *coincidence* e.g., `UnicodeEncodeError` may be fixed by installing `win-unicode-console` package that does not use utf-8 anywhere (it works with Unicode strings directly). Follow the link in the answer. – jfs Jan 11 '17 at 12:00
  • @sebastian thanks to your patience I finally understood that point. I still don't understand why I can't print "ç" (french letter) in the console (U+00E7) but If I understood well, it can't be fixed (without risking to get some weird bugs). Do you confirm? (I'm running python 3.6) – JinSnow Jan 11 '17 at 21:08
  • @Guillaume no. You should be able to print any Unicode character (and even to display it correctly if you've configured the font (BMP-only)) Have you read [the linked answer](http://stackoverflow.com/a/32176732/4279)? If it doesn't help; create a minimal code example such as `print('\xe7')` and post it as a new Stack Overflow question with the full traceback. – jfs Jan 11 '17 at 21:42
  • Python 3.6: "the default console on Windows accept all Unicode characters with that version" (well, most of it for me) **BUT** you need to configure the console: right click on the top of the windows (of the cmd or the python IDLE), in default/font choose the "Lucida console". – JinSnow Jan 13 '17 at 20:49
  • @Guillaume: if you click the only link in the answer then you should see the more detailed answer. [The same comment applies (to address your comment)](http://stackoverflow.com/questions/30539882/whats-the-deal-with-python-3-4-unicode-different-languages-and-windows/30551552#comment70486609_30551552). – jfs Jan 13 '17 at 21:18
-1

You can look at it this way. A string is a sequence of characters, not a sequence of bytes. Characters are Unicode code points. Bytes are just numbers in the range 0–255. At the low level, computers work just with sequences of bytes. If you want to print a string, you just call print(a_string) in Python. But to communicate with the OS environment, the string has to be encoded to a sequence of bytes. This is done automatically somewhere under the hood of the print function. The encoding used is sys.stdout.encoding. If you get a UnicodeEncodeError, it means that your characters cannot be encoded using the current encoding.
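
As a rough illustration of that last point, the sketch below checks whether a character can be encoded with the encoding Python reports for stdout. The helper name `can_encode` is hypothetical, and the fallback to `"ascii"` when no encoding is reported is an assumption made for the example:

```python
import sys

def can_encode(text, encoding):
    """Return True if every character in text is representable in encoding."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True

# Some replaced or captured streams report no encoding; assume ASCII in that case.
stream_encoding = getattr(sys.stdout, "encoding", None) or "ascii"

# If this reports False, print("\u2660") on this stream would raise UnicodeEncodeError.
print(stream_encoding, can_encode("\u2660", stream_encoding))
```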

As far as I know, it is currently not possible to run Python on Windows so that the encoding in use can encode every character (as UTF-8 or UTF-16 can) and is both assumed by Python and actually used by the OS environment for input and output. There is a workaround: the win_unicode_console package aims to solve this issue. Install it with pip install win_unicode_console, then import it in your sitecustomize and call win_unicode_console.enable(). This serves as an external patch to your Python installation regarding this issue. See the documentation for more information: https://github.com/Drekin/win-unicode-console.
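
A minimal sketch of what such a `sitecustomize` hook might look like, assuming the package may not be installed and that the code can also run on non-Windows systems. The wrapper name `enable_unicode_console` is made up for illustration; only `win_unicode_console.enable()` comes from the package itself:

```python
import sys

def enable_unicode_console():
    """Try to enable win_unicode_console; return True only if it was enabled."""
    if sys.platform != "win32":
        return False  # the package only targets the Windows console
    try:
        import win_unicode_console  # third-party: pip install win_unicode_console
    except ImportError:
        return False  # not installed; leave the streams untouched
    win_unicode_console.enable()
    return True

enabled = enable_unicode_console()
```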

user87690
  • That is not how it works. As `win_unicode_console` demonstrates, you can *write* any Unicode character (though only BMP characters will be displayed by the Windows console). – jfs Aug 08 '15 at 18:36
  • @J.F.Sebastian: What do you mean? Which part of what I say do you disagree with? – user87690 Aug 08 '15 at 18:46
  • your claim is essentially that all OS I/O interfaces are bytes based. `WriteConsoleW()` (used by `win_unicode_console`) is a counter-example – jfs Aug 08 '15 at 18:48
  • @J.F.Sebastian: I view `WriteConsoleW` as accepting a UTF-16-LE encoded bytes, not a string. – user87690 Aug 08 '15 at 18:52
  • it is incorrect. You should separate the abstraction e.g., `unicode` type in Python 2 and its implementation UCS-2, UCS-4 (narrow, wide builds). The implementation can be improved e.g., Python 3 uses flexible string representation but the abstraction stays the same. In particular, `WriteConsoleW()` might have started as UCS-2 but it is utf-16le now. Windows console itself is still UCS-2 i.e., you can write (and copy/paste) utf-16le but only BMP characters can be displayed (even if the font supports the corresponding astral characters) – jfs Aug 08 '15 at 19:02
  • @J.F.Sebastian: That's what I'm doing: saying that a string is an abstraction, represented by the `unicode` type in Python 2, while when communicating with the OS environment, you cannot use this abstraction directly, you have to encode it somehow. Of course, the implementation of `unicode` also uses some internal encoding. – user87690 Aug 08 '15 at 19:13
  • Again, don't mix the abstraction and how it happens to be implemented. You can use Unicode string type directly. Compare how `os.listdir(unicode_string)` works on Unix (OS uses bytes interfaces) and on Windows (Unicode API). One of the improvements of Python 3 is that it uses Unicode API more on Windows. – jfs Aug 08 '15 at 19:34
  • @J.F.Sebastian: By “you cannot use string directly” I meant the fact that when you want to communicate with the OS environment, you are leaving the Python realm, so you cannot use Python abstractions. So some kind of encoding is needed under the hoods. Using this obvious truth I wanted to explain what is the encoding process good for. By no means I meant that you should encode before calling `print` or `os.listdir`. In fact, I meant the opposite. As I said “the encoding is done under the hoods”, “you should just call `print(a_string)`”, which is exactly what you suggest. – user87690 Aug 08 '15 at 19:54
  • **wrong**. Unicode is not Python abstraction. Maybe an example would help: let's take the number `1.5`. We can store it in memory and on disk using [binary32 format](https://en.wikipedia.org/wiki/Single-precision_floating-point_format): `b'\x00\x00\xc0\x3f'`. If we apply your logic then there are no numeric interfaces there are only bytes interfaces. Integers are encoded as bytes too. Do you consider `pid_t getpid()` to be bytes interface? Byte itself is an abstraction though it is so successful and common that is indistinguishable from reality for some people. – jfs Aug 08 '15 at 20:53
  • @J.F.Sebastian: I don't claim that Unicode is a Python abstraction. I also don't claim that there are only bytes interfaces. Well, it actually depends on what exactly we mean. Yes, integers are encoded as bytes too. And calling a C function e.g. via ctypes needs encoding of Python integers, which are a base for abstract integers, into bytes (of particular width and endianness), which are a level lower. – user87690 Aug 08 '15 at 21:24
  • @J.F.Sebastian: Note that none of your objections is against anything I recommend to actually do in my answer. I think there is just some misunderstanding between us. Maybe we should continue the discussion somewhere else. – user87690 Aug 08 '15 at 21:25
  • these messages are for your benefit. If you don't want to learn, that is fine by me. – jfs Aug 08 '15 at 21:51
  • @J.F.Sebastian: I was addressing the fact that extended discussions should be avoided in comments. And suggesting to move our discussion somewhere else. – user87690 Aug 09 '15 at 08:07
  • If you ever meet in person - wish I could be present :) – michelek May 16 '17 at 09:01
-1

I hit the same problem with Python 3.6 and resolved it by switching to Python 3.7, so simply updating your Python version may fix it.

-1

Late answer, but this error occurs because your terminal's encoding does not support certain characters.
I fixed it on Python 3 using:

import sys
import io

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
print("é, à, ...")
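
To see what that replacement stream does, here is a small sketch that uses an in-memory `io.BytesIO` as a stand-in for the console's byte stream (an assumption made purely so the result can be inspected). Opening a file descriptor in text mode produces an `io.TextIOWrapper`, which is what this builds directly:

```python
import io

raw = io.BytesIO()                              # stand-in for the underlying byte stream
out = io.TextIOWrapper(raw, encoding="utf-8")   # text layer that encodes str to UTF-8 bytes

print("é ♠", file=out, end="")                  # end="" avoids platform newline translation
out.flush()                                     # push the encoded bytes down into raw

# The bytes that reached the underlying stream are the UTF-8 encoding of the text.
assert raw.getvalue() == "é ♠".encode("utf-8")
```

The bytes are only useful if the terminal itself interprets them as UTF-8; otherwise the output is mojibake rather than an exception.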
Pedro Lobito