
I run this snippet twice, in the Ubuntu terminal (encoding set to utf-8), once with ./test.py and then with ./test.py >out.txt:

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

Without redirection it prints garbage. With redirection I get a UnicodeDecodeError. Can someone explain why I get the error only in the second case, or even better give a detailed explanation of what's going on behind the curtain in both cases?

Eric O. Lebigot
zedoo
  • [This](http://stackoverflow.com/questions/368805/python-unicodedecodeerror-am-i-misunderstanding-encode/370199#370199) answer might be of help too. – tzot Jan 22 '11 at 12:33
  • When I try to replicate your finding, I get a UnicodeEncodeError, not a UnicodeDecodeError. https://gist.github.com/jaraco/12abfc05872c65a4f3f6cd58b6f9be4d – Jason R. Coombs Jan 24 '17 at 16:47

3 Answers


The whole key to such encoding problems is to understand that there are in principle two distinct concepts of "string": (1) a string of characters, and (2) a string (array) of bytes. This distinction was mostly ignored for a long time because of the historic ubiquity of encodings with no more than 256 characters (ASCII, Latin-1, Windows-1252, Mac OS Roman,…): these encodings map a set of common characters to numbers between 0 and 255 (i.e. bytes). The relatively limited exchange of files before the advent of the web made this situation of incompatible encodings tolerable: most programs could ignore the fact that multiple encodings existed, as long as the text they produced stayed on the same operating system; such programs would simply treat text as bytes (through the encoding used by the operating system). The correct, modern view properly separates these two string concepts, based on the following two points:

  1. Characters are mostly unrelated to computers: one can draw them on a chalk board, etc., like for instance بايثون, 中蟒 and 🐍. "Characters" for machines also include "drawing instructions" like for example spaces, carriage return, instructions to set the writing direction (for Arabic, etc.), accents, etc. A very large character list is included in the Unicode standard; it covers most of the known characters.

  2. On the other hand, computers do need to represent abstract characters in some way: for this, they use arrays of bytes (numbers between 0 and 255 included), because their memory comes in byte chunks. The necessary process that converts characters to bytes is called encoding. Thus, a computer requires an encoding in order to represent characters. Any text present on your computer is encoded (until it is displayed), whether it be sent to a terminal (which expects characters encoded in a specific way), or saved in a file. In order to be displayed or properly "understood" (by, say, the Python interpreter), streams of bytes are decoded into characters. A few encodings (UTF-8, UTF-16,…) are defined by Unicode for its list of characters (Unicode thus defines both a list of characters and encodings for these characters—there are still places where one sees the expression "Unicode encoding" as a way to refer to the ubiquitous UTF-8, but this is incorrect terminology, as Unicode provides multiple encodings).

In summary, computers need to internally represent characters with bytes, and they do so through two operations:

Encoding: characters → bytes

Decoding: bytes → characters

Some encodings cannot encode all characters (e.g., ASCII), while (some) Unicode encodings allow you to encode all Unicode characters. The encoding is also not necessarily unique, because some characters can be represented either directly or as a combination (e.g. of a base character and of accents).
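The non-uniqueness of the representation can be seen directly in Python 3 with the standard `unicodedata` module (a small sketch: NFC normalization merges a base character and a combining accent into the precomposed form):

```python
import unicodedata

composed = "\u00e9"    # 'é' as one precomposed code point
combining = "e\u0301"  # 'e' followed by a combining acute accent
print(composed == combining)  # False: different code point sequences...
print(unicodedata.normalize("NFC", combining) == composed)  # ...that represent the same character
```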

Note that the concept of newline adds a layer of complication, since it can be represented by different (control) characters that depend on the operating system (this is the reason for Python's universal newline file reading mode).


Some more information on Unicode, characters and code points, if you are interested:

Now, what I have called "character" above is what Unicode calls a "user-perceived character". A single user-perceived character can sometimes be represented in Unicode by combining character parts (base character, accents,…) found at different indexes in the Unicode list, which are called "code points"—these code points can be combined together to form a "grapheme cluster". Unicode thus leads to a third concept of string, made of a sequence of Unicode code points, that sits between byte and character strings, and which is closer to the latter. I will call them "Unicode strings" (like in Python 2).

While Python can print strings of (user-perceived) characters, Python non-byte strings are essentially sequences of Unicode code points, not of user-perceived characters. The code point values are the ones used in Python's \u and \U Unicode string syntax. They should not be confused with the encoding of a character (and do not have to bear any relationship with it: Unicode code points can be encoded in various ways).

This has an important consequence: the length of a Python (Unicode) string is its number of code points, which is not always its number of user-perceived characters: thus s = "\u1100\u1161\u11a8"; print(s, "len", len(s)) (Python 3) gives 각 len 3 despite s having a single user-perceived (Korean) character (because it is represented with 3 code points—even if it does not have to, as print("\uac01") shows). However, in many practical circumstances, the length of a string is its number of user-perceived characters, because many characters are typically stored by Python as a single Unicode code point.
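The inline example above can be run as-is in Python 3; `ord()` also exposes a character's code point value:

```python
# String length counts code points, not user-perceived characters (Python 3).
s = "\u1100\u1161\u11a8"   # a Korean syllable built from 3 jamo code points
print(s, "len", len(s))    # 각 len 3
print("\uac01", "len", len("\uac01"))  # the same syllable as 1 precomposed code point
print(hex(ord("\u1451")))  # ord() gives a character's code point: 0x1451
```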

In Python 2, Unicode strings are called… "Unicode strings" (unicode type, literal form u"…"), while byte arrays are "strings" (str type, where the array of bytes can for instance be constructed with string literals "…"). In Python 3, Unicode strings are simply called "strings" (str type, literal form "…"), while byte arrays are "bytes" (bytes type, literal form b"…"). As a consequence, something like "🐍"[0] gives a different result in Python 2 ('\xf0', a byte) and Python 3 ("🐍", the first and only character).
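The indexing difference can be checked in Python 3 (using U+1F40D, the snake character, whose UTF-8 encoding starts with the byte 0xf0):

```python
s = "\U0001F40D"         # one code point (the snake character)
b = s.encode("utf-8")    # its 4-byte UTF-8 encoding
print(len(s), len(b))    # 1 4
print(b[0])              # 240: indexing bytes gives an integer in Python 3
print(hex(b[0]))         # 0xf0 -- the '\xf0' byte that Python 2's str[0] returns
```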

With these few key points, you should be able to understand most encoding related questions!


Normally, when you print u"…" to a terminal, you should not get garbage: Python knows the encoding of your terminal. In fact, you can check what encoding the terminal expects:

% python
Python 2.7.6 (default, Nov 15 2013, 15:20:37) 
[GCC 4.2.1 Compatible Apple LLVM 5.0 (clang-500.2.79)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys
>>> print sys.stdout.encoding
UTF-8

If your input characters can be encoded with the terminal's encoding, Python will do so and will send the corresponding bytes to your terminal without complaining. The terminal will then do its best to display the characters after decoding the input bytes (at worst the terminal font does not have some of the characters and will print some kind of blank instead).

If your input characters cannot be encoded with the terminal's encoding, then it means that the terminal is not configured for displaying these characters. Python will complain, with a UnicodeEncodeError, since the character string cannot be encoded in a way that suits your terminal. The only possible solution is to use a terminal that can display the characters (either by configuring the terminal so that it accepts an encoding that can represent your characters, or by using a different terminal program). This is important when you distribute programs that can be used in different environments: messages that you print should be representable in the user's terminal. Sometimes it is thus best to stick to strings that only contain ASCII characters.
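This failure mode can be reproduced without any terminal, by encoding explicitly with an encoding (ASCII here) that cannot represent the character; the exception is the same kind that print raises implicitly:

```python
s = u"\u0BC3"  # Tamil vowel sign, from the question's string
try:
    s.encode("ascii")  # ASCII cannot represent this character
except UnicodeEncodeError as exc:
    print("encoding failed:", exc)
```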

However, when you redirect or pipe the output of your program, then it is generally not possible to know what the input encoding of the receiving program is, and the above code returns some default encoding: None (Python 2.7) or UTF-8 (Python 3):

% python2.7 -c "import sys; print sys.stdout.encoding" | cat
None
% python3.4 -c "import sys; print(sys.stdout.encoding)" | cat
UTF-8

The encoding of stdin, stdout and stderr can however be set through the PYTHONIOENCODING environment variable, if needed:

% PYTHONIOENCODING=UTF-8 python2.7 -c "import sys; print sys.stdout.encoding" | cat
UTF-8

If printing to a terminal does not produce what you expect, you can check that the UTF-8 data that you put in manually is correct; for instance, your first character (\u001A) is not printable, if I'm not mistaken.

At http://wiki.python.org/moin/PrintFails, you can find a solution like the following, for Python 2.x:

import codecs
import locale
import sys

# Wrap sys.stdout into a StreamWriter to allow writing unicode.
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout) 

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni

For Python 3, you can check one of the questions asked previously on StackOverflow.

Eric O. Lebigot
    @singularity: Thanks! I added some info for Python 3. – Eric O. Lebigot Dec 28 '10 at 15:06
    Thank you, man! I needed this explanation for such a long time... It's a pity that I can give you only one upvote. – mik01aj Jun 11 '12 at 07:08
    I am glad to have been of help, @m01! One of the motivations for writing this answer was that there were many pages on the web about Unicode and Python, but I found that despite being interesting, they never completely allowed me to solve concrete encoding problems… I truly believe that by keeping in mind the principles found in this answer *and taking the time* to use them when solving concrete encoding problems helps a lot. – Eric O. Lebigot Jun 11 '12 at 08:24
    This is hands down the best explanation of unicode and python ever. The Python Unicode HOWTO should be replaced with this. – stantonk Jan 18 '13 at 19:42
    Here, let me draw the “right-to-left override” character on this chalkboard… – icktoofay Aug 07 '13 at 03:52
  • @icktoofay: Interesting point, thank you. This Unicode character is nonetheless an instruction about how to *draw* characters, though. I amended my answer to reflect the subtlety that you described better than with the "etc." that was used instead before. – Eric O. Lebigot Aug 07 '13 at 08:29
    it is very good explanation but it seems you've mixed user-perceived characters (grapheme clusters in Unicode) that you call just "characters" and Unicode codepoints (a single user-perceived character may be represented using multiple Unicode codepoints). `str` type in Python 3 represents an immutable sequence of Unicode codepoints, not user-perceived characters. Unrelated: for people who landed here due to the question title, you could put `PYTHONIOENCODING` example near the top of your answer. Also, OS may provide Unicode API e.g., `WriteConsoleW()` on Windows (no encoding is necessary). – jfs Dec 17 '15 at 18:10
  • @J.F.Sebastian Good points. I will include the distinction between user-perceived characters and Unicode codepoints. – Eric O. Lebigot Jan 06 '16 at 18:27
    this tip saved me just when I was about to lose my sanity. I thought, it was my newly installed font ! – daparic Aug 06 '16 at 01:38
  • For python3 and windows command line the trick was using setting the encoding before. i.e set PYTHONIOENCODING=utf-8:surrogateescape and then run the program. taken from https://stackoverflow.com/a/7865013/1211174 – oak Apr 09 '18 at 13:49
  • Except for the surrogateescape option, this is precisely illustrated at the end of the second part, right? – Eric O. Lebigot Apr 10 '18 at 15:25

Python always encodes Unicode strings when writing to a terminal, file, pipe, etc. When writing to a terminal Python can usually determine the encoding of the terminal and use it correctly. When writing to a file or pipe Python defaults to the 'ascii' encoding unless explicitly told otherwise. Python can be told what to do when piping output through the PYTHONIOENCODING environment variable. A shell can set this variable before redirecting Python output to a file or pipe so the correct encoding is known.

In your case you've printed 4 uncommon characters that your terminal didn't support in its font. Here are some examples to help explain the behavior, with characters that are actually supported by my terminal (which uses cp437, not UTF-8).

Example 1

Note that the #coding comment indicates the encoding in which the source file is saved. I chose utf8 so I could support characters in the source that my terminal could not. The encoding is printed to stderr so that it can still be seen when stdout is redirected to a file.

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ'
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
αßΓπΣσµτΦΘΩδ∞φ

Python correctly determined the encoding of the terminal.

Output (redirected to file)

None
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-13: ordinal not in range(128)

Python could not determine encoding (None) so used 'ascii' default. ASCII only supports converting the first 128 characters of Unicode.

Output (redirected to file, PYTHONIOENCODING=cp437)

cp437

and my output file was correct:

C:\>type out.txt
αßΓπΣσµτΦΘΩδ∞φ

Example 2

Now I'll throw in a character in the source that isn't supported by my terminal:

#coding: utf8
import sys
uni = u'αßΓπΣσµτΦΘΩδ∞φ马' # added Chinese character at end.
print >>sys.stderr,sys.stdout.encoding
print uni

Output (run directly from terminal)

cp437
Traceback (most recent call last):
  File "C:\ex.py", line 5, in <module>
    print uni
  File "C:\Python26\lib\encodings\cp437.py", line 12, in encode
    return codecs.charmap_encode(input,errors,encoding_map)
UnicodeEncodeError: 'charmap' codec can't encode character u'\u9a6c' in position 14: character maps to <undefined>

My terminal didn't understand that last Chinese character.

Output (run directly, PYTHONIOENCODING=437:replace)

cp437
αßΓπΣσµτΦΘΩδ∞φ?

Error handlers can be specified with the encoding. In this case unknown characters were replaced with ?. ignore and xmlcharrefreplace are some other options. When using UTF8 (which supports encoding all Unicode characters) replacements will never be made, but the font used to display the characters must still support them.
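These error handlers can also be tried directly with `str.encode()` (Python 3 syntax here; plain ASCII is used instead of cp437, but the behavior is the same kind as in the examples above):

```python
s = u"abc\u9a6c"  # ends with the Chinese character from the example
print(s.encode("ascii", errors="replace"))            # b'abc?'
print(s.encode("ascii", errors="ignore"))             # b'abc'
print(s.encode("ascii", errors="xmlcharrefreplace"))  # b'abc&#39532;'
```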

Mark Tolonen
  • It is not exactly true that "When writing to a file or pipe Python defaults to the 'ascii' encoding unless explicitly told otherwise.". In fact, Python 3 uses UTF-8, on Mac OS X/Fink. – Eric O. Lebigot Jan 05 '11 at 10:16
    Yes, Python 3 defaults to 'utf8', but based on the OP's sample, he's using Python 2.X, which defaults to 'ascii'. – Mark Tolonen Jan 05 '11 at 18:49
  • I could not get correct output by manipulating `PYTHONIOENCODING`. Doing `print string.encode("UTF-8")` as suggested by @Ismail worked for me. – tripleee Oct 02 '12 at 04:18
  • you can see Chinese characters if your font supports them even if `chcp` codepage does not support them. [To avoid `UnicodeEncodeError: 'charmap'`, you could install `win-unicode-console` package.](http://stackoverflow.com/a/32176732/4279) – jfs Dec 17 '15 at 18:15
  • My problem is that python-gitlab CLI prints Chinese characters well in cmd but the characters are garbage after being redirected into files. `PYTHONIOENCODING=utf-8` solves the problem. – ElpieKay Oct 14 '19 at 03:02

Encode it while printing

uni = u"\u001A\u0BC3\u1451\U0001D10C"
print uni.encode("utf-8")

This is because when you run the script manually, Python encodes the string before sending it to the terminal; when you pipe the output, Python does not encode it itself, so you have to encode manually when doing I/O.
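As the comments on this answer discuss, Python distinguishes the two situations by checking whether stdout is attached to a terminal; a sketch of the check it effectively relies on (the printed messages here are just illustrative):

```python
import sys

# isatty() is True when stdout is an interactive terminal, and False when it
# is redirected to a pipe or a file; this is why the implicit encoding can
# differ between the two runs.
if sys.stdout.isatty():
    print("stdout is a terminal: Python 2 uses the terminal's encoding")
else:
    print("stdout is redirected: Python 2 falls back to the 'ascii' default")
```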

ismail
    It still does not answer the question WTH is going on here. Why, out of the blue it decides to encode only when redirected, when this is supposed to be completely transparent to the process. – Maxim Sloyko Dec 28 '10 at 11:41
  • Why doesn't python encode it when performing redirection? Does python explicitly check and decide that it'll do things differently just to be difficult? – Arafangion Dec 28 '10 at 11:52
  • Shell intercepts the pipe, Python would have to check if stdout is a pipe. – ismail Dec 28 '10 at 11:53
    does python even have a way to distinguish the two situations? I thougt (until now...) that there's no way it can know. – zedoo Dec 28 '10 at 11:58
    Python can check if the output is a terminal; if it's outputting to a pipe, then the terminal type will be "dumb". I guess "dumb" should tell you why Python doesn't try to do anything automatic in this case: it can fail. – ismail Dec 28 '10 at 12:02
  • @Ismail If I understand it correctly, quite the opposite is going on here: it tries to do something (and fails) when trying to output to the pipe. – Maxim Sloyko Dec 28 '10 at 12:31
  • @maksymko, no, it doesn't do anything when you pipe, so it's trying to interpret data it can't, because it's not encoded. The problem here is that when it's outputting to a terminal it does the work for you. – ismail Dec 28 '10 at 12:32
  • @Ismail, ah, I think I understand it now, thanks. Still, pretty strange behavior, if you ask me. – Maxim Sloyko Dec 28 '10 at 12:42
  • @maksymko the rule of thumb is, always use UTF-8 internally and encode it when doing I/O. – ismail Dec 28 '10 at 12:44
    it produces mojibake if the environment uses a character encoding that is incompatible with utf-8 (e.g., it is common on Windows). Don't hardcode the character encoding of your environment inside your script. Configure your locale, or PYTHONIOENCODING, or install `win-unicode-console` (Windows), or accept a command-line parameter (if you must). – jfs Dec 17 '15 at 18:18