373

When piping the output of a Python program, the Python interpreter gets confused about encoding and sets it to None. This means a program like this:

# -*- coding: utf-8 -*-
print u"åäö"

will work fine when run normally, but fail with:

UnicodeEncodeError: 'ascii' codec can't encode character u'\xa0' in position 0: ordinal not in range(128)

when used in a pipe sequence.

What is the best way to make this work when piping? Can I just tell it to use whatever encoding the shell/filesystem/whatever is using?

The suggestions I have seen so far are to modify site.py directly, or to hardcode the default encoding using this hack:

# -*- coding: utf-8 -*-
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
print u"åäö"

Is there a better way to make piping work?

Peter Mortensen
Joakim Lundborg
    See also http://stackoverflow.com/questions/4545661/unicodedecodeerror-when-redirecting-to-file – ShreevatsaR Oct 29 '13 at 06:13
    If you have this problem on windows, you can also run `chcp 65001` before executing your script. This can have issues, but it often helps, and doesn't require a lot of typing (less than `set PYTHONIOENCODING=utf_8`). – Tomasz Gandor Oct 13 '17 at 16:08
  • The chcp command is not the same as setting PYTHONIOENCODING. I think chcp is just configuration for the terminal itself and has nothing to do with writing to a file (which is what you are doing when piping stdout). Try `setx PYTHONIOENCODING utf-8` to make it permanent if you want to save typing. – ejm Aug 07 '19 at 12:17
  • I faced a somewhat related issue, and found a solution here --> https://stackoverflow.com/questions/48782529/exclude-ansi-escape-sequences-from-output-log-file – TOI 700 e May 18 '20 at 11:09
  • @Tomasz, Great! Your environment variable, is the simplest and thus the best solution to overcoming this annoying thing! – Apostolos Jul 31 '20 at 09:12

12 Answers

170

Your code works when run in a script because Python encodes the output to whatever encoding your terminal application is using. If you are piping, you must encode the output yourself.

A rule of thumb is: Always use Unicode internally. Decode what you receive, and encode what you send.

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

Another didactic example is a Python program that converts from ISO-8859-1 to UTF-8, uppercasing everything in between.

import sys
for line in sys.stdin:
    # Decode what you receive:
    line = line.decode('iso8859-1')

    # Work with Unicode internally:
    line = line.upper()

    # Encode what you send:
    line = line.encode('utf-8')
    sys.stdout.write(line)

Setting the system default encoding is a bad idea, because some modules and libraries you use may rely on it being ASCII. Don't do it.
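If encoding at every call site feels error-prone, the rule of thumb above can be concentrated in one place. Below is a sketch (the helper name `write_text` is mine, not from this answer) that uses the stream's declared encoding and falls back to UTF-8 when the stream has none, which is exactly the piping case:

```python
import io
import sys


def write_text(text, stream=None, fallback='utf-8'):
    """Encode `text` with the stream's declared encoding, or with
    `fallback` when the stream has none (e.g. when stdout is a pipe)."""
    if stream is None:
        stream = sys.stdout
    enc = getattr(stream, 'encoding', None) or fallback
    data = text.encode(enc)
    # Python 3 text streams expose their raw byte buffer; Python 2's
    # sys.stdout (and any binary stream) accepts bytes directly.
    raw = getattr(stream, 'buffer', stream)
    raw.write(data)


# Pretend buf is a piped stdout: it has no encoding attribute,
# so the fallback is used.
buf = io.BytesIO()
write_text(u"åäö\n", stream=buf)
```

This keeps the "Unicode internally, encode at the edge" discipline without sprinkling `.encode()` calls through the program.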

Peter Mortensen
nosklo
    The problem is that the user doesn't want to specify the encoding explicitly. He just wants to use Unicode for I/O. And the encoding he uses should be the encoding specified in the locale settings, not in the terminal application settings. AFAIK, Python 3 uses the *locale* encoding in this case. Changing `sys.stdout` seems like a more pleasant way. – Andrey Vlasovskikh Apr 02 '10 at 22:01
    Encoding / decoding every string explicitly is bound to cause bugs when an encode or decode call is missing, or added once too many, somewhere. The output encoding can be set when output is a terminal, so it can be set when output is not a terminal. There is even a standard LC_CTYPE environment variable to specify it. It is a bug in Python that it doesn't respect this. – Rasmus Kaj May 31 '10 at 15:34
  • @Rasmus Kaj: If you consistently use a defined function for output you can be sure that it won't be missing or duplicated. Output encoding can't be "set". Accepting only unicode on `sys.stdout` (by replacing it with `codecs.getwriter`) breaks a lot of libraries in practice. – nosklo May 31 '10 at 20:48
    This answer is wrong. You should *not* be manually converting on each input and output of your program; that's brittle and completely unmaintainable. – Glenn Maynard Apr 23 '12 at 23:29
    @Glenn Maynard : so what is IYO the right answer? It's more helpful to tell us than just say *'This answer is wrong'* – smci Sep 18 '12 at 11:10
    What libraries relies on stdout to only accept ASCII? Considering the amount of data that is not 7-bit ASCII that seems to be a very bad idea. – Erik Johansson Apr 02 '13 at 12:40
    @ErikJohansson: it is not about stdout accepting whatever encoding. `sys.getdefaultencoding()` is used in many places e.g., `"а" + u"a"` expression uses it. Changing `sys.getdefaultencoding()` may introduce data-dependent bugs that might corrupt your data silently. – jfs Mar 21 '14 at 07:37
    @smci: the answer is don't modify your script, set `PYTHONIOENCODING` if you are redirecting script's stdout in Python 2. – jfs Sep 23 '15 at 19:22
    @Glenn Maynard Actually decoding and encoding is a good practice, from the [python doc](https://docs.python.org/3/howto/unicode.html): "Software should only work with Unicode strings internally, decoding the input data as soon as possible and encoding the output only at the end." – HenriTel Nov 07 '17 at 17:20
168

First, regarding this solution:

# -*- coding: utf-8 -*-
print u"åäö".encode('utf-8')

It's not practical to explicitly print with a given encoding every time. That would be repetitive and error-prone.

A better solution is to change sys.stdout at the start of your program so that it encodes with a selected encoding. Here is one solution, found in the question "Python: How is sys.stdout.encoding chosen?", in particular a comment by "toka":

import sys
import codecs
sys.stdout = codecs.getwriter('utf8')(sys.stdout)
Craig McQueen
    unfortunately, changing sys.stdout to accept only unicode breaks a lot of libraries that expect it to accept encoded bytestrings. – nosklo Dec 04 '09 at 19:14
    nosklo: Then how can it work reliably and automaticly when output is a terminal? – Rasmus Kaj May 31 '10 at 15:36
    @Rasmus Kaj: just define your own unicode printing function and use it every time you want to print unicode: `def myprint(unicodeobj): print unicodeobj.encode('utf-8')` -- you automatically detect terminal encoding by inspecting `sys.stdout.encoding`, but you should consider the case where it is `None` (i.e. when redirecting output to a file) so you need a separate function anyway. – nosklo May 31 '10 at 20:46
    @nosklo: This does not make sys.stdout accept only Unicode. You can pass both str and unicode to a StreamWriter. – Glenn Maynard Apr 23 '12 at 23:30
  • And it'll screw up any readline capabilities of ``pdb`` or, I guess, ``IPython``, as @JohnChain stated. – vaab Feb 03 '15 at 04:37
    I assume this answer was intended for python2. **Be careful with this on code which is intended to support both python2 and python3**. For me it's breaking stuff when ran under python3. – wim Jul 15 '16 at 16:24
140

You may want to try setting the environment variable "PYTHONIOENCODING" to "utf_8". I have written a page on my ordeal with this problem.

Tl;dr of the blog post:

import sys, locale, os
print(sys.stdout.encoding)
print(sys.stdout.isatty())
print(locale.getpreferredencoding())
print(sys.getfilesystemencoding())
print(os.environ["PYTHONIOENCODING"])
print(chr(246), chr(9786), chr(9787))

gives you

utf_8
False
ANSI_X3.4-1968
ascii
utf_8
ö ☺ ☻
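The effect of PYTHONIOENCODING is easy to verify from Python itself by spawning a child interpreter whose stdout is a pipe (a sketch; the exact spelling of the reported encoding name can vary slightly between versions):

```python
import os
import subprocess
import sys

# The child's stdout is a pipe, so without help Python 2 would pick
# no encoding at all; PYTHONIOENCODING overrides that choice.
env = dict(os.environ, PYTHONIOENCODING='utf-8')
out = subprocess.check_output(
    [sys.executable, '-c', 'import sys; print(sys.stdout.encoding)'],
    env=env)
print(out.decode('ascii').strip())
```

Because the variable is read at interpreter start-up, it has to be set in the environment of the process that launches Python; it cannot be changed from inside the script.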
sophros
daveagp
    Changing sys.stdout.encoding maybe does not work, but changing sys.stdout does work: `sys.stdout = codecs.getwriter(encoding)(sys.stdout)`. This can be done from within the python program, so the user is not forced to set an env variable. – blueFast Oct 31 '13 at 07:43
    @jeckyll2hide: `PYTHONIOENCODING` does work. How bytes are interpreted as a text is defined by *user* environment. Your script shouldn't be assuming and dictate the user environment what character encoding to use. If Python doesn't pick up the settings automatically then `PYTHONIOENCODING` can be set for your script. You shouldn't need it unless the output is redirected to a file/pipe. – jfs Mar 21 '14 at 07:50
    +1. Honestly I think it's a Python bug. When I redirect output I want those same bytes that would be on the terminal, but in a file. Maybe it's not for everyone but it's a good default. Crashing hard with no explanation on a trivial operation that usually "just works" is a bad default. – SnakE Dec 12 '15 at 18:09
  • @SnakE: the only way I can rationalize why Python's implementation intentionally would enforce an iron-clad and permanent choice of encoding on stdout at startup time, might be in order to prevent any badly encoded stuff coming out later on. Or changing it is just an unimplemented feature, in which case allowing the user to change it later on would be a reasonable Python feature request. – daveagp Dec 13 '15 at 19:09
    @daveagp My point is, behavior of my program should not depend on whether it is redirected or not---unless I really want it, in which case I implement it myself. Python behaves contrary to my experience with any other console tools. This violates the least surprise principle. I consider this a design flaw unless there is a very strong rationale. – SnakE Dec 14 '15 at 19:43
  • @SnakE: Yeah, you have a good point. I looked at http://stackoverflow.com/questions/4545661 and an informative example is that I get different outputs for `python -c "import sys; print(sys.stdout.encoding)"` depending on whether the output goes to the terminal or is redirected. I read about the `isatty` function and that was clarifying too; I guess some programs benefit a lot from knowing what kind of output they have, but the flip side is that there is sometimes more state in there than ideal. – daveagp Dec 16 '15 at 05:11
  • This answer solved it for me. Thanks to it, I noticed that it's the caller script/shell of the Python script which should set UTF8. In my case it was a `shell_exec()` from PHP and `putenv('LANG=en_US.UTF-8');` solved it. – Basj Oct 14 '19 at 07:04
64
export PYTHONIOENCODING=utf-8

does the job, but it can't be set from within Python itself...

What we can do is check that it is set, and tell the user to set it before calling the script:

import sys

if __name__ == '__main__':
    if sys.stdout.encoding is None:
        print >> sys.stderr, "Please set PYTHONIOENCODING=UTF-8 before writing to stdout, e.g.: export PYTHONIOENCODING=UTF-8"
        exit(1)

Update, replying to the comment: the problem only exists when piping stdout. I tested on Fedora 25 with Python 2.7.13:

python --version
Python 2.7.13

cat b.py

#!/usr/bin/env python
#-*- coding: utf-8 -*-
import sys

print sys.stdout.encoding

running ./b.py

UTF-8

running ./b.py | less

None
Sérgio
    That check doesn't work in Python 2.7.13. `sys.stdout.encoding` is automatically set based on the `LC_CTYPE` locale value. – amphetamachine Jul 11 '17 at 15:00
    https://mail.python.org/pipermail/python-list/2011-June/605938.html the example there still work , i.e. when you use ./a.py > out.txt sys.stdout.encoding is None – Sérgio Jul 11 '17 at 15:10
  • I had a similar problem with a sync script from Backblaze B2 and export PYTHONIOENCODING=utf-8 solved my problem. Python 2.7 on Debian Stretch. – 0x3333 Mar 28 '19 at 13:41
7

I'm surprised this answer has not been posted here yet.

Since Python 3.7 you can change the encoding of standard streams with reconfigure():

sys.stdout.reconfigure(encoding='utf-8')

You can also modify how encoding errors are handled by adding an errors parameter.

https://stackoverflow.com/a/52372390/15675011
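For instance (a sketch; with errors='replace', anything the target encoding cannot represent is substituted rather than raising an exception):

```python
import sys

# Python 3.7+: re-wrap stdout with an explicit encoding; this works
# even when stdout is a pipe or is redirected to a file.
sys.stdout.reconfigure(encoding='utf-8', errors='replace')
print("åäö")
```

Unlike PYTHONIOENCODING, this can be done from inside the script, at any point before the first write.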

qz-
5

I had a similar issue last week. It was easy to fix in my IDE (PyCharm).

Here was my fix:

Starting from the PyCharm menu bar: File -> Settings... -> Editor -> File Encodings, then set "IDE Encoding", "Project Encoding" and "Default encoding for properties files" ALL to UTF-8, and it now works like a charm.

Hope this helps!

CLaFarge
5

Since Python 3.7, we can use Python's UTF-8 Mode via the command-line option -X utf8:

 python -X utf8 testzh.py

The script testzh.py contains

print("Content-type: text/html; charset=UTF-8\n") 
print("地球你好!")

To set up Python as the CGI script handler under IIS on Windows 10, we set the Executable to:

"C:\Program Files\Python39\python.exe" -X utf8 %s

With this, the Chinese ideograms render as expected in browsers such as Microsoft Edge; without UTF-8 mode, an encoding error occurs.

Please see https://docs.python.org/3/library/os.html#utf8-mode
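UTF-8 Mode can also be enabled with the PYTHONUTF8=1 environment variable instead of -X utf8. Either way it forces UTF-8 for the standard streams even when they are pipes, which can be checked with a small sketch like this:

```python
import subprocess
import sys

# With -X utf8 (UTF-8 Mode), the child interpreter reports UTF-8 on
# stdout even though its stdout here is a pipe, not a terminal.
out = subprocess.check_output(
    [sys.executable, '-X', 'utf8', '-c',
     'import sys; print(sys.stdout.encoding)'])
print(out.decode('ascii').strip())
```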

jacouh
4

Arguably a sanitized version of Craig McQueen's answer:

import sys, codecs
class EncodedOut:
    def __init__(self, enc):
        self.enc = enc
        self.stdout = sys.stdout
    def __enter__(self):
        if sys.stdout.encoding is None:
            w = codecs.getwriter(self.enc)
            sys.stdout = w(sys.stdout)
    def __exit__(self, exc_ty, exc_val, tb):
        sys.stdout = self.stdout

Usage:

with EncodedOut('utf-8'):
    print u'ÅÄÖåäö'
Tompa
3

I just thought I'd mention something here which I had to spend a long time experimenting with before I finally realised what was going on. It may be so obvious to everyone here that they haven't bothered mentioning it, but it would have helped me if they had, so on that principle...!

NB: I am using Jython specifically, v 2.7, so just possibly this may not apply to CPython...

NB2: the first two lines of my .py file here are:

# -*- coding: utf-8 -*-
from __future__ import print_function

The "%" (AKA "interpolation operator") string-construction mechanism causes additional problems too... If the default encoding of the "environment" is ASCII and you try to do something like

print( "bonjour, %s" % "fréd" )  # Call this "print A"

You will have no difficulty running in Eclipse... In a Windows CLI (DOS window) you will find that the encoding is code page 850 (my Windows 7 OS) or something similar, which can handle European accented characters at least, so it'll work.

print( u"bonjour, %s" % "fréd" ) # Call this "print B"

will also work.

If, OTOH, you direct to a file from the CLI, the stdout encoding will be None, which will default to ASCII (on my OS anyway), which will not be able to handle either of the above prints... (dreaded encoding error).

So then you might think of redirecting your stdout by using

sys.stdout = codecs.getwriter('utf8')(sys.stdout)

and try running in the CLI piping to a file... Very oddly, print A above will work... But print B above will throw the encoding error! The following will however work OK:

print( u"bonjour, " + "fréd" ) # Call this "print C"

The conclusion I have come to (provisionally) is that if a string which is specified to be Unicode using the "u" prefix is submitted to the %-handling mechanism, the byte-string operands appear to be decoded using the default environment encoding, regardless of whether you have set stdout to redirect!

How people deal with this is a matter of choice. I would welcome a Unicode expert to say why this happens, whether I've got it wrong in some way, what the preferred solution to this, whether it also applies to CPython, whether it happens in Python 3, etc., etc.
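A sketch of the workaround implied by "print C" above: decode the byte string explicitly before handing it to %, so no implicit ASCII decode takes place. (Written so it also runs under Python 3, where the initial encode() merely simulates bytes arriving from outside the program.)

```python
# -*- coding: utf-8 -*-
# On Python 2, u"..." % byte_string triggers an implicit ASCII decode
# of the byte string; decoding it explicitly avoids the error.
name_bytes = u"fréd".encode('utf-8')            # simulate bytes read from input
greeting = u"bonjour, %s" % name_bytes.decode('utf-8')
assert greeting == u"bonjour, fréd"
```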

Peter Mortensen
mike rodent
  • That's not odd, that's because `"fréd"` is a byte sequence and not a Unicode string, so the `codecs.getwriter` wrapper will leave it alone. You need a leading `u`, or `from __future__ import unicode_literals`. – Matthias Urlichs Nov 16 '14 at 06:33
  • @MatthiasUrlichs OK... thanks... But I just find encoding one of the most infuriating aspects of IT. Where do you get your understanding from? For example, I just posted another question about encoding here: https://stackoverflow.com/questions/44483067/passing-an-encoding-switch-to-the-jvm-for-a-gradle-javaexec-task: this is about Java, Eclipse, Cygwin & Gradle. If your expertise goes this far, please help... above all I'd like to know where to learn more! – mike rodent Jun 12 '17 at 18:59
3

I ran into this problem in a legacy application, and it was difficult to identify where and what was printed. I helped myself with this hack:

# encoding_utf8.py
import builtins


def print_utf8(fn):
    # Decorator: route everything passed to print() through a UTF-8 encode
    def print_fn(*args, **kwargs):
        return fn(str(*args).encode('utf-8'), **kwargs)
    return print_fn


builtins.print = print_utf8(print)

On top of my script, test.py:

import encoding_utf8
string = 'Axwell Λ Ingrosso'
print(string)

Note that this changes ALL calls to print to use an encoding, so your console will print this:

$ python test.py
b'Axwell \xce\x9b Ingrosso'
cessor
2

On Windows, I had this problem very often when running Python code from an editor (like Sublime Text), but not when running it from the command line.

In this case, check your editor's parameters. In the case of SublimeText, this Python.sublime-build solved it:

{
  "cmd": ["python", "-u", "$file"],
  "file_regex": "^[ ]*File \"(...*?)\", line ([0-9]*)",
  "selector": "source.python",
  "encoding": "utf8",
  "env": {"PYTHONIOENCODING": "utf-8", "LANG": "en_US.UTF-8"}
}
Basj
2

I could "automate" it with a call to:

def __fix_io_encoding(last_resort_default='UTF-8'):
    import sys
    if [x for x in (sys.stdin, sys.stdout, sys.stderr) if x.encoding is None]:
        import os
        defEnc = None
        try:
            import locale
            defEnc = locale.getpreferredencoding()
        except Exception:
            pass
        if defEnc is None:
            try:
                defEnc = sys.getfilesystemencoding()
            except Exception:
                pass
        if defEnc is None:
            try:
                defEnc = sys.stdin.encoding
            except Exception:
                pass
        if defEnc is None:
            defEnc = last_resort_default
        os.environ['PYTHONIOENCODING'] = os.environ.get("PYTHONIOENCODING", defEnc)
        # Re-exec the script so the interpreter restarts with the variable set
        os.execvpe(sys.argv[0], sys.argv, os.environ)

__fix_io_encoding()
del __fix_io_encoding

Yes, it's possible to get an infinite loop here if this "setenv" fails.

jno