179

I have many "can't encode" and "can't decode" problems with Python when I run my applications from the console. But in the Eclipse PyDev IDE, the default character encoding is set to UTF-8, and I'm fine.

I searched for a way to set the default encoding, and people say that Python deletes the sys.setdefaultencoding function on startup, so we cannot use it.

So what's the best solution for it?

Peter Mortensen
Ali Nadalizadeh
  • 1
    See the blog post *[The Illusive setdefaultencoding](http://blog.ianbicking.org/illusive-setdefaultencoding.html)*. – djc Feb 16 '10 at 20:49
  • 3
    `The best solution is to learn to use encode and decode correctly instead of using hacks.` This was certainly possible with *python2* at the cost of always remembering to do so / consistently using your own interface. My experience suggests that this becomes highly problematic when you are writing code that you want to work with both python2 and python3. – Att Righ May 25 '17 at 20:38

14 Answers

175

Here is a simpler method (hack) that gives you back the setdefaultencoding() function that was deleted from sys:

import sys
# sys.setdefaultencoding() does not exist, here!
reload(sys)  # Reload does the trick!
sys.setdefaultencoding('UTF8')

(Note for Python 3.4+: reload() is in the importlib library.)

This is not a safe thing to do, though: this is obviously a hack, since sys.setdefaultencoding() is purposely removed from sys when Python starts. Reenabling it and changing the default encoding can break code that relies on ASCII being the default (this code can be third-party, which would generally make fixing it impossible or dangerous).

PS: This hack doesn't seem to work with Python 3.9 anymore.

Eric O. Lebigot
  • 11
    I downvoted, because that answer doesn't help for running existing applications (which is one way to interpret the question), is wrong when you are writing/maintaining an application and dangerous when writing a library. The right way is to set `LC_CTYPE` (or in an application, check whether it is set right and abort with a meaningful error message). – ibotty Aug 09 '15 at 19:33
  • @ibotty I do agree that this answer is a hack and that it is dangerous to use it. It does answer the question, though ("Changing default encoding of Python?"). Do you have a reference about the effect of the environment variable LC_CTYPE on the Python interpreter? – Eric O. Lebigot Aug 10 '15 at 00:16
  • 1
    well, it did not mention, it's a hack at first. other than that, dangerous answers that lack any mention that they are, are not helpful. – ibotty Aug 10 '15 at 11:46
  • @EOL, and the references re LC_CTYPE: https://docs.python.org/2/library/locale.html and https://docs.python.org/3/library/locale.html – ibotty Aug 10 '15 at 11:49
  • @ibotty `LC_CTYPE` is independent from the "default encoding of Python" that the question refers to (`sys.getdefaultencoding()` returns `ascii` for me, with `LC_CTYPE=en_US.UTF-8`). So, what (different) problem that you have in mind does `LC_CTYPE` solve? – Eric O. Lebigot Aug 10 '15 at 17:55
  • 1
    @EOL you are right. It does affect the preferred encoding though (in Python 2 and 3): `LC_CTYPE=C python -c 'import locale; print( locale.getpreferredencoding())'` – ibotty Aug 11 '15 at 08:05
  • 1
    @user2394901 The use of sys.setdefaultencoding() has always been discouraged!! And the encoding of py3k is hard-wired to "utf-8" and changing it raises an error. – Marlon Abeykoon Jun 07 '16 at 09:31
  • I'm using ipython notebooks, and in my case, as soon as I execute the hack the print function no longer prints to stdout. – kiril Sep 16 '16 at 13:11
  • … and print still does not work when setting back the default encoding to 'ascii'… This smells like some magic done by the notebook… In any case, this solution is only a hack, and can thus break. – Eric O. Lebigot Sep 21 '16 at 17:15
  • Gives me `NameError: name 'reload' is not defined` – Superdooperhero Feb 13 '20 at 17:27
  • In Python 3.4+, you need to use the `reload()` from the `importlib` library. – Eric O. Lebigot Mar 01 '20 at 09:08
  • 1
    even after the reload: 'sys' has no attribute 'setdefaultencoding' – negstek Aug 13 '22 at 19:53
  • 1
    What version of Python are you using? I am observing the same thing with Python 3.9.13 (so I updated this old answer). – Eric O. Lebigot Aug 14 '22 at 20:06
97

If you get this error when you pipe or redirect the output of your script:

UnicodeEncodeError: 'ascii' codec can't encode characters in position 0-5: ordinal not in range(128)

just export PYTHONIOENCODING in the console and then run your code:

export PYTHONIOENCODING=utf8
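
For readers on Python 3, the effect of the variable can be checked from a parent script. The following is a minimal sketch, not part of the original answer: the helper name `child_stdout_encoding` is made up for illustration, and it assumes Python 3.5+ (for `subprocess.run`). It starts a child interpreter with its output piped and reports which encoding the child chose for `sys.stdout`.

```python
import os
import subprocess
import sys


def child_stdout_encoding(io_encoding=None):
    """Return sys.stdout.encoding as seen by a child interpreter whose stdout is a pipe.

    `io_encoding` is the value exported as PYTHONIOENCODING (None leaves it unset).
    The function name is hypothetical, chosen for this illustration only.
    """
    env = dict(os.environ)
    env.pop("PYTHONIOENCODING", None)  # start from a clean slate
    if io_encoding is not None:
        env["PYTHONIOENCODING"] = io_encoding
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Encoding names are plain ASCII, so decoding the raw bytes this way is safe.
    return result.stdout.decode("ascii").strip()


if __name__ == "__main__":
    print("without PYTHONIOENCODING:", child_stdout_encoding())
    print("with PYTHONIOENCODING=utf8:", child_stdout_encoding("utf8"))
```

What the first line prints depends on the Python version and the locale of the machine, which is exactly why exporting the variable makes piped output predictable.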
Neuron
iman
  • 3
    This is the only solution that made any difference for me. - I'm on Debian 7, with broken locale settings. Thanks. – Pryo Feb 09 '15 at 10:47
  • 4
    Set `LC_CTYPE` to something sensible instead. It makes all the other programs happy as well. – ibotty Jun 17 '15 at 09:40
  • 7
    A bigger bug in Python3 is, that `PYTHONIOENCODING=utf8` is not the default. This makes scripts break just because `LC_ALL=C` – Tino Sep 27 '15 at 23:28
  • `Set LC_CTYPE to something sensible instead` This is a reasonable suggestion. This doesn't work so well when you are trying to distribute code that *just works* on another person's system. – Att Righ May 25 '17 at 20:43
  • Debian and Redhat OSes use a `C.utf8` locale to provide more sensible C. glibc upstream is working on adding it, so perhaps we should not be blaming Python for respecting locale settings…? – Mingye Wang Mar 13 '18 at 19:05
  • Thank you so much! Spent hours online and eventually came across with your response. Adding that line into `.bashrc` of Cygwin addresses the encoding problem with Python. – misaligar Jun 29 '19 at 03:08
  • Just to say that I have no problem with pure python 2.7. Only got it when I use python 2.7 with __future__, so I had to use those solutions (PYTHONIOENCODING or sys.reload). No problem neither with python 3. – Eric H. Aug 20 '20 at 08:32
  • Note that Heroku Dynos do NOT have the default encoding set to utf8, which makes for a frustrating debugging experience (cough cough 4 hours until I found this answer...). This solves the problem on Heroku. – The Aelfinn Jan 24 '21 at 01:01
53

A) To control sys.getdefaultencoding() output:

python -c 'import sys; print(sys.getdefaultencoding())'

ascii

Then

echo "import sys; sys.setdefaultencoding('utf-16-be')" > sitecustomize.py

and

PYTHONPATH=".:$PYTHONPATH" python -c 'import sys; print(sys.getdefaultencoding())'

utf-16-be

You could put your sitecustomize.py higher in your PYTHONPATH.

Also you might like to try reload(sys).setdefaultencoding by @EOL

B) To control stdin.encoding and stdout.encoding you want to set PYTHONIOENCODING:

python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

ascii ascii

Then

PYTHONIOENCODING="utf-16-be" python -c 'import sys; print(sys.stdin.encoding, sys.stdout.encoding)'

utf-16-be utf-16-be

Finally: you can use A) or B) or both!
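
To see which of these settings are in effect in a given environment, a small diagnostic can help. This is a sketch under assumptions rather than part of the answer: `encoding_report` is a hypothetical helper name, and it targets Python 3, where `sys.getdefaultencoding()` always returns `utf-8`, so part A only matters on Python 2.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python is currently using, keyed by where they apply.

    Hypothetical helper for illustration; the keys are arbitrary labels.
    """
    return {
        # Used for implicit str <-> bytes conversions (always utf-8 on Python 3).
        "default": sys.getdefaultencoding(),
        # Used to encode and decode file names.
        "filesystem": sys.getfilesystemencoding(),
        # The locale's preferred encoding, the usual default for text-mode open().
        "locale": locale.getpreferredencoding(False),
        # Stream encodings; these are what PYTHONIOENCODING overrides.
        "stdin": getattr(sys.stdin, "encoding", None),
        "stdout": getattr(sys.stdout, "encoding", None),
    }


if __name__ == "__main__":
    for name, value in encoding_report().items():
        print("{:<12}{}".format(name, value))
```

Running it once directly and once with the output piped shows whether the stream encodings change with the destination.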

Community
lukmdo
  • (python2 only) separate but interesting is extending above with `from __future__ import unicode_literals` see [discussion](http://python-future.org/imports.html#unicode-literals) – lukmdo Feb 04 '15 at 00:34
18

Starting with PyDev 3.4.1, the default encoding is no longer changed. See this ticket for details.

For earlier versions, a solution is to make sure PyDev does not run with UTF-8 as the default encoding. In Eclipse, open the run dialog settings ("Run Configurations", if I remember correctly); you can choose the default encoding on the Common tab. Change it to US-ASCII if you want to get these errors 'early' (in other words: in your PyDev environment). Also see an original blog post for this workaround.

Peter Mortensen
ChristopheD
  • 1
    Thanks Chris. Especially considering Mark T's comment above, your answer seems to be the most appropriate to me. And for somebody who's not primarily an Eclipse/PyDev user, I never would have figured that out on my own. – Sean Apr 30 '11 at 00:40
  • I'd like to change this globally (rather than once per run configuration), but haven't figured out how - have asked a separate q: http://stackoverflow.com/questions/9394277/is-there-a-way-to-change-the-default-encoding-for-all-run-configurations-within – Tim Diggins Feb 22 '12 at 11:58
13

Regarding python2 (and python2 only), some of the former answers rely on using the following hack:

import sys
reload(sys)  # Reload is a hack
sys.setdefaultencoding('UTF8')

Its use is discouraged (check this or this).

In my case, it comes with a side effect: I'm using ipython notebooks, and once I run the code the `print` function no longer works. I guess there would be a solution to it, but I still think using the hack should not be the correct option.

After trying many options, the one that worked for me was using the same code in the sitecustomize.py, where that piece of code is meant to be. After evaluating that module, the setdefaultencoding function is removed from sys.

So the solution is to append the following code to the file /usr/lib/python2.7/sitecustomize.py:

import sys
sys.setdefaultencoding('UTF8')

When I use virtualenvwrapper the file I edit is ~/.virtualenvs/venv-name/lib/python2.7/sitecustomize.py.

And when I use with python notebooks and conda, it is ~/anaconda2/lib/python2.7/sitecustomize.py

Community
kiril
8

There is an insightful blog post about it.

See https://anonbadger.wordpress.com/2015/06/16/why-sys-setdefaultencoding-will-break-code/.

I paraphrase its content below.

In Python 2, which was not as strongly typed regarding the encoding of strings, you could perform operations on differently encoded strings and succeed. E.g. the following would return True.

u'Toshio' == 'Toshio'

That would hold for every (normal, unprefixed) string that was encoded in sys.getdefaultencoding(), which defaulted to ascii, but not others.

The default encoding was meant to be changed system-wide in site.py, but not somewhere else. The hacks (also presented here) to set it in user modules were just that: hacks, not the solution.

Python 3 changed the system encoding to default to utf-8 (when LC_CTYPE is unicode-aware), but the fundamental problem was solved by the requirement to explicitly encode "byte" strings whenever they are used with unicode strings.
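
The Python 3 behaviour described above can be seen in a few lines. This is a small illustrative sketch (the variable and function names are made up, and it assumes the interpreter is run without the `-b` flag, which turns str/bytes comparisons into warnings or errors):

```python
text = "Toshio"    # str: a sequence of Unicode code points
data = b"Toshio"   # bytes: raw encoded data


def mixes_without_error(text, data):
    """Return True if str + bytes succeeds, False if Python refuses with TypeError."""
    try:
        text + data
    except TypeError:
        return False
    return True


# Unlike Python 2, the two types never compare equal implicitly.
print(text == data)                   # False

# An explicit decode (or encode) makes the conversion visible.
print(text == data.decode("ascii"))   # True
print(text.encode("ascii") == data)   # True

# Mixing the two types in an operation raises instead of guessing an encoding.
print(mixes_without_error(text, data))  # False
```

Because no implicit conversion happens, there is no default encoding left to tune for this purpose in Python 3.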

ibotty
5

First: calling reload(sys) and setting some arbitrary default encoding just to satisfy the needs of an output terminal stream is bad practice. reload often resets things in sys that were put in place depending on the environment, e.g. the sys.stdin/stdout streams, sys.excepthook, etc.

Solving the encode problem on stdout

The best solution I know for the encode problem of printing unicode strings and beyond-ASCII strs (e.g. from literals) on sys.stdout is to provide a sys.stdout (file-like object) that is capable and, optionally, tolerant with regard to these needs:

  • When sys.stdout.encoding is None for some reason, missing, wrongly set, or "less" than what the stdout terminal or stream is really capable of, try to provide a correct .encoding attribute, as a last resort by replacing sys.stdout and sys.stderr with a translating file-like object.

  • When the terminal / stream still cannot encode all occurring unicode characters, and you don't want print calls to break just because of that, you can introduce an encode-with-replace behavior in the translating file-like object.

Here is an example:

#!/usr/bin/env python
# encoding: utf-8
import sys

class SmartStdout:
    def __init__(self, encoding=None, org_stdout=None):
        if org_stdout is None:
            org_stdout = getattr(sys.stdout, 'org_stdout', sys.stdout)
        self.org_stdout = org_stdout
        self.encoding = encoding or \
                        getattr(org_stdout, 'encoding', None) or 'utf-8'
    def write(self, s):
        self.org_stdout.write(s.encode(self.encoding, 'backslashreplace'))
    def __getattr__(self, name):
        return getattr(self.org_stdout, name)

if __name__ == '__main__':
    if sys.stdout.isatty():
        sys.stdout = sys.stderr = SmartStdout()

    us = u'aouäöüфżß²'
    print us
    sys.stdout.flush()
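
The class above is Python 2 code. On Python 3 the same encode-with-replace idea can be sketched with the standard `io.TextIOWrapper` and its `errors` argument. This is an illustration under assumptions, not the author's method: `tolerant_writer` is a hypothetical helper, and an in-memory `io.BytesIO` stands in for a terminal that only understands ASCII.

```python
import io


def tolerant_writer(raw, encoding="ascii"):
    """Wrap a binary stream so unencodable characters are escaped, not fatal.

    Hypothetical helper; errors="backslashreplace" is the same error handler
    that SmartStdout passes to str.encode() above.
    """
    return io.TextIOWrapper(raw, encoding=encoding, errors="backslashreplace")


if __name__ == "__main__":
    raw = io.BytesIO()          # stands in for a limited, ASCII-only terminal
    out = tolerant_writer(raw)
    out.write("aouäöü")
    out.flush()
    print(raw.getvalue())       # b'aou\\xe4\\xf6\\xfc'

    # For the real stdout, Python 3.7+ offers an in-place equivalent:
    #     sys.stdout.reconfigure(errors="backslashreplace")
```

The characters that ASCII cannot represent come out as escape sequences, so a print never fails because of the terminal's limits.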

Using beyond-ascii plain string literals in Python 2 / 2 + 3 code

The only good reason to change the global default encoding (and only to UTF-8), I think, is an application source code decision, not I/O stream encoding issues: it lets you write beyond-ASCII string literals in code without being forced to always use u'string' style unicode literals. This can be done rather consistently (despite what anonbadger's article says) by maintaining a Python 2 or Python 2 + 3 source code base that uses ASCII or UTF-8 plain string literals consistently, as far as those strings potentially undergo silent unicode conversion, move between modules, or go to stdout. For that, prefer "# encoding: utf-8" or ASCII (no declaration). Change or drop libraries that still fatally rely on ASCII default encoding errors beyond chr #127 (which is rare today).

In addition to the SmartStdout scheme above, do this at application start (and/or via sitecustomize.py), without using reload(sys):

...
def set_defaultencoding_globally(encoding='utf-8'):
    assert sys.getdefaultencoding() in ('ascii', 'mbcs', encoding)
    import imp
    _sys_org = imp.load_dynamic('_sys_org', 'sys')
    _sys_org.setdefaultencoding(encoding)

if __name__ == '__main__':
    sys.stdout = sys.stderr = SmartStdout()
    set_defaultencoding_globally('utf-8') 
    s = 'aouäöüфżß²'
    print s

This way, string literals and most operations (except character iteration) work comfortably without thinking about unicode conversion, as if there were only Python 3. File I/O, of course, always needs special care regarding encodings, as it does in Python 3.

Note: plain strings are then implicitly converted from UTF-8 to unicode in SmartStdout before being converted to the output stream encoding.

kxr
5

Here is the approach I used to produce code that was compatible with both python2 and python3 and always produced utf8 output. I found this answer elsewhere, but I can't remember the source.

This approach works by replacing sys.stdout with something that isn't quite file-like (but still only uses things in the standard library). This may well cause problems for your underlying libraries, but in the simple case where you have good control over how sys.stdout is used throughout your framework, this can be a reasonable approach.

sys.stdout = io.open(sys.stdout.fileno(), 'w', encoding='utf8')
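
A slightly more defensive variant of that one-liner, written as a sketch: `utf8_writer` is a hypothetical helper name, and passing `closefd=False` is an addition not in the original, intended to keep the underlying descriptor open if the wrapper is closed or garbage-collected. The demonstration writes to a pipe rather than the real stdout so that the resulting bytes can be inspected.

```python
import io
import os


def utf8_writer(fd):
    """Open a UTF-8 text writer on an existing file descriptor.

    closefd=False leaves the descriptor open when the writer is closed,
    which matters if `fd` is the process's real stdout.
    """
    return io.open(fd, "w", encoding="utf8", closefd=False)


if __name__ == "__main__":
    read_fd, write_fd = os.pipe()
    writer = utf8_writer(write_fd)
    writer.write("фżß²")
    writer.close()              # flushes; write_fd itself stays open
    os.close(write_fd)
    print(repr(os.read(read_fd, 64)))
    os.close(read_fd)

    # Applied to stdout, this would read:
    #     sys.stdout = utf8_writer(sys.stdout.fileno())
```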
Att Righ
2

This is a quick hack for anyone who is (1) on a Windows platform, (2) running Python 2.7, and (3) annoyed because a nice piece of software (i.e., not written by you, so not immediately a candidate for encode/decode printing maneuvers) won't display the "pretty unicode characters" in the IDLE environment (Pythonwin prints unicode fine). An example is the neat first-order logic symbols that Stephan Boyer uses in the output of his pedagogic prover, First Order Logic Prover.

I didn't like the idea of forcing a sys reload and I couldn't get the system to cooperate with setting environment variables like PYTHONIOENCODING (tried direct Windows environment variable and also dropping that in a sitecustomize.py in site-packages as a one liner ='utf-8').

So, if you are willing to hack your way to success, go to your IDLE directory, typically: "C:\Python27\Lib\idlelib" Locate the file IOBinding.py. Make a copy of that file and store it somewhere else so you can revert to original behavior when you choose. Open the file in the idlelib with an editor (e.g., IDLE). Go to this code area:

# Encoding for file names
filesystemencoding = sys.getfilesystemencoding()

encoding = "ascii"
if sys.platform == 'win32':
    # On Windows, we could use "mbcs". However, to give the user
    # a portable encoding name, we need to find the code page 
    try:
        # --> 6/5/17 hack to force IDLE to display utf-8 rather than cp1252
        # --> encoding = locale.getdefaultlocale()[1]
        encoding = 'utf-8'
        codecs.lookup(encoding)
    except LookupError:
        pass

In other words, comment out the original code line following the 'try' that set the encoding variable to locale.getdefaultlocale()[1] (because that will give you cp1252, which you don't want) and instead brute-force it to 'utf-8' (by adding the line encoding = 'utf-8' as shown).

I believe this only affects IDLE display to stdout and not the encoding used for file names etc. (that is obtained in the filesystemencoding prior). If you have a problem with any other code you run in IDLE later, just replace the IOBinding.py file with the original unmodified file.

1

You could change the encoding of your entire operating system. On Ubuntu you can do this with

sudo apt install locales 
sudo locale-gen en_US en_US.UTF-8    
sudo dpkg-reconfigure locales
Boris Verkhovskiy
0

This fixed the issue for me.

import os
os.environ["PYTHONIOENCODING"] = "utf-8"
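
One caveat, shown here as a hedged sketch (the function name is hypothetical): PYTHONIOENCODING is read when the interpreter starts, so assigning it from inside a running script does not change that script's own `sys.stdout`. Only processes started afterwards, which inherit the modified environment, see the new value.

```python
import os
import subprocess
import sys


def set_io_encoding_in_process(value):
    """Set PYTHONIOENCODING from inside Python; return stdout's encoding before and after."""
    before = getattr(sys.stdout, "encoding", None)
    os.environ["PYTHONIOENCODING"] = value
    after = getattr(sys.stdout, "encoding", None)
    return before, after


if __name__ == "__main__":
    before, after = set_io_encoding_in_process("utf-8")
    # The streams of the running interpreter were already set up at startup.
    print("unchanged in this process:", before == after)

    # A child process inherits the modified environment and honours it.
    child = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.stdout.encoding)"]
    )
    print("child stdout encoding:", child.decode("ascii").strip())
```

This matches the comment below reporting that the variable only helped when exported in the shell before starting Python.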
twasbrillig
  • Did not for me. But it worked when I exported the variable in the shell before entering python, or used reload(sys); sys.setdefaultencoding("utf-8"). – Eric H. Aug 20 '20 at 08:27
0

Set the default encoding of the OS to UTF-8. E.g., on Ubuntu, edit the file /etc/default/locale and set:

LANG=en_US.UTF-8
LANGUAGE=en_US.UTF-8
LC_ALL=en_US.UTF-8
0

If you only want stable UTF-8 support for file reads and writes without repeating the same encoding declaration everywhere, here are two solutions:

1. Patch the io module at runtime (dangerous; use at your own risk)

import pathlib
import tempfile

import chardet


def patchIOWithUtf8Default():
    import builtins
    import importlib.util
    import sys
    spec = importlib.util.find_spec("io")
    module = importlib.util.module_from_spec(spec)
    exec(compile(spec.loader.get_source(spec.name) + """
    def open(*args, **kwargs):
        args = list(args)
        mode = kwargs.get('mode', (args + [''])[1])
        if (len(args) < 4 and 'b' not in mode) or 'encoding' in kwargs:
            kwargs['encoding'] = 'utf8'
        elif len(args) >= 4 and args[3] is None:
            args[3] = 'utf8'
        return _io.open(*args, **kwargs)
    """, module.__spec__.origin, "exec"), module.__dict__)
    sys.modules[module.__name__] = module
    builtins.open = __import__("io").open
    importlib.reload(importlib.import_module("pathlib"))


def main():
    patchIOWithUtf8Default()
    filename = tempfile.mktemp()
    text = "Common\n常\nSense\n识\n天地玄黄"
    print("Original text:", repr(text))
    pathlib.Path(filename).write_text(text)
    encoding = chardet.detect(open(filename, mode="rb").read())["encoding"]
    print("Written encoding by pathlib:", encoding)
    print("Written text by pathlib:", repr(open(filename, newline="", encoding=encoding).read()))


if __name__ == '__main__':
    main()

Sample output:

Original text: 'Common\n常\nSense\n识\n天地玄黄'
Written encoding by pathlib: utf-8
Written text by pathlib: 'Common\r\n常\r\nSense\r\n识\r\n天地玄黄'

2. Use a third-party library as a pathlib wrapper

https://github.com/baijifeilong/IceSpringPathLib

pip install IceSpringPathLib

import pathlib
import tempfile

import chardet

import IceSpringPathLib

filename = tempfile.mktemp()
text = "Common\n常\nSense\n识\n天地玄黄"
print("Original text:", repr(text))

pathlib.Path(filename).write_text(text)
encoding = chardet.detect(open(filename, mode="rb").read())["encoding"]
print("\nWritten text by pathlib:", repr(open(filename, newline="", encoding=encoding).read()))
print("Written encoding by pathlib:", encoding)

IceSpringPathLib.Path(filename).write_text(text)
encoding = chardet.detect(open(filename, mode="rb").read())["encoding"]
print("\nWritten text by IceSpringPathLib:", repr(open(filename, newline="", encoding=encoding).read()))
print("Written encoding by IceSpringPathLib:", encoding)

Sample output:

Original text: 'Common\n常\nSense\n识\n天地玄黄'

Written text by pathlib: 'Common\r\n常\r\nSense\r\n识\r\n天地玄黄'
Written encoding by pathlib: GB2312

Written text by IceSpringPathLib: 'Common\n常\nSense\n识\n天地玄黄'
Written encoding by IceSpringPathLib: utf-8
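
For comparison, the declaration that both solutions try to avoid is the `encoding` argument that `pathlib.Path.write_text` and `read_text` accept in the standard library. A minimal sketch follows; the helper name `roundtrip_utf8` is made up for illustration, and the sample text deliberately contains no newlines so that platform newline translation does not affect the bytes.

```python
import os
import pathlib
import tempfile


def roundtrip_utf8(text):
    """Write `text` with an explicit UTF-8 encoding and return the raw bytes on disk."""
    fd, name = tempfile.mkstemp()   # create a scratch file securely
    os.close(fd)                    # pathlib reopens it by name
    path = pathlib.Path(name)
    try:
        # Passing encoding= overrides the locale-dependent default.
        path.write_text(text, encoding="utf-8")
        assert path.read_text(encoding="utf-8") == text
        return path.read_bytes()
    finally:
        path.unlink()


if __name__ == "__main__":
    text = "常识 天地玄黄"
    print(roundtrip_utf8(text) == text.encode("utf-8"))
```

The explicit argument is verbose but has no global side effects, which is the trade-off against the two approaches above.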
BaiJiFeiLong
0

On Windows, set the environment variable PYTHONUTF8=1 (supported by Python 3.7 and later).
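
Whether the variable took effect can be checked through `sys.flags.utf8_mode`. The sketch below uses a hypothetical helper, `child_utf8_mode`, and assumes Python 3.7+; it starts a child interpreter with the variable set and reads the flag back.

```python
import os
import subprocess
import sys


def child_utf8_mode(value):
    """Run a child interpreter with PYTHONUTF8=value and return its sys.flags.utf8_mode."""
    env = dict(os.environ, PYTHONUTF8=value)
    out = subprocess.check_output(
        [sys.executable, "-c", "import sys; print(sys.flags.utf8_mode)"],
        env=env,
    )
    return int(out.decode("ascii").strip())


if __name__ == "__main__":
    print("PYTHONUTF8=1 ->", child_utf8_mode("1"))
    print("PYTHONUTF8=0 ->", child_utf8_mode("0"))
```

The same mode can also be requested for a single run with the `-X utf8` command-line option.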

walkman