56

I'm running a recent Linux system where all my locales are UTF-8:

LANG=de_DE.UTF-8
LANGUAGE=
LC_CTYPE="de_DE.UTF-8"
LC_NUMERIC="de_DE.UTF-8"
LC_TIME="de_DE.UTF-8"
...
LC_IDENTIFICATION="de_DE.UTF-8"
LC_ALL=

Now I want to write UTF-8 encoded content to the console.

Right now Python uses UTF-8 for the FS encoding but sticks to ASCII for the default encoding :-(

>>> import sys
>>> sys.getdefaultencoding()
'ascii'
>>> sys.getfilesystemencoding()
'UTF-8'

I thought the best (clean) way to do this was setting the PYTHONIOENCODING environment variable. But it seems that Python ignores it. At least on my system I keep getting ascii as default encoding, even after setting the envvar.

# tried this in ~/.bashrc and ~/.profile (also sourced them)
# and on the commandline before running python
export PYTHONIOENCODING=UTF-8

If I do the following at the start of a script, it works though:

>>> import sys
>>> reload(sys)  # to enable `setdefaultencoding` again
<module 'sys' (built-in)>
>>> sys.setdefaultencoding("UTF-8")
>>> sys.getdefaultencoding()
'UTF-8'

But that approach seems unclean. So, what's a good way to accomplish this?

Workaround

Instead of changing the default encoding - which is not a good idea (see mesilliac's answer) - I just wrap sys.stdout with a StreamWriter like this:

import codecs, locale, sys
sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)

See this gist for a small utility function, that handles it.
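
The one-liner above can be tried without touching the real `sys.stdout`. The following is a minimal sketch, not the code from the gist: `wrap_byte_stream` is a hypothetical helper name, and an `io.BytesIO` buffer stands in for a byte-oriented stdout such as the one Python 2 provides.

```python
import codecs
import io
import locale


def wrap_byte_stream(stream, encoding=None):
    """Return a writer that encodes text before passing it on to `stream`."""
    # Fall back to the locale's preferred encoding, as the one-liner does.
    encoding = encoding or locale.getpreferredencoding()
    return codecs.getwriter(encoding)(stream)


raw = io.BytesIO()  # stand-in for a byte-oriented stdout
out = wrap_byte_stream(raw, "utf-8")
out.write(u"Gr\u00fc\u00dfe \u20ac\n")  # text goes in ...
encoded = raw.getvalue()  # ... UTF-8 bytes come out
```

In a real Python 2 script the wrapped object would be assigned back to `sys.stdout`, as in the workaround above.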

wovano
Brutus
  • 1
    Perhaps this will work: `#!/usr/bin/env python` followed by `# -*- coding: utf-8 -*-` – chessweb Jul 31 '12 at 14:10
  • And remember to put it at the very head of the source file. – starrify Jul 31 '12 at 14:30
  • 8
    That only affects how Python interprets literal strings in the source code. The IO encoding will still be ASCII. – Keith Jul 31 '12 at 14:33
  • 8
    `PYTHONIOENCODING` is not ignored; it's just that, as its name suggests, it [affects the encoding used for stdin/stdout/stderr](https://docs.python.org/2/using/cmdline.html#environment-variables), which is not what you're checking with [`sys.getdefaultencoding()`](https://docs.python.org/2/library/sys.html#sys.getdefaultencoding). – musiphil Dec 06 '14 at 22:34
  • @musiphil: True, [sys.getdefaultencoding](https://docs.python.org/2/library/sys.html#sys.getdefaultencoding) reports something else. But since *override(ing) the encoding used for stdin/stdout/stderr* was exactly what I was looking for, I tried to change `PYTHONIOENCODING`, which doesn't work. – Brutus Dec 08 '14 at 14:09
  • 3
    @Brutus: How did you test that it doesn't work? It seems to work for me. `python -c 'import sys; print sys.stdout.encoding'` gives `UTF-8`, and `PYTHONIOENCODING='C' python -c 'import sys; print sys.stdout.encoding'` gives `C`. – musiphil Dec 08 '14 at 18:54
  • I tried to `print` UTF-8 strings to the terminal (`sys.stdout`) and it bailed (it threw some encoding error; I don't remember exactly, it was 2 years ago), hence I wrote that [wrapper](https://gist.github.com/brutus/6c90b2342ac63054e12d). Just tried it again and not only does it work (no exception), the default encoding seems to be `UTF-8` without me needing to set anything (Python 2.7.6 on Ubuntu 14.04). – Brutus Dec 09 '14 at 12:32
  • Your locale is used to determine which encoding `sys.stdout.encoding` is set to. Incorrectly installed locales can lead to `sys.stdout.encoding` being set to `ASCII`. `$ locale` should return without errors. – Alastair McCormack Nov 28 '15 at 23:24

5 Answers

29

It seems accomplishing this is not recommended.

Fedora suggested using the system locale as the default, but apparently this breaks other things.

Here's a quote from the mailing-list discussion:

The only supported default encodings in Python are:

 Python 2.x: ASCII
 Python 3.x: UTF-8

If you change these, you are on your own and strange things will
start to happen. The default encoding does not only affect
the translation between Python and the outside world, but also
all internal conversions between 8-bit strings and Unicode.

Hacks like what's happening in the pango module (setting the
default encoding to 'utf-8' by reloading the site module in
order to get the sys.setdefaultencoding() API back) are just
downright wrong and will cause serious problems since Unicode
objects cache their default encoded representation.

Please don't enable the use of a locale based default encoding.

If all you want to achieve is getting the encodings of
stdout and stdin correctly setup for pipes, you should
instead change the .encoding attribute of those (only).

-- 
Marc-Andre Lemburg
eGenix.com
Nisse Engström
mesilliac
24

This is how I do it:

#!/usr/bin/python2.7 -S

import sys
sys.setdefaultencoding("utf-8")
import site

Note the -S in the shebang line. It tells Python not to import the site module automatically. The site module is what sets the default encoding and then removes the method so it can't be set again, but it will honor what is already set.

Keith
  • Could you expand on this based on the answer that mesilliac gave? Is it still correct? – Arafangion Aug 27 '13 at 06:10
  • 1
    @Arafangion The method I use happens right at the very beginning of Python initialization. No caches have been created yet. I agree that using the reload trick is bad. This is because lots of other things may have already been instantiated or cached the original encoding. Thus I came up with this method which happens early. Note that no other imports are before it. It works for me. – Keith Aug 28 '13 at 05:24
  • While this has worked for me in tests, **I decided to avoid it**. It's just too unclear whether I may run into side effects, and it smells kinda fishy ;-) I just wrap `sys.stdout` in a `StreamWriter` with the default encoding (which should be UTF-8, at least on modern Linux systems): `sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout)`. – Brutus Apr 29 '14 at 09:59
  • 6
    This is a really bad idea. I've solved two questions in the last few weeks that were resolved by removing `sys.setdefaultencoding("utf-8")` from the user's code. IMHO, this just masks any underlying issues – Alastair McCormack Nov 28 '15 at 23:15
  • @AlastairMcCormack I've used this without any problems. As long as you know what's going on it's not a problem. What are the underlying issues that you think it masks? – Keith Nov 28 '15 at 23:47
  • @Keith, users are finding and using this hack to patch over console printing exceptions and other simple issues. I can't see a valid reason for using this unless you're working with 3rd party code which is assuming decoding/encoding – Alastair McCormack Nov 29 '15 at 00:01
  • @AlastairMcCormack I use when interfacing to databases that are UTF-8 encoded. I also use UTF-8 terminals so that's not a problem for me, either. It works well on custom built Linux systems where you can control all the encoding end-to-end. – Keith Nov 29 '15 at 00:05
  • This for example: http://stackoverflow.com/questions/34007540/unicode-encode-error-python – Alastair McCormack Nov 30 '15 at 21:17
  • Yes, I've seen that many times. But *I* don't have that problem. That link isn't quite the right way to do it, either. I always use UTF-8 everywhere, and Python was the last bit that needed to be changed like this. In my scenario it works great. But I agree it could be a problem with some that don't really understand what is going on. – Keith Nov 30 '15 at 23:39
  • This works like a charm, but I wouldn't suggest it for non-trivial programs. If you're using custom code and rely on non-ASCII content, though, consider a language that defaults to UTF-8. – buckaroo1177125 May 14 '16 at 01:11
  • I can't emphasise enough **how bad an idea this is**. This code alters a global default for how Python 2 coerces between Unicode and bytestrings, affecting **all code running in the interpreter**. This will break any code, including code in 3rd-party libraries, that relies on the default ASCII codec to signal that it has been handed non-ASCII data. Use explicit encoding and decoding instead. – Martijn Pieters Jul 21 '18 at 17:09
  • @MartijnPieters Yes, we know that. That's why you do it only for this one script (interpreter instance) because you know what you're doing, and you know it isn't going to be a problem in your script. You can't do explicit encoding/decoding in some cases in Python 2. – Keith Jul 21 '18 at 21:54
  • This gives `AttributeError: module 'sys' has no attribute 'setdefaultencoding'` on Python 3.7.2 – Evandro Coan May 04 '19 at 04:49
  • @user You don't need this with Python 3. – Keith May 05 '19 at 05:00
10

How to print UTF-8 encoded text to the console in Python < 3?

print u"some unicode text \N{EURO SIGN}"
print b"some utf-8 encoded bytestring \xe2\x82\xac".decode('utf-8')

i.e., if you have a Unicode string then print it directly. If you have a bytestring then convert it to Unicode first.

Your locale settings (LANG, LC_CTYPE) indicate a UTF-8 locale, so in theory you could print a UTF-8 bytestring directly and it would be displayed correctly in your terminal (provided the terminal settings are consistent with the locale settings, as they should be). You should avoid that, though: do not hardcode the character encoding of your environment inside your script; print Unicode directly instead.

There are many wrong assumptions in your question.

You do not need to set PYTHONIOENCODING with your locale settings, to print Unicode to the terminal. utf-8 locale supports all Unicode characters i.e., it works as is.

You do not need the workaround sys.stdout = codecs.getwriter(locale.getpreferredencoding())(sys.stdout). It may break if some code (that you do not control) does need to print bytes, and/or it may break while printing Unicode to the Windows console (wrong codepage, can't print undecodable characters). Correct locale settings and/or the PYTHONIOENCODING envvar are enough. Also, if you need to replace sys.stdout, use io.TextIOWrapper() instead of the codecs module, as the win-unicode-console package does.
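
For the `io.TextIOWrapper()` suggestion, a minimal sketch could look like the following. It assumes UTF-8 as the target encoding and uses an `io.BytesIO` buffer in place of the real byte stream (on Python 3 that would be `sys.stdout.buffer`). The `errors="backslashreplace"` policy is an illustrative choice, not something this answer prescribes.

```python
import io

raw = io.BytesIO()  # stand-in for the underlying byte stream

# Text written to `out` is encoded as UTF-8 and forwarded to `raw`.
out = io.TextIOWrapper(
    raw,
    encoding="utf-8",
    errors="backslashreplace",  # assumption: escape anything unencodable
    newline="\n",               # do not translate line endings
)
out.write(u"caf\u00e9 \u20ac\n")
out.flush()  # push buffered text through the encoder

encoded = raw.getvalue()
```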

sys.getdefaultencoding() is unrelated to your locale settings and to PYTHONIOENCODING. Your assumption that setting PYTHONIOENCODING should change sys.getdefaultencoding() is incorrect. You should check sys.stdout.encoding instead.
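
The difference between the two settings can be checked from Python itself. This is a rough sketch, assuming a Python 3 interpreter is running it: the hypothetical helper `report_encodings` starts a child interpreter with `PYTHONIOENCODING` set and reports both values; `latin-1` is an arbitrary example value chosen so the effect is visible.

```python
import codecs
import os
import subprocess
import sys


def report_encodings(io_encoding):
    """Run a child interpreter with PYTHONIOENCODING set; return what it sees."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    code = "import sys; print(sys.stdout.encoding); print(sys.getdefaultencoding())"
    output = subprocess.check_output([sys.executable, "-c", code], env=env)
    stdout_enc, default_enc = output.decode("ascii").split()
    # Normalize the spelling so that aliases compare equal.
    return codecs.lookup(stdout_enc).name, default_enc


stdout_enc, default_enc = report_encodings("latin-1")
```

Here `stdout_enc` follows the environment variable, while `default_enc` does not: on Python 3 it is always `utf-8`.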

sys.getdefaultencoding() is not used when you print to the console. It may be used as a fallback on Python 2 if stdout is redirected to a file/pipe unless PYTHONIOENCODING is set:

$ python2 -c'import sys; print(sys.stdout.encoding)'
UTF-8
$ python2 -c'import sys; print(sys.stdout.encoding)' | cat
None
$ PYTHONIOENCODING=utf8 python2 -c'import sys; print(sys.stdout.encoding)' | cat
utf8

Do not call sys.setdefaultencoding("UTF-8"); it may corrupt your data silently and/or break 3rd-party modules that do not expect it. Remember sys.getdefaultencoding() is used to convert bytestrings (str) to/from unicode in Python 2 implicitly e.g., "a" + u"b". See also, the quote in @mesilliac's answer.
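
The alternative to changing the default encoding is to decode bytes explicitly at the boundary. A minimal sketch, assuming the incoming bytes are UTF-8; `to_text` is a hypothetical helper name, and the code is written to behave the same on Python 2 and Python 3:

```python
def to_text(value, encoding="utf-8"):
    """Return `value` as a Unicode string, decoding bytestrings explicitly."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


# Decode first, then combine with other Unicode text; no implicit
# conversion through the default encoding is involved.
label = to_text(b"\xe2\x82\xac 5") + u" total"
```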

Community
jfs
6

If the program does not display the appropriate characters on the screen (e.g., it shows invalid symbols), run it with the following command line:

PYTHONIOENCODING=utf8 python3 yourprogram.py

Or the following, if your program is a globally installed module:

PYTHONIOENCODING=utf8 yourprogram

On some platforms, such as Cygwin (mintty.exe terminal) with Anaconda Python (or Python 3), running export PYTHONIOENCODING=utf8 once and then running the program later does not work; you have to prefix the command with PYTHONIOENCODING=utf8 every time you run the program.

On Linux, when using sudo, you can try passing the -E argument so that your environment variables are preserved for the sudo process:

export PYTHONIOENCODING=utf8
sudo -E python yourprogram.py

If you try this and it does not work, you will need to start a root shell with sudo first:

sudo /bin/bash
PYTHONIOENCODING=utf8 yourprogram

Related:

  1. How to print UTF-8 encoded text to the console in Python < 3?
  2. Changing default encoding of Python?
  3. Forcing UTF-8 over cp1252 (Python3)
  4. Permanently set Python path for Anaconda within Cygwin
  5. https://superuser.com/questions/1374339/what-does-the-e-in-sudo-e-do
  6. Why bash -c 'var=5 printf "$var"' does not print 5?
  7. https://unix.stackexchange.com/questions/296838/whats-the-difference-between-eval-and-exec
Evandro Coan
  • Is `utf8` case-sensitive? Also, is the only possible setting `utf8` or is `utf-8` also valid? It's just because I've been seeing so many variants... (and you're using two of them in your answer! ) – Gwyneth Llewelyn May 06 '19 at 22:35
  • 1
    I think that, at least for my Python `3.7.2`, `UTF-8` is case-insensitive, but I am not sure whether the hyphen in UTF-8 is ignored. – Evandro Coan May 06 '19 at 23:53
  • that makes sense — I was using Python `2.7.X` and I was unsure about what to use... – Gwyneth Llewelyn May 20 '19 at 21:40
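
On the spelling question raised in these comments: Python resolves encoding names through its codec registry, so the common variants can be compared with `codecs.lookup()`. A small sketch (the list of spellings is only an illustrative sample):

```python
import codecs

spellings = ["utf8", "UTF8", "utf-8", "UTF-8", "utf_8"]

# Each alias is looked up in the codec registry; `.name` is the
# canonical name of the codec that was found.
canonical = {codecs.lookup(name).name for name in spellings}
```

All of these spellings resolve to the same codec, whose canonical name is `utf-8`.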
3

I realize the OP's question is about Linux, but for those who end up here through a search engine: on Windows 10 the following fixes the issue:

set PYTHONIOENCODING=utf8
python myscript.py
SaeX