Python 2.7 unicode confusion again

Question

I've already read this:

Setting the correct encoding when piping stdout in Python

And I'm trying to stick with the rule of thumb: "Always use Unicode internally. Decode what you receive, and encode what you send."

So here's my main file:

# coding: utf-8

import os
import sys

from myplugin import MyPlugin
if __name__ == '__main__':
    c = MyPlugin()
    a = unicode(open('myfile.txt').read().decode('utf8'))
    print(c.generate(a).encode('utf8'))

What is getting on my nerves is that:

I read in a UTF-8 file, so I decode it.
Then I force-convert it to unicode, which gives unicode(open('myfile.txt').read().decode('utf8')).
Then I try to output it to a terminal.
On my Linux shell I need to re-encode it to UTF-8. I guess this is normal: I've been working on a Unicode string all along, so to output it I have to re-encode it as UTF-8 (correct me if I'm wrong here).
When I run it with PyCharm under Windows, it's UTF-8 encoded twice, which gives me things like agrÃ©able, dÃ©jÃ. If I remove encode('utf8') (which changes the last line to print(c.generate(a))), it works with PyCharm but no longer works on Linux, where I get: 'ascii' codec can't encode character u'\xe9' in position blabla. You know the problem.

If I try in the command line:

Linux/shell ssh: import sys; sys.stdout.encoding gives 'UTF-8'
Linux/shell in my code: import sys; sys.stdout.encoding gives None (WTF??)
Windows/PyCharm: import sys; sys.stdout.encoding gives 'windows-1252'

What is the best way to code this so it works on both environments?

I like to use [codecs](https://docs.python.org/2.7/library/codecs.html) to open files as `utf-8`, and I always write `u"anystring"` literals inside the code. The file has to be saved as UTF-8, of course. I don't have many problems that way. If you're using an IDE, it also has to be configured to read UTF-8 by default, and so does the shell. This might not help you at all, but it is how I stay out of most encoding trouble. – colidyre, Sep 30 '15 at 15:56

warvariuc · Answer 1 · 2015-09-30T16:04:04.347

0

unicode(open('myfile.txt').read().decode('utf8'))

There is no need to wrap this with unicode(), because the result of str.decode is already a unicode object.
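A minimal sketch of that point, using a hard-coded byte string in place of the contents of myfile.txt (the sample bytes are an assumption for illustration). `type(u'')` is used so the snippet runs on both Python 2 and Python 3:

```python
# -*- coding: utf-8 -*-
# UTF-8 bytes, standing in for what open('myfile.txt').read() returns in Python 2.
raw = b'agr\xc3\xa9able'

# Decoding already yields a text (unicode) object.
decoded = raw.decode('utf8')

# type(u'') is `unicode` on Python 2 and `str` on Python 3.
text_type = type(u'')
already_text = isinstance(decoded, text_type)

# Wrapping the decoded value again changes nothing.
rewrapped = text_type(decoded)
```

Here `already_text` is true and `rewrapped` compares equal to `decoded`, so the extra unicode(...) call is redundant.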

print(c.generate(a).encode('utf8'))

There is no need to encode, because Python encodes the string itself according to the terminal encoding.

So the correct way to do it is:

print(c.generate(a))
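To illustrate why the manual encode('utf8') looks wrong in PyCharm on Windows, here is a small sketch. It assumes the console interprets the bytes it receives as windows-1252, as the reported sys.stdout.encoding suggests:

```python
# -*- coding: utf-8 -*-
# u'agréable' written with an escape so the snippet does not depend on the file encoding.
text = u'agr\xe9able'

# What the manual .encode('utf8') hands to print: raw UTF-8 bytes.
utf8_bytes = text.encode('utf8')

# How a windows-1252 console would interpret those same bytes (assumption).
garbled = utf8_bytes.decode('cp1252')
```

The two bytes of the UTF-8 encoded é are shown as two separate windows-1252 characters, which gives the agrÃ©able seen in the question.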

You are getting 'ascii' codec can't encode character u'\xe9' in position because your Linux terminal has ascii encoding, so Python cannot print non-ASCII Unicode characters to it.
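The same error can be reproduced without any terminal by encoding to ASCII explicitly. This is only a sketch of what print does internally when it falls back to the ascii codec:

```python
# A single non-ASCII character: é (U+00E9).
text = u'\xe9'

try:
    # The ascii codec only covers code points 0-127, so this raises.
    text.encode('ascii')
    error_message = None
except UnicodeEncodeError as exc:
    error_message = str(exc)
```

`error_message` then holds the familiar "'ascii' codec can't encode character ..." text.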

See https://wiki.python.org/moin/PrintFails

I would suggest fixing your terminal (environment), not the code. You should not depend on the terminal encoding, especially since this kind of output usually ends up in a file.

If you still want to print it to any terminal which supports ASCII, you can use str.encode('unicode-escape'):

>>> print(u'щхжы'.encode('unicode-escape'))
\u0449\u0445\u0436\u044b

But it will not be very readable for humans, so I don't see the point.

edited Sep 30 '15 at 16:04

answered Sep 30 '15 at 15:57

warvariuc


You say that my terminal has ascii encoding, so what I don't understand is why, if I launch python interactively in my terminal and check sys.stdout.encoding, I get 'UTF-8', whereas if I launch it with "python mymain.py", I get "None" as the encoding? – Olivier Pons Sep 30 '15 at 16:05
If I try your sample in the shell, in the interactive python prompt, `print u"\u03A9"` works, whereas in the main file it doesn't. Where could this problem come from? – Olivier Pons Sep 30 '15 at 16:07
Ok found the solution: my last line should be `print(c.generate(a).encode(sys.stdout.encoding))` – Olivier Pons Sep 30 '15 at 16:09
As mentioned [here](https://wiki.python.org/moin/PrintFails): "When Python does not detect the desired character set of the output, it sets sys.stdout.encoding to None, and print will invoke the 'ascii' codec." I don't know why that happens when you run a script. How are you launching the script? – warvariuc Sep 30 '15 at 16:14
I'm launching it like this: `python myfile.py` – Olivier Pons Sep 30 '15 at 16:45
Are you doing any piping like `python myfile.py | cat`? – warvariuc Sep 30 '15 at 16:49
@OlivierPons See [this](http://stackoverflow.com/questions/492483/setting-the-correct-encoding-when-piping-stdout-in-python). Note this comment: "You should not be manually converting on each input and output of your program; that's brittle and completely unmaintainable." – warvariuc Oct 01 '15 at 05:23
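One caveat about the `encode(sys.stdout.encoding)` idea from the comments: when sys.stdout.encoding is None, as reported for the script run, encode(None) raises a TypeError. A hedged sketch of a fallback follows. The helper name `output_encoding`, the `FakePipe` class and the choice of 'utf-8' as the default are assumptions for illustration, and the answers still recommend fixing the environment instead:

```python
import sys


def output_encoding(stream=None, fallback='utf-8'):
    """Return the stream's encoding, or `fallback` if it is None or missing.

    Hypothetical helper, not part of the standard library.
    """
    if stream is None:
        stream = sys.stdout
    return getattr(stream, 'encoding', None) or fallback


class FakePipe(object):
    """Stand-in for a stdout whose encoding Python could not detect."""
    encoding = None


# With no detectable encoding, the helper falls back to UTF-8.
encoded = u'agr\xe9able'.encode(output_encoding(FakePipe()))
```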

Alastair McCormack · Accepted Answer · 2015-09-30T17:15:44.820

Your philosophy is correct, but you're overcomplicating things and making your code brittle.

Open files in text mode with an explicit encoding (io.open in Python 2) so the contents are converted to Unicode for you automatically. Then print without encoding; print is supposed to work out the correct encoding.

If your Linux environment isn't set up correctly, set PYTHONIOENCODING=utf-8 in your Linux environment variables (export PYTHONIOENCODING=utf-8) to fix any issues during print. You should also consider setting your locale to a UTF-8 variant such as en_GB.UTF-8, which avoids having to define PYTHONIOENCODING at all.
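As a rough check of what PYTHONIOENCODING does, the sketch below starts a child interpreter with its stdout captured through a pipe (the situation where detection tends to fail) and asks it for sys.stdout.encoding. The helper name `child_stdout_encoding` is an assumption for illustration:

```python
import os
import subprocess
import sys


def child_stdout_encoding(extra_env):
    """Run a child Python with piped stdout and return its sys.stdout.encoding."""
    env = dict(os.environ)
    env.update(extra_env)
    out = subprocess.check_output(
        [sys.executable, '-c',
         'import sys; sys.stdout.write(str(sys.stdout.encoding))'],
        env=env)
    return out.decode('ascii').strip()


# With the variable set, the child reports UTF-8 even though stdout is a pipe.
with_override = child_stdout_encoding({'PYTHONIOENCODING': 'utf-8'})
```

Calling the helper with an empty dict instead shows what the interpreter picks up from the environment as it is, which on Python 2.7 with piped output is typically None.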

PyCharm should work without modification.

Your code should look like:

import os
import sys
import io

from myplugin import MyPlugin

if __name__ == '__main__':
    c = MyPlugin()
    # t is the default
    with io.open('myfile.txt', 'rt', encoding='utf-8') as myfile:
        # a is now a Unicode string
        a = myfile.read()

    result = c.generate(a)
    # Parenthesised form works on both Python 2 and Python 3
    print(result)

If you're using Python 3.x, drop `import io` and the `io.` prefix from `io.open()`.
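A self-contained sketch of the io.open approach, using a temporary file in place of myfile.txt (the file contents are an assumption for illustration). It runs on both Python 2.7 and Python 3:

```python
import io
import os
import tempfile

# Create a temporary file containing UTF-8 bytes for u'déjà\n'.
fd, path = tempfile.mkstemp(suffix='.txt')
try:
    with os.fdopen(fd, 'wb') as raw_file:
        raw_file.write(b'd\xc3\xa9j\xc3\xa0\n')

    # Text mode plus an explicit encoding: read() returns Unicode text directly.
    with io.open(path, 'rt', encoding='utf-8') as text_file:
        text = text_file.read()
finally:
    os.remove(path)

# True on Python 2 (unicode) and Python 3 (str) alike.
is_text = isinstance(text, type(u''))
```

No explicit .decode('utf8') call is needed, because the file object does the decoding as it reads.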
