Lost with encodings (shell and accents)

Question

I'm having trouble with encodings. I'm using version

Python 2.7.2+ (default, Oct 4 2011, 20:03:08) [GCC 4.6.1] on linux2

I have characters with accents, like é and à. My scripts use UTF-8 encoding:

#!/usr/bin/python
# -*- coding: utf-8 -*-

Users can type strings using raw_input(), wrapped in this helper:

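import readline

def rlinput(prompt, prefill=''):
    # Pre-fill the input line with editable text, then read the user's input.
    readline.set_startup_hook(lambda: readline.insert_text(prefill))
    try:
        return raw_input(prompt)
    finally:
        # Always remove the hook so later prompts start empty.
        readline.set_startup_hook()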

It is called in the main loop of the 'pseudo' shell:

while to_continue : 
    to_continue, feedback = action( unicode(rlinput(u'todo > '),'utf-8') )
    os.system('clear')
    print T, u"\n" + feedback

Data are stored as pickle in files.

I managed to get the app working, but I end up with silly things like this.

In the core file:

class Task():
    ...
    def __str__(self):
        r = (u"OK" if self._done else u"A faire").ljust(8) + self.getDesc()
        return r.encode('utf-8')

and so, in the shell file:

feedback = jaune + str(t).decode('utf-8') + vert + u" supprimée"

That's where I realized that I might be totally wrong about encoding/decoding. So I tried to decode directly in rlinput, but failed. I read some posts on Stack Overflow and re-read http://docs.python.org/library/codecs.html. While waiting for my Python book, I'm lost :/

I guess there is a lot of bad code, but my question here is only about the encoding issues. You can find the code here (most comments are in French, sorry, it's for personal use and I'm a beginner; you'll also need yapsy - http://yapsy.sourceforge.net/ ; then configure the paths, then in py_todo run ./todo_shell.py): http://bit.ly/rzp9Jm

To be clear: what do you exactly want to achieve? Or in other words, why did trying _"to decode directly in rlinput"_ fail? Could you perhaps state a use case in which you describe your actions and their expected outputs? I can't really find a question to answer in your current post... — jro, Nov 03 '11 at 10:15
**@jro**: the raw_input is OK, and the result is assigned to a variable. I get errors when displaying that input, like this: UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 10: ordinal not in range(128). I know I can fix it, but my question is more like 'should I decode user input, decode when the variable is set, etc.' @eryksun: I'll dig into that tonight, thanks — 7seb, Nov 03 '11 at 11:03
@eryksun : I've just implemented and the __unicode__ method : Now everything is ok ! THANKS — 7seb, Nov 04 '11 at 23:41

score 2 · Accepted Answer · answered Nov 03 '11 at 17:15

Standard input and output are byte-based on all Unix systems. That's why you have to call the unicode function to get character-strings for them. The decode error indicates that the bytes coming in are not valid UTF-8.

Basically, the problem is the assumption of UTF-8 encoding, which is not guaranteed. Confirm this by changing the encoding in your unicode call to 'ISO-8859-1', or by changing the character encoding of your terminal emulator to UTF-8. (PuTTY supports this, in the "Translation" menu.)
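To make that experiment concrete, here is a minimal sketch of what happens when the same accented text is decoded under the right and the wrong assumption. The sample word and all variable names are only illustrative, and the snippet avoids print so it behaves the same on Python 2 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch: one accented word, encoded two ways, decoded under three assumptions.
text = u"supprim\u00e9e"                    # u"supprimée"

utf8_bytes = text.encode("utf-8")           # é becomes two bytes: 0xC3 0xA9
latin1_bytes = text.encode("iso-8859-1")    # é becomes one byte: 0xE9

# Correct assumption: the bytes round-trip back to the original text.
assert utf8_bytes.decode("utf-8") == text

# ISO-8859-1 decoder on UTF-8 bytes: no error is raised, but the result is
# mojibake (every byte maps to some character in ISO-8859-1).
mojibake = utf8_bytes.decode("iso-8859-1")

# UTF-8 decoder on ISO-8859-1 bytes: the lone 0xE9 byte is not valid UTF-8.
try:
    latin1_bytes.decode("utf-8")
    utf8_rejected = False
except UnicodeDecodeError:
    utf8_rejected = True

# ASCII decoder on UTF-8 bytes: this is the decoding Python 2 applies
# implicitly when a byte str is mixed with a unicode string.
try:
    utf8_bytes.decode("ascii")
    ascii_rejected = False
except UnicodeDecodeError:
    ascii_rejected = True
```

If the terminal really sends UTF-8, only the first decode gives the expected text. The last case produces the same kind of "'ascii' codec can't decode byte 0xc3" error quoted in the comments on the question.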

If the above experiment confirms this, your challenge is to support the locale of the user and deduce the correct encoding, or perhaps to make the user declare the encoding in a command line argument or configuration. The $LANG environment variable is about the best you can do without an explicit declaration, and I find it to be a poor indicator of the desired character encoding.
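One possible way to combine an explicit declaration with the environment's hints is sketched below. The function name guess_input_encoding and its parameters are hypothetical, and the priority order (explicit choice, then sys.stdin.encoding, then the locale, then a default) is an assumption, not an established rule.

```python
import codecs
import locale
import sys


def guess_input_encoding(explicit=None, default="utf-8"):
    """Return the first usable encoding name among several candidates."""
    candidates = [
        explicit,                              # e.g. from a command-line option
        getattr(sys.stdin, "encoding", None),  # may be None when redirected
        locale.getpreferredencoding(False),    # derived from the locale settings
    ]
    for name in candidates:
        if not name:
            continue
        try:
            codecs.lookup(name)                # skip names Python does not know
        except LookupError:
            continue
        return name
    return default


encoding = guess_input_encoding()
```

An explicit value, when the user provides one, wins over anything deduced from the environment, which matters when the environment and the terminal disagree.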

answered Nov 03 '11 at 17:15

wberry


I'm logged into a Linux server now with my terminal's character encoding set to UTF-8. But `os.environ['LANG']` is `'en_US'` and therefore `sys.stdin.encoding` is `'ISO-8859-1'`, which is wrong. If I enter text, and rely on `sys.stdin.encoding` to decode the bytes, I will misinterpret the data. – wberry Nov 03 '11 at 23:35
That said, I'm not certain that GNU readline is not doing something related to encoding under the covers. Maybe it is, in which case my answer may not apply at all. – wberry Nov 03 '11 at 23:37
Shouldn’t you just set the stream encoding to be UTF-8? Isn’t manual decoding virtually always the wrong answer? – tchrist Nov 04 '11 at 00:36
True enough; I checked [the code](http://hg.python.org/cpython/file/d1cde7081bf5/Python/pythonrun.c#l269). I didn't think about the environment being configured differently from the terminal. Is there no way to query the terminal to set LANG automatically? I mostly use Windows -- which has a different set of problems. – Eryk Sun Nov 04 '11 at 00:51
@eryksun: I always set my input and output encoding to UTF-8. If they don’t want that, they can use `iconv`. I despise programs that behave differently when run with redirection, or by a different user. They are inherently broken. Python’s guessing games with encodings is a pain in the @ss. Just set it to UTF-8 every time for deterministic and predictable behavior. Let them come to you, so to speak. – tchrist Nov 04 '11 at 01:07
Basically, this is another case of "old school" ASCII-centric thinking that hasn't caught up with Unicode. Unix is designed around ASCII. Stdio, the environment, the filesystem, process names, ... Everything is bytes with the assumption of ASCII. Even C's `char` type is really a byte. These kinds of issues are a direct result and the only real solution is to "fix" the OS. – wberry Nov 04 '11 at 14:04

score 0 · Answer 2 · edited May 23 '17 at 12:21

As @wberry suggested, I checked the encodings. They are OK:

$ file --mime-encoding todo_shell.py task.py todo.py
todo_shell.py: utf-8
task.py:       utf-8
todo.py:       utf-8
$ echo $LANG
fr_FR.UTF-8
$ python -c "import sys; print sys.stdin.encoding"
UTF-8

As @eryksun suggested, I decoded the user input (and encoded the prefill strings passed in beforehand). If I remember correctly this solved some problems; I will test it more thoroughly later:

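import readline
import sys

def rlinput(prompt, prefill=''):
    # Encode the unicode prefill to bytes before handing it to readline.
    readline.set_startup_hook(lambda: readline.insert_text(prefill.encode(sys.stdin.encoding)))
    try:
        # Decode the typed bytes into unicode as soon as they come in.
        return raw_input(prompt).decode(sys.stdin.encoding)
    finally:
        readline.set_startup_hook()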

I still have problems, but my question was not well defined, so I can't expect a clear answer. I feel less lost now and have directions to search. Thank you!

EDIT: I replaced the __str__ methods with __unicode__ and it fixed some (all?) of the problems.

Thanks @eryksun for the tips. (This link helped me: Python __str__ versus __unicode__)
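For readers who want to see the shape of that change, here is a minimal sketch. The Task class below is a simplified, hypothetical stand-in for the one in the question (the real one uses getDesc()), and the PY2 check is only there so the snippet also runs on Python 3, where __unicode__ has no special meaning.

```python
# -*- coding: utf-8 -*-
import sys

PY2 = sys.version_info[0] == 2


class Task(object):
    """Simplified stand-in for the Task class from the question."""

    def __init__(self, desc, done=False):
        self._desc = desc          # kept as a text (unicode) string
        self._done = done

    def __unicode__(self):
        # Build the display form from text strings only, with no encoding step.
        status = u"OK" if self._done else u"A faire"
        return status.ljust(8) + self._desc

    def __str__(self):
        text = self.__unicode__()
        # Python 2 expects bytes from __str__; Python 3 expects text.
        return text.encode("utf-8") if PY2 else text


task = Task(u"acheter du caf\u00e9")
# On Python 2, unicode(task) calls __unicode__; it is called directly here
# so the same line works on both versions.
feedback = task.__unicode__() + u" supprim\u00e9e"
```

With this layout the shell code concatenates text with text, so the encode in __str__ followed by a decode in the caller is no longer needed.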
