Using utf-8 input for cmd Python module

Question

In the process of creating a small CLI notebook application, I decided to go with the cmd Python library (see also cmd on PyMOTW).

My shell is UTF-8.

→ echo $LANG
fr_FR.utf-8
→ echo $LC_ALL
fr_FR.utf-8

And it is working quite well.

→ echo "東京"
東京

Starting the code of my little app and trying to use utf-8:

→ python nb.py 
log> foobar
2013-01-15 foobar
log> æ±äº¬
2013-01-15 æ±äº¬

When I type UTF-8 characters, whether accented letters or, as here, Japanese characters, I get garbage. Edited: the expected input/output is:

log> 東京
2013-01-15 東京

So once the program starts, the command line seems to change the encoding of the input.

#!/usr/bin/env python2.7
# encoding: utf-8
import datetime
import os.path
import logging
import cmd

ROOT = "~/test/"
NOTENAME = "notes.md"

def todaynotepath(rootpath, notename):
    # Build a dated path such as <rootpath>2013/01/15/<notename>.
    isodate = datetime.date.today().isoformat()
    return rootpath + isodate.replace("-", "/") + "/%s" % (notename)

def addcontent(content):
    logging.info(content)

class NoteBook(cmd.Cmd):
    """Simple cli notebook."""
    prompt = "log> "

    def precmd(self, line):
        # What is the date path NOW
        notepath = todaynotepath(ROOT, NOTENAME)
        # if the directory of the note doesn't exist, create it.
        notedir = os.path.dirname(notepath)
        if not os.path.exists(notedir):
            os.makedirs(notedir)
        # if the file for notes today doesn't exist, create it.
        logging.basicConfig(filename=notepath, level=logging.INFO, format='%(asctime)s - %(message)s')
        return cmd.Cmd.precmd(self, line)

    def default(self, line):
        if line:
            print datetime.date.today().isoformat(), line
            addcontent(line)

    def do_EOF(self, line):
        return True

    def postloop(self):
        print

if __name__ == "__main__":
    NoteBook().cmdloop()

So I guess there might be something to override in the cmd.Cmd class. I checked the module, but without luck so far.

Edit 2: Added LESSCHARSET as recommended by @dda

LANG=fr_FR.utf-8
LANGUAGE=fr_FR.utf-8
LC_ALL=fr_FR.utf-8
LC_CTYPE=fr_FR.UTF-8
LESSCHARSET=utf-8
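
As a side note on the symptom itself: the garbled echo in the session above is what the UTF-8 bytes of 東京 look like when each byte is read as a separate Latin-1 character. The following is a minimal sketch of that reading, not a diagnosis of where in the toolchain it happens; the names `typed`, `raw_bytes`, `garbled` and `restored` are illustrative only.

```python
# -*- coding: utf-8 -*-
# Illustrative only: reproduce the garbage from the session above by
# decoding UTF-8 bytes with the wrong (single-byte) codec.
typed = u"東京"

raw_bytes = typed.encode("utf-8")      # six bytes: e6 9d b1 e4 ba ac
garbled = raw_bytes.decode("latin-1")  # one character per byte

# The second character (U+009D) is an invisible control character,
# which is why only five glyphs show up on screen.
print([hex(ord(ch)) for ch in garbled])

# Reversing the mistake recovers the original text.
restored = garbled.encode("latin-1").decode("utf-8")
print(restored == typed)
```

If the text round-trips like this, the bytes themselves arrived intact and only their interpretation went wrong somewhere between the terminal and the output.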

From the example output of your app it's not clear to me what is going wrong. Can you show what you'd like to see? – Bernhard, Jan 15 '13 at 09:43
Unrelated note: this program creates a directory named `~` instead of creating a dir inside the user's home (which I expect is the intended behaviour). You can use `os.path.expanduser` to get the correct path from home. – lbonn, Jan 15 '13 at 09:50
@lbonn because I changed the real path to `~` for stackoverflow only. – karlcow, Jan 15 '13 at 12:30
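
A minimal sketch of lbonn's suggestion, assuming the root really is meant to live under the user's home directory. The function mirrors `todaynotepath` from the question but is rewritten here with `os.path.join`; the optional `today` parameter is a hypothetical addition that makes it easy to try with a fixed date.

```python
import datetime
import os.path

ROOT = "~/test/"
NOTENAME = "notes.md"

def todaynotepath(rootpath, notename, today=None):
    """Return <expanded root>/YYYY/MM/DD/<notename>."""
    # `today` is a hypothetical extra parameter, not in the original code.
    if today is None:
        today = datetime.date.today()
    # expanduser turns a leading "~" into the user's home directory.
    root = os.path.expanduser(rootpath)
    return os.path.join(root, today.strftime("%Y"), today.strftime("%m"),
                        today.strftime("%d"), notename)

print(todaynotepath(ROOT, NOTENAME, datetime.date(2013, 1, 15)))
```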

score 2 · Answer 1 · answered Apr 12 '13 at 17:17

I thought there was another question like this on SO, but specifically for a C shared library module; this answer may have been more appropriate there, but I cannot find the link now :)

In brief, my answer would be - try locale.setlocale(locale.LC_ALL, '') before you load the module (I haven't used cmd myself yet). In more detail:

I was trying to use the SWIG Python bindings for Subversion (SVN). These are basically an automatic interface for Python produced by SWIG, directly from SVN C library code (libsvn1). When I run svn status MyWorkingCopy from the terminal, it hooks into libsvn code - and it hasn't failed for years now (for that repository). But, when I ran the Python example (doing the same thing as svn status) - which hooks into the same libsvn code - from the same terminal, then I'd get a UTF-8 related error from libsvn/SWIG, which would crash my Python script.

This means that Python somehow "influenced" the library to behave differently with respect to character sets. But my terminal persistently reports:

$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
...
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

So, it's not about what the terminal/shell (bash in this case) thinks - it's about what the underlying C code (of libsvn in this case) thinks about the current settings. And the same, I thought, applies to Python:

$ python -c 'import locale; print locale.getdefaultlocale()'
('en_US', 'UTF-8')

So, now it's about seeing what the C code sees when run from the terminal vs. when run from Python (in the same terminal). Debugging libsvn further, it turned out the error actually comes from another library, libapr (Apache Portable Runtime), which SVN uses for memory allocation. What I ended up doing was to reproduce the string copying done by libsvn (which uses libapr) in a standalone C program, and then also build it via SWIG as a Python module. This program, aprtest, accepts a string as argument, invokes the libapr engine to copy it, and displays the result; the source for it is posted here:

http://sdaaubckp.sourceforge.net/dbg/swig-py/aprtest/

See the script build-aprtest.sh for library versions I worked with (Ubuntu 11.04); to build, run bash build-aprtest.sh.

Now, if you run the executable thus built in terminal, you get:

$ locale
LANG=en_US.UTF-8
...
$ ./aprtest "test"
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
$ ./aprtest "test東京"
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22

The libapr engine clearly failed on UTF-8 input from command line, in spite of the terminal reporting UTF-8. And when we run as a shared module (called aprtest_s) through Python:

$ python -c 'import aprtest_s; aprtest_s.pysmain("test")'
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
$ python -c 'import aprtest_s; aprtest_s.pysmain("test東京")'
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22

... the same happens (btw, for the same issue with SVN and APR, but for Perl, see Is there a variable or function that returns the native platform encoding (APR_LOCALE_CHARSET)). So we can conclude:

It doesn't matter whether the C program is run directly from the terminal or through Python - the C program simply sees different locale/encoding settings from what the calling program may see
There is no problem with ASCII strings, only with the UTF-8 ones

So how does the svn client work properly from the terminal, while ultimately using libapr, without crashing? As can be seen in the comments in the source of aprtest_s.c, it does so by setting the program's own locale with the C function setlocale(LC_CTYPE,""), which applies the environment's setting to the LC_CTYPE category (the one that governs character encoding) of the process's locale. This issue is actually mentioned on the apr-dev mailing list, in Re: Misbehaviour of apr_os_locale_encoding on Windows:

... this picking of one of the 55 different current locales can probably only be properly done by the application, not by APR.

So, by coding setlocale() in the C application, we apparently pick the default locale explicitly, so libapr knows about it. In the test case, this call to setlocale must happen before a call to apr_xlate_open.
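
The same ordering can be observed from Python without any C extension. The sketch below assumes a POSIX system, where `locale.nl_langinfo(locale.CODESET)` reports the character encoding of the `LC_CTYPE` locale currently applied to the process. The helper name `current_codeset` is made up for this example. Note that the "before" value depends on the interpreter: Python 2 starts in the C locale, as in the sessions here, while Python 3 typically applies the environment's `LC_CTYPE` at startup, so the two values may already match there.

```python
import locale

def current_codeset():
    """Encoding of the LC_CTYPE locale currently applied to this process."""
    # nl_langinfo is only available on POSIX platforms.
    return locale.nl_langinfo(locale.CODESET)

before = current_codeset()
try:
    # "" means: take the setting from LANG / LC_* in the environment.
    locale.setlocale(locale.LC_CTYPE, "")
except locale.Error:
    # The environment names a locale that is not installed on this system.
    pass
after = current_codeset()

print("before setlocale: %s" % before)
print("after setlocale:  %s" % after)
```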

Now, the posted version of aprtest doesn't do setlocale, so we can see what is happening from Python (note also this), when we use the Python version, locale.setlocale():

$ PYTHONIOENCODING='utf-8' echo 'import sys;print sys.stdin.encoding' | python
None
$ echo 'import sys;print sys.stdin.encoding' | PYTHONIOENCODING='utf-8' python
utf-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
...
$ python
Python 2.7.1+ (r271:86832, Sep 27 2012, 21:16:52) 
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import aprtest_s
>>> aprtest_s.print_locale()
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
>>> aprtest_s.pysmain("test")
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
>>> aprtest_s.pysmain("test東京")
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22 
>>> import locale
>>> print locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> print locale.getlocale()
(None, None)
>>> import sys
>>> print sys.stdin.encoding
UTF-8
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> print sys.stdin.encoding
UTF-8
>>> print locale.getlocale()
('en_US', 'UTF-8')
>>> aprtest_s.pysmain("test")
LC_CTYPE 0 CODESET 14
UTF-8
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test
>>> aprtest_s.pysmain("test東京")
LC_CTYPE 0 CODESET 14
UTF-8
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0 
(*dest)->data: test東京
>>>

Thus, to find out what a C application sees when running under Python, use locale.getlocale() (NOT locale.getdefaultlocale()). The way I understand it now, getdefaultlocale returns some OS/user settings saved somewhere which are considered to be the default, but which are not necessarily applied as default when an application starts; getlocale gets the actual, currently applied locale settings. And I guess that when we call setlocale with an empty string, it reads the default settings (those given by getdefaultlocale) and then applies them as current.
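
Applied to the question, the suggestion from the top of this answer might look like the sketch below. It is untested against the asker's setup and only shows the ordering: apply the environment's locale first, then start the `cmd` loop. `EchoBook`, `apply_environment_locale` and `main` are hypothetical names, and the class is a cut-down stand-in for `NoteBook`, not the original code.

```python
import cmd
import datetime
import locale

def apply_environment_locale():
    """Apply LANG / LC_* from the environment; return the applied locale name."""
    try:
        return locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        # The environment names a locale that is not installed;
        # keep whatever is currently applied and report that instead.
        return locale.setlocale(locale.LC_ALL)

class EchoBook(cmd.Cmd):
    """Hypothetical stand-in for NoteBook: echoes each line with today's date."""
    prompt = "log> "

    def default(self, line):
        self.stdout.write("%s %s\n" % (datetime.date.today().isoformat(), line))

    def do_EOF(self, line):
        return True

def main():
    # Set the locale before the loop starts reading input.
    apply_environment_locale()
    EchoBook().cmdloop()
```

Calling `main()` starts the interactive prompt. Whether this changes anything for `cmd` depends on how the line-editing library behind `raw_input` reacts to the process locale, which I have not verified.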

And as a final note - even though it looks related, the encoding settings of stdin/stdout (apparently) have nothing to do with the encoding of the current locale (at least as seen by a C program running in that environment).

Hope this helps someone,
Cheers!

score 1 · Accepted Answer · answered Jan 15 '13 at 09:51

Your code works perfectly for me, Karl. See this:

dda$ ./nb.py 
log> tagada
2013-01-15 tagada
log> 香港
2013-01-15 香港
log>

And the notes.md file contains the proper entries. So I don't think it's cmd that's at fault here, but probably something in your terminal settings. Try adding

export LESSCHARSET=utf-8

in your .profile.
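
Since the same script behaves differently on two machines, it may help to compare what each Python process actually sees. The helper below is a hypothetical diagnostic, not part of either setup: it collects the environment variables mentioned in this thread together with the encodings Python reports for the terminal streams. `ENV_NAMES` and `encoding_report` are made-up names.

```python
import locale
import os
import sys

# Environment variables mentioned in this thread.
ENV_NAMES = ("LANG", "LANGUAGE", "LC_ALL", "LC_CTYPE",
             "LESSCHARSET", "PYTHONIOENCODING")

def encoding_report():
    """Return a dict describing the encoding-related settings of this process."""
    report = dict((name, os.environ.get(name)) for name in ENV_NAMES)
    # getattr guards against replaced streams that lack an `encoding` attribute.
    report["sys.stdin.encoding"] = getattr(sys.stdin, "encoding", None)
    report["sys.stdout.encoding"] = getattr(sys.stdout, "encoding", None)
    try:
        # Locale currently applied to the process (not the environment default).
        report["locale.getlocale()"] = locale.getlocale()
    except ValueError:
        # Some platforms report locale names that Python cannot parse.
        report["locale.getlocale()"] = None
    report["python"] = sys.version.split()[0]
    return report

for key, value in sorted(encoding_report().items()):
    print("%-22s %r" % (key, value))
```

Running it on both machines and comparing the two listings line by line should show where the configurations differ.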

dda


ah interesting. I still have the issue even with `LESSCHARSET` set to the right value. But it means there is a difference in our config. – karlcow Jan 15 '13 at 12:53
what is the version of Python, you are using? And do you have any specific configuration for it? – karlcow Jan 16 '13 at 23:19
In that case I used the same Python 2.7 as you did. Specifically I have Python 2.7.2. Versions 2.5.4 and 2.6.1 didn't work. – dda Jan 17 '13 at 02:27
I guess I will accept this answer. Because it is working for you. I have to figure out what is not working for me :) – karlcow Jan 22 '13 at 15:00
I'm still on 10.6.8, but I don't think that's much of a difference. – dda Jan 24 '13 at 04:27
