I thought there was another question like this on SO, but specifically for a C shared library module; this answer may have been more appropriate there, but I cannot find the link now :)
In brief, my answer would be: try locale.setlocale(locale.LC_ALL, '') before you load the module (I haven't used cmd myself yet). In more detail:
I was trying to use the SWIG Python bindings for Subversion (SVN). These are basically an automatic interface for Python produced by SWIG, directly from SVN C library code (libsvn1). When I run svn status MyWorkingCopy from the terminal, it hooks into libsvn code, and it hasn't failed for years now (for that repository). But when I ran the Python example (doing the same thing as svn status), which hooks into the same libsvn code, from the same terminal, I'd get a UTF-8 related error from libsvn/SWIG, which would crash my Python script.
This suggests that Python somehow "influenced" the library to behave differently with respect to character sets. But my terminal persistently reports:
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
LC_CTYPE="en_US.UTF-8"
...
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
So, it's not about what the terminal/shell (bash in this case) thinks; it's about what the underlying C code (of libsvn in this case) thinks about the current settings. And the same, I thought, applies to Python:
$ python -c 'import locale; print locale.getdefaultlocale()'
('en_US', 'UTF-8')
So, now it's about seeing what C code sees when run from the terminal vs. when run from Python (in the same terminal). Debugging libsvn further, it turned out the error actually comes from another library, libapr (Apache Portable Runtime), which SVN uses for memory allocation, among other things. What I ended up doing is writing a repeat of the string copying done by libsvn (which uses libapr) in a standalone C program, and then building it via SWIG as a Python module. This program, aprtest, accepts a string as argument, invokes the libapr engine to copy it, and displays the result; the source for it is posted here:
See the script build-aprtest.sh for library versions I worked with (Ubuntu 11.04); to build, run bash build-aprtest.sh.
Now, if you run the executable thus built in terminal, you get:
$ locale
LANG=en_US.UTF-8
...
$ ./aprtest "test"
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0
(*dest)->data: test
$ ./aprtest "test東京"
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22
The libapr engine clearly failed on UTF-8 input from the command line (apr_err 22 is EINVAL on Linux, and the reported codeset ANSI_X3.4-1968 is plain ASCII), in spite of the terminal reporting UTF-8. And when we run it as a shared module (called aprtest_s) through Python:
$ python -c 'import aprtest_s; aprtest_s.pysmain("test")'
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0
(*dest)->data: test
$ python -c 'import aprtest_s; aprtest_s.pysmain("test東京")'
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22
... the same happens (btw, for the same issue with SVN and APR, but for Perl, see Is there a variable or function that returns the native platform encoding (APR_LOCALE_CHARSET)). So we can conclude:
- It doesn't matter whether the C program is run directly from the terminal or through Python: the C program simply sees different locale/encoding settings from what the calling program may see
- There is no problem with ASCII strings, only with the UTF-8 ones
So, how then does the svn client work properly from the terminal, while ultimately using libapr without crashing? Well, as can be seen in the comments in the source of aprtest_s.c, it does so by setting the program's own locale with the C function setlocale(LC_CTYPE, ""), which applies the character-type/encoding category of the environment's locale to the process (setlocale(LC_ALL, "") would apply all categories). This issue is actually mentioned on the apr-dev mailing list, in Re: Misbehaviour of apr_os_locale_encoding on Windows:
... this picking of one
of the 55 different current locales can probably only be properly done
by the application, not by APR.
So, by calling setlocale() in the C application, we apparently pick the default locale explicitly, so libapr knows about it. In the test case, this call to setlocale must happen before the call to apr_xlate_open.
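The before/after effect of that call can be reproduced without libapr. This is only a minimal sketch in Python 3 (my transcripts are Python 2.7), and it assumes a POSIX system where locale.nl_langinfo is available. The helper names are made up for illustration, and I am assuming that the codeset line aprtest prints corresponds to what nl_langinfo(CODESET) returns. Since, as far as I know, Python 3 already applies the environment's LC_CTYPE at startup, the sketch first forces the "C" locale to imitate what a plain C program starts with.

```python
import locale


def c_library_codeset():
    """Return the character set the C library reports for the current LC_CTYPE."""
    # POSIX only: locale.nl_langinfo does not exist on Windows.
    return locale.nl_langinfo(locale.CODESET)


def codeset_before_and_after_env_locale():
    """Report the codeset in the "C" locale, then after adopting the environment's."""
    saved = locale.setlocale(locale.LC_CTYPE)  # query only; nothing changes
    try:
        # Imitate the state a C program starts in (assumption: plain "C" locale).
        locale.setlocale(locale.LC_CTYPE, "C")
        before = c_library_codeset()
        try:
            # Same idea as setlocale(LC_CTYPE, "") in the C program.
            locale.setlocale(locale.LC_CTYPE, "")
        except locale.Error:
            pass  # the environment names a locale that is not installed
        after = c_library_codeset()
    finally:
        locale.setlocale(locale.LC_CTYPE, saved)  # restore the original setting
    return before, after


if __name__ == "__main__":
    before, after = codeset_before_and_after_env_locale()
    print("codeset in the C locale:", before)
    print("codeset after setlocale(LC_CTYPE, ''):", after)
```

On a glibc system with a UTF-8 environment I would expect this to mirror the transcripts (an ASCII codeset name first, then UTF-8), but other C libraries name the ASCII codeset differently, so treat the exact strings as system-dependent.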
Now, the posted version of aprtest doesn't call setlocale, so we can see what happens from Python (note also this) when we use the Python version, locale.setlocale():
$ PYTHONIOENCODING='utf-8' echo 'import sys;print sys.stdin.encoding' | python
None
$ echo 'import sys;print sys.stdin.encoding' | PYTHONIOENCODING='utf-8' python
utf-8
$ locale
LANG=en_US.UTF-8
LANGUAGE=en_US:en
...
$ python
Python 2.7.1+ (r271:86832, Sep 27 2012, 21:16:52)
[GCC 4.5.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import aprtest_s
>>> aprtest_s.print_locale()
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
>>> aprtest_s.pysmain("test")
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0
(*dest)->data: test
>>> aprtest_s.pysmain("test東京")
LC_CTYPE 0 CODESET 14
ANSI_X3.4-1968
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 22
>>> import locale
>>> print locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> print locale.getlocale()
(None, None)
>>> import sys
>>> print sys.stdin.encoding
UTF-8
>>> locale.setlocale(locale.LC_ALL, '')
'en_US.UTF-8'
>>> print sys.stdin.encoding
UTF-8
>>> print locale.getlocale()
('en_US', 'UTF-8')
>>> aprtest_s.pysmain("test")
LC_CTYPE 0 CODESET 14
UTF-8
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0
(*dest)->data: test
>>> aprtest_s.pysmain("test東京")
LC_CTYPE 0 CODESET 14
UTF-8
apr_xlate_open: apr_err=0
apr_xlate_conv_buffer apr_err == 0
(*dest)->data: test東京
>>>
Thus, to check what a C application sees when loaded from Python, use locale.getlocale() (NOT locale.getdefaultlocale()). The way I understand it now, getdefaultlocale returns the OS/user settings (taken from environment variables such as LANG) that are considered the default, but that are not necessarily applied when an application starts; getlocale returns the actual, currently applied locale settings. And I guess that when we call setlocale with an empty string, it reads the default settings (those given by getdefaultlocale) and then applies them as current.
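The "applied vs. default" distinction can also be inspected per category. The following is a rough sketch in Python 3, not what I ran above: applied_locale and apply_environment_locale are hypothetical helper names, and only locale.setlocale itself is a real API here. Calling setlocale with just a category queries the current setting without changing it, much like setlocale(category, NULL) in C. Note that getdefaultlocale is deprecated in recent Python 3 releases, so the sketch avoids it.

```python
import locale

# Categories that the locale module defines on all platforms.
CATEGORIES = ("LC_CTYPE", "LC_COLLATE", "LC_MONETARY", "LC_NUMERIC", "LC_TIME")


def applied_locale():
    """Return the locale currently applied to this process, per category."""
    # With no second argument, setlocale only reports the current setting.
    return {name: locale.setlocale(getattr(locale, name)) for name in CATEGORIES}


def apply_environment_locale():
    """Adopt the locale named by LANG/LC_* and return True on success."""
    try:
        locale.setlocale(locale.LC_ALL, "")
        return True
    except locale.Error:
        return False  # e.g. the environment names a locale that is not installed


if __name__ == "__main__":
    print("applied at start:", applied_locale())
    apply_environment_locale()  # do this before importing the C module
    print("applied now:", applied_locale())
```

In my case the idea would be to call apply_environment_locale() before import aprtest_s, so that the C code in the module sees the UTF-8 locale from its first call.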
And as a final note: even though it looks related, the encoding settings of stdin/stdout (apparently) have nothing to do with the encoding of the current locale (at least as seen by a C program running in that environment).
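One way to check this independence is to start a child interpreter with different PYTHONIOENCODING values and compare what it reports. This is a hedged Python 3 sketch for POSIX systems; encodings_with_io_override is a made-up helper name, and it looks at sys.stdout rather than sys.stdin simply because the child's stdout is the pipe being captured.

```python
import os
import subprocess
import sys

# Child script: print the stdio encoding, then the codeset the C library reports.
CHILD = (
    "import locale, sys\n"
    "print(sys.stdout.encoding)\n"
    "print(locale.nl_langinfo(locale.CODESET))\n"
)


def encodings_with_io_override(io_encoding):
    """Run a child Python with PYTHONIOENCODING set; return (stdio, codeset)."""
    env = dict(os.environ, PYTHONIOENCODING=io_encoding)
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        check=True,
    )
    # Both lines are plain ASCII names without spaces.
    stdio, codeset = result.stdout.decode("ascii").split()
    return stdio, codeset


if __name__ == "__main__":
    for name in ("utf-8", "latin-1"):
        print(name, "->", encodings_with_io_override(name))
```

The stdio encoding should follow PYTHONIOENCODING, while the codeset stays whatever the locale environment dictates in both runs.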
Hope this helps someone,
Cheers!