Python's handling of shell strings

Question

I still do not completely understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same issue.

What I know:

str is an older beast that saves strings encoded by one of the way too many encodings that history has forced us to work with.

unicode is a more standardised way of representing strings using a huge table of all possible characters, emojis, little pictures of dog poop and so on.

The decode method transforms a str into unicode; encode does the reverse.

If I, in python's shell, simply say:

>>> my_string = "some string"

then my_string is a str variable encoded in ascii (and, because ascii is a subset of utf-8, it is also encoded in utf-8).

Therefore, for example, I can convert this into a unicode variable with either of these lines:

>>> my_string.decode('ascii')
u'some string'
>>> my_string.decode('utf-8')
u'some string'

What I don't know:

How does Python handle non-ascii strings that are passed in the shell, and, knowing this, what is the correct way of saving the word "kožušček"?

For example, I can say

>>> s1 = 'kožušček'

In which case s1 becomes a str instance that I am unable to convert into unicode:

>>> s1='kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')

Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)

Now, naturally I can't decode the string with ascii, but which encoding should I use instead? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when fed the line s1 = 'kožušček'?

Another thought I had was to say

>>> s2 = u'kožušček'

But then, when I printed s2, I got

>>> print s2
kouèek

which means that Python lost a whole letter. Can someone explain this to me?

You mean the *interactive interpreter*. It reads from the `stdin` stream, and it is your console or terminal that does the encoding here. — Martijn Pieters, Jul 30 '15 at 07:44
Could you please specify if you are talking about python2 or python3? — MadMike, Jul 30 '15 at 07:48
@MartijnPieters even though this is clear to the experts among the readers, this should still be mentioned in the question — MadMike, Jul 30 '15 at 07:49
@MadMike Python 2 and 3 are not THAT different in this way: except that you call it `unicode` and `str` in python 2 and `str` and `bytes` in python 3 (and ordinary string literals are `str` in both). Therefore you'd know that it's python2 (because you're trying to decode `str` and not `bytes` - and mentioning `unicode`). — skyking, Jul 30 '15 at 08:03
related: [Why does Python print unicode characters when the default encoding is ASCII?](http://stackoverflow.com/q/2596714/4279) — jfs, Jul 30 '15 at 22:15

Martijn Pieters · Accepted Answer · 2015-07-30T08:00:46.660

str objects contain bytes. What those bytes represent Python doesn't dictate. If you produced ASCII-compatible bytes, you can decode them as ASCII. If they contain bytes representing UTF-8 data they can be decoded as such. If they contain bytes representing an image, then you can decode that information and display an image somewhere. When you use repr() on a str object Python will leave any bytes that are ASCII printable as such, the rest are converted to escape sequences; this keeps debugging such information practical even in ASCII-only environments.
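To make this concrete, here is a minimal sketch that decodes the bytes from the question under three different codecs. The helper name escaped is an illustration only, and the byte string is written as a b'...' literal so the snippet should behave the same on Python 2 (where it is a str) and on Python 3 (where it is a bytes object).

```python
def escaped(text):
    """Render text with non-ASCII characters as backslash escapes."""
    return text.encode('unicode_escape').decode('ascii')

# The raw bytes shown by the interpreter in the question.
raw = b'ko\x9eu\x9a\xe8ek'

# repr() keeps ASCII-printable bytes and escapes everything else.
print(repr(raw))

# The same bytes stand for different text depending on the codec assumed.
as_cp1250 = raw.decode('cp1250')   # Central European Windows codepage
as_cp1252 = raw.decode('cp1252')   # Western European Windows codepage
as_latin1 = raw.decode('latin-1')  # maps each byte to the same code point

for name, text in [('cp1250', as_cp1250),
                   ('cp1252', as_cp1252),
                   ('latin-1', as_latin1)]:
    print('%-8s %s' % (name, escaped(text)))
```

Only the last letter-like byte (0xe8) differs between cp1250 and cp1252 here, which is why a wrong guess can look almost right.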

Your terminal or console in which you are running the interactive interpreter writes bytes to the stdin stream that Python reads from when you type. Those bytes are encoded according to the configuration of that terminal or console.

In your case, your console encoded the input you typed to a Windows codepage, most likely. You'll need to figure out the exact codepage and use that codec to decode the bytes. Codepage 1252 seems to fit:

>>> print 'ko\x9eu\x9a\xe8ek'.decode('cp1252')
kožušèek
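If you know what text the bytes are supposed to represent, one way to narrow down the codepage is to try a handful of candidates and keep those that reproduce it exactly. The sketch below is only an illustration: the function name matching_codecs and the candidate list are assumptions, and the expected text is the question's word spelled with escapes (č is U+010D).

```python
def matching_codecs(raw, expected, candidates):
    """Return the codec names under which raw decodes exactly to expected."""
    matches = []
    for name in candidates:
        try:
            decoded = raw.decode(name)
        except UnicodeDecodeError:
            # These bytes are not valid in this codec at all.
            continue
        if decoded == expected:
            matches.append(name)
    return matches

raw = b'ko\x9eu\x9a\xe8ek'               # bytes produced by the console
expected = u'ko\u017eu\u0161\u010dek'    # the word that was typed
candidates = ['ascii', 'utf-8', 'cp1250', 'cp1252', 'cp852', 'latin-1']

print(matching_codecs(raw, expected, candidates))
```

A codec that decodes without an error is not necessarily the right one; comparing against known text is what separates near misses from the actual match.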

When you print those same bytes, your console is reading those bytes and interpreting them in the same codec it is already configured with.

Python can tell you what codec it thinks your console is set to; it tries to detect this information for Unicode literals, where the input has to be decoded for you. It uses the locale.getpreferredencoding() function to determine this, and the sys.stdin and sys.stdout objects have an encoding attribute; mine is set to UTF-8:

>>> import sys
>>> sys.stdin.encoding
'UTF-8'
>>> import locale
>>> locale.getpreferredencoding()
'UTF-8'
>>> 'kožušèek'
'ko\xc5\xbeu\xc5\xa1\xc3\xa8ek'
>>> u'kožušèek'
u'ko\u017eu\u0161\xe8ek'
>>> print u'kožušèek'
kožušèek

Because my terminal has been configured for UTF-8 and Python has detected this, using a Unicode literal u'...' works. The data is automatically decoded by Python.
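When you hold a byte string that came from the console, you can perform the same decoding yourself. The following is a rough sketch under stated assumptions: console_encoding is a hypothetical helper name, and the 'ascii' fallback is an arbitrary conservative choice for streams that report no encoding (for example when input is piped).

```python
import locale
import sys

def console_encoding(stream=None, fallback='ascii'):
    """Best guess at the encoding used by a console stream."""
    if stream is None:
        stream = sys.stdin
    # Streams not attached to a terminal may lack the attribute or
    # report None, so fall back to the locale, then to a fixed default.
    encoding = getattr(stream, 'encoding', None)
    if not encoding:
        encoding = locale.getpreferredencoding()
    return encoding or fallback

print(console_encoding())
```

A byte string read from the console could then be decoded with raw.decode(console_encoding()) rather than relying on the default encoding.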

Why exactly your console lost a whole letter I don't know; I'd have to have access to your console and do some more experiments, see the output of print repr(s2), and test all bytes between 0x00 and 0xFF to see if this is on the input or output side of the console.
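As a starting point for that kind of experiment, the sketch below tabulates how every byte value decodes under a given codec, so two codecs can be compared byte by byte. It is not a diagnosis of the console in the question; the function names are hypothetical, and cp1250 and latin-1 are picked only as example codecs.

```python
def decode_table(codec):
    """Map each byte value 0x00-0xFF to its character, or None if undefined."""
    table = {}
    for value in range(256):
        byte = bytes(bytearray([value]))  # one-byte string on Python 2 and 3
        try:
            table[value] = byte.decode(codec)
        except UnicodeDecodeError:
            table[value] = None
    return table

def escaped(text):
    """Render text with non-ASCII characters as backslash escapes."""
    return text.encode('unicode_escape').decode('ascii')

cp1250 = decode_table('cp1250')
latin1 = decode_table('latin-1')

# Byte values whose meaning depends on which of the two codecs is assumed.
differing = [v for v in range(256) if cp1250[v] != latin1[v]]
print('%d byte values differ' % len(differing))

# The three non-ASCII bytes from the question.
for value in (0x9e, 0x9a, 0xe8):
    print('%s  cp1250=%s  latin-1=%s'
          % (hex(value), escaped(cp1250[value]), escaped(latin1[value])))
```

Comparing such a table with what the console actually displays for each byte would show whether characters go wrong on input (decoding) or on output (encoding).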

I recommend you read up on Python and Unicode:

Pragmatic Unicode by Ned Batchelder
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
The Python Unicode HOWTO

It was cp1250, but thank you! Still, how does the second answer fit into this? Why does `u'kožušček'` produce such a mess? — 5xum, Jul 30 '15 at 07:59
Thanks for the links, I read most of them (especially the NO EXCUSES one), the problem I had was just the shell -> string part. It's clearer now. Thank you. — 5xum, Jul 30 '15 at 08:01
If this is in the Windows command prompt, then know that that console has huge problems with Unicode, at least in the way Python interacts with it, and the default font choices made by Microsoft. — Martijn Pieters, Jul 30 '15 at 08:03
@5xum: Are you using by any chance IDLE? There is a bug that [IDLE uses `latin-1` instead of your locale encoding to decode Unicode literals](https://bugs.python.org/issue15809). A similar bug may be present on Python 2 in other parts i.e., the error might happen at the reading of Unicode literals part too. What do you see if you run: `print u'ko\u017eu\u0161\xe8ek'` instead? (note: no non-ascii chars in the literal). Note: `cp1250` is (very likely) not your console encoding (Windows uses a different range). [Use `WriteConsoleW()`, to print Unicode](http://stackoverflow.com/a/3259271/4279) — jfs, Jul 30 '15 at 20:46

score 2 · Answer 2 · answered Jul 30 '15 at 08:18

Your system does not necessarily use the sys.getdefaultencoding() encoding; that is merely the default Python falls back on when you convert between str and unicode without naming an encoding, as in:

>>> sys.getdefaultencoding()
'ascii'
>>> unicode(s1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 2: ordinal not in range(128)
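The implicit conversion above amounts to decoding with the default encoding. A small sketch that spells this out explicitly follows; it assumes the bytes are UTF-8, as they would be on a UTF-8 terminal, and uses a b'...' literal so it should run on both Python 2 and Python 3.

```python
# UTF-8 bytes for the word from the question.
raw = b'ko\xc5\xbeu\xc5\xa1\xc4\x8dek'

# Decoding as ASCII is what the implicit conversion attempts when the
# default encoding is 'ascii'; it fails at the first non-ASCII byte.
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    failed_at = exc.start
    print('ascii failed at byte offset %d' % failed_at)

# Naming the encoding explicitly avoids the default altogether.
text = raw.decode('utf-8')
print(len(raw), len(text))  # byte count versus character count
```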

Python's idea of your system locale is in the locale module:

>>> import locale
>>> locale.getdefaultlocale()
('en_US', 'UTF-8')
>>> locale.getpreferredencoding()
'UTF-8'

And using this we can decode the string:

>>> u1=s1.decode(locale.getdefaultlocale()[1])
>>> u1
u'ko\u017eu\u0161\u010dek'
>>> print u1
kožušček

There's a chance the locale has not been set up, as is the case for the 'C' locale. That may cause the reported encoding to be None even though the default is 'ascii'. Normally figuring this out is the job of setlocale, which getpreferredencoding will automatically call. I would suggest calling it once in your program startup and saving the value returned for all further use. The encoding used for filenames may also be yet another case, reported in sys.getfilesystemencoding().
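A minimal sketch of that startup step is shown below. The names detect_locale_encoding and LOCALE_ENCODING are hypothetical, and falling back to 'ascii' when nothing usable is reported is an assumption chosen to match Python 2's default.

```python
import codecs
import locale

def detect_locale_encoding(fallback='ascii'):
    """Return the locale's preferred encoding, or fallback if unusable."""
    encoding = locale.getpreferredencoding() or fallback
    try:
        # Make sure Python actually has a codec under that name.
        codecs.lookup(encoding)
    except LookupError:
        encoding = fallback
    return encoding

# Computed once at program startup and reused everywhere else.
LOCALE_ENCODING = detect_locale_encoding()

print(LOCALE_ENCODING)
```

Byte strings from the environment can then be decoded with raw.decode(LOCALE_ENCODING) without querying the locale again.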

The Python-internal default encoding is set up by the site module, which contains:

def setencoding():
    """Set the string encoding used by the Unicode implementation.  The
    default is 'ascii', but if you're willing to experiment, you can
    change this."""
    encoding = "ascii" # Default value set by _PyUnicode_Init()
    if 0:
        # Enable to support locale aware default string encodings.
        import locale
        loc = locale.getdefaultlocale()
        if loc[1]:
            encoding = loc[1]
    if 0:
        # Enable to switch off string to Unicode coercion and implicit
        # Unicode to string conversion.
        encoding = "undefined"
    if encoding != "ascii":
        # On Non-Unicode builds this will raise an AttributeError...
        sys.setdefaultencoding(encoding) # Needs Python Unicode build !

So if you want it set by default in every run of Python, you can change that first if 0 to if 1.
