I still do not fully understand how Python's unicode and str types work. Note: I am working in Python 2; as far as I know, Python 3 takes a completely different approach to the same issue.
What I know:
str is an older beast that stores strings encoded in one of the far too many encodings that history has forced us to work with.
unicode is a more standardised way of representing strings, using a huge table of all possible characters, emojis, little pictures of dog poop and so on.
The decode function transforms a str into unicode; encode does the reverse.
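To make sure I have the direction of the two functions right, here is a minimal sketch of a round trip. The variable names and the choice of the two codecs (utf-8 and iso-8859-2) are just assumptions for illustration, and the byte strings use the b'' prefix so the snippet also runs on Python 3.

```python
# A unicode object holds code points; this one is the single letter U+017E.
text = u'\u017e'

# encode: unicode -> bytes (a str in Python 2), one result per codec
as_utf8 = text.encode('utf-8')
as_latin2 = text.encode('iso-8859-2')

# The same character becomes different bytes under different encodings.
print(repr(as_utf8))
print(repr(as_latin2))

# decode: bytes -> unicode, and it only round-trips with the matching codec
back_from_utf8 = as_utf8.decode('utf-8')
back_from_latin2 = as_latin2.decode('iso-8859-2')
```

If this is right, a str on its own does not say which encoding produced it; I have to know the codec to get the original text back.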
If I, in Python's shell, simply say:
>>> my_string = "some string"
then my_string is a str variable encoded in ascii (and, because ascii is a subset of utf-8, it is also encoded in utf-8).
Therefore, for example, I can convert this into a unicode variable with either of these lines:
>>> my_string.decode('ascii')
u'some string'
>>> my_string.decode('utf-8')
u'some string'
What I don't know:
How does Python handle non-ascii strings that are typed into the shell and, knowing this, what is the correct way of saving the word "kožušček"?
For example, I can say
>>> s1 = 'kožušček'
in which case s1 becomes a str instance that I am unable to convert into unicode:
>>> s1='kožušček'
>>> s1
'ko\x9eu\x9a\xe8ek'
>>> print s1
kožušček
>>> s1.decode('ascii')
Traceback (most recent call last):
  File "<pyshell#23>", line 1, in <module>
    s1.decode('ascii')
UnicodeDecodeError: 'ascii' codec can't decode byte 0x9e in position 2: ordinal not in range(128)
Now, naturally I can't decode the string with ascii, but which encoding should I use instead? After all, my sys.getdefaultencoding() returns ascii! Which encoding did Python use to encode s1 when fed the line s1='kožušček'?
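One thing I can do is brute-force the question: take the bytes from the repr of s1 and try a few codecs until one gives back the word I typed. This is only a sketch. The candidate list is my own guess, not something Python reported, and I write the expected word with \u escapes so the snippet does not depend on how the shell handles non-ascii input.

```python
# Bytes copied from the repr of s1 shown above.
raw = b'ko\x9eu\x9a\xe8ek'

# The word I actually typed, spelled with code point escapes.
expected = u'ko\u017eu\u0161\u010dek'

# Hypothetical candidates; extend this list with whatever seems plausible.
candidates = ['ascii', 'utf-8', 'cp1250']

matches = []
for name in candidates:
    try:
        decoded = raw.decode(name)
    except UnicodeDecodeError:
        continue  # these bytes are not valid in this codec
    if decoded == expected:
        matches.append(name)

print(matches)
```

A match only shows that a codec is consistent with these particular bytes. It still does not tell me where the shell gets that encoding from, which is what I would like to understand.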
Another thought I had was to say
>>> s2 = u'kožušček'
But then, when I printed s2, I got
>>> print s2
kouèek
which means that Python dropped two letters (ž and š) and turned č into è. Can someone explain this to me?
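One guess I can at least test: suppose the shell handed Python the same bytes as in s1, and the u'' literal treated each byte as a Latin-1 code point. The sketch below only checks whether that hypothesis would reproduce the printed output. Both the Latin-1 assumption and the idea that the display swallows control characters are my speculation, not documented behaviour that I have confirmed.

```python
# Same bytes as in the repr of s1.
raw = b'ko\x9eu\x9a\xe8ek'

# Hypothesis: each byte was mapped straight to the code point of the same value.
guess = raw.decode('latin-1')

# U+0080..U+009F are C1 control characters, which a display may not render at all.
visible = u''.join(ch for ch in guess if not (0x80 <= ord(ch) <= 0x9f))

print(repr(guess))
print(repr(visible))
```

If this guess is correct, the letters were not exactly lost: two of them turned into invisible control characters, and the third landed on the code point of a different letter.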