I understand the basics of unicode, encoding and decoding, but I don't understand why the `encode` function works on the `str` type. I expected it to work only on the `unicode` type. So my question is: what does `encode` do when it is called on a `str` rather than a `unicode`?
- What do you think happens to unicode when it is encoded and decoded? – kojiro Feb 26 '16 at 21:48
- Use Python 3 and most of the confusion will be gone. – Alyssa Haroldsen Feb 26 '16 at 21:48
3 Answers
In Python 2 there are two types of codecs available: those that convert between `str` and `unicode`, and those that convert from `str` to `str`. Examples of the latter are the `base64` and `rot13` codecs.
The `str.encode()` method exists to support the latter:
'binary data'.encode('base64')
But now that it exists, people also use it for the `unicode` -> `str` codecs, where encoding can only go from `unicode` to `str` (and decoding the other way). To support these, Python will implicitly decode your `str` value to `unicode` first, using the ASCII codec, before finally encoding.
Incidentally, when using a `str` -> `str` codec on a `unicode` object, Python first implicitly encodes to `str` using the same ASCII codec.
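One way to picture both implicit conversions is to emulate them in Python 3, where nothing is converted behind your back. This is only a sketch of the behavior described above, not Python 2's actual implementation: the helper names `py2_str_encode` and `py2_unicode_bytes_codec` are hypothetical, and the default encoding is assumed to be `ascii`.

```python
import codecs

DEFAULT_ENCODING = "ascii"  # assumption: the usual Python 2 default


def py2_str_encode(data: bytes, encoding: str) -> bytes:
    """Emulate Python 2's 'some str'.encode(encoding) for a text codec."""
    # Step 1 (implicit in Python 2): decode the byte string to text.
    text = data.decode(DEFAULT_ENCODING)
    # Step 2 (what the caller asked for): encode the text.
    return text.encode(encoding)


def py2_unicode_bytes_codec(text: str, codec: str) -> bytes:
    """Emulate Python 2's u'text'.encode('base64') and similar."""
    # Step 1 (implicit in Python 2): encode the text to a byte string.
    raw = text.encode(DEFAULT_ENCODING)
    # Step 2: apply the bytes-to-bytes codec.
    return codecs.encode(raw, codec)


# Pure ASCII input survives the hidden decode step.
ok = py2_str_encode(b"hi", "utf-8")

# A non-ASCII byte fails in the hidden decode step, not in the encode step.
try:
    py2_str_encode(b"hi\xff", "utf-8")
    failed_step = None
except UnicodeDecodeError as exc:
    failed_step = "decode with " + exc.encoding
```

Under these assumptions, `failed_step` ends up naming the ASCII decode even though the caller only asked for a UTF-8 encode.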
In Python 3, this has been solved by a) removing the `bytes.encode()` and `str.decode()` methods (remember that `bytes` is sorta the old `str` and `str` the new `unicode`), and b) by moving the `str` -> `str` encodings to the `codecs` module only, using the `codecs.encode()` and `codecs.decode()` functions. Which codecs transform between the same type has also been clarified and updated, see the Python Specific Encodings section; note that the 'text' encodings noted there, where available in Python 2, encode to `str` instead.
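As a small Python 3 illustration of both points, the following sketch checks that the two methods are gone and runs the same-type codecs through the `codecs` module. The variable names are only for illustration.

```python
import codecs

# In Python 3, bytes has no encode() and str has no decode().
bytes_has_encode = hasattr(b"", "encode")
str_has_decode = hasattr("", "decode")

# A bytes -> bytes codec, reached through the codecs module.
encoded = codecs.encode(b"binary data", "base64")
decoded = codecs.decode(encoded, "base64")

# A str -> str codec, reached the same way.
scrambled = codecs.encode("hello", "rot13")
restored = codecs.decode(scrambled, "rot13")

print(bytes_has_encode, str_has_decode)
print(encoded, decoded)
print(scrambled, restored)
```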

- Thanks for your answer. Besides, I didn't know about base64 and rot13! – Ali Baba Feb 27 '16 at 09:53
Python realizes that it can't do an `encode` on a `str` type, so it tries to `decode` it first! It uses the `'ascii'` codec, which will fail if the string contains any byte with a value above 0x7f.
This is why you sometimes see a `decode` error raised when you were trying to do an `encode`.

- nitpick: it uses `sys.getdefaultencoding()` (which is almost always `'ascii'`) – wim Feb 26 '16 at 21:58
- @wim thanks for that, I didn't know it - I've never seen `sys.getdefaultencoding` return anything other than `'ascii'`. – Mark Ransom Feb 26 '16 at 22:00
- @MarkRansom: that's because `sys.setdefaultencoding` is removed by `site.py`. `reload(sys)` will bring it back, but setting the default to anything but `ascii` is a [*very bad idea*](https://stackoverflow.com/questions/28657010/dangers-of-sys-setdefaultencodingutf-8). You often see `import sys; reload(sys); sys.setdefaultencoding(...)` cargo-culted on questions about Unicode problems. – Martijn Pieters Feb 26 '16 at 23:18
In Python 3, encoding a bytestring simply does not work.
>>> b'hi'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'
Python 2 tries to be helpful when you call `encode` on a `str`: it first tries to decode the string with `sys.getdefaultencoding()` (usually ascii) and then encodes the result.
That's why you get the rather weird error message that decoding with ascii is not possible when you try to encode with utf-8.
>>> 'hi\xFF'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 2: ordinal not in range(128)
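To see where each part of that message comes from, this sketch reproduces only the hidden decode step in Python 3 and inspects the exception. It assumes the default encoding is `ascii`, as in the traceback above; the variable names are only for illustration.

```python
# Reproduce just the implicit step: decoding the byte string as ASCII.
try:
    b"hi\xff".decode("ascii")
except UnicodeDecodeError as exc:
    codec_name = exc.encoding           # the codec that failed
    bad_position = exc.start            # index of the offending byte
    bad_byte = exc.object[exc.start]    # the byte value itself
    reason = exc.reason                 # why the codec rejected it

print(codec_name, bad_position, hex(bad_byte), reason)
```

The utf-8 codec never appears in the error because the failure happens before any encoding starts.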
Ned explains it better than I do; watch this from 16:20 onward.
