7

I understand Unicode, encoding, and decoding. What I don't understand is why the encode method works on the str type; I expected it to work only on the unicode type. So my question is: what does encode do when it is called on a str rather than a unicode object?

Mark Tolonen
Ali Baba

3 Answers

9

In Python 2 there are two types of codecs available: those that convert between str and unicode, and those that convert from str to str. Examples of the latter are the base64 and rot13 codecs.

The str.encode() method exists to support the latter:

'binary data'.encode('base64')

But because the method exists, people also call it with the unicode -> str codecs. Encoding proper can only go from unicode to str (and decoding the other way), so to support these calls Python first implicitly decodes your str value to unicode, using the ASCII codec, and only then encodes the result.

Incidentally, when using a str -> str codec on a unicode object, Python first implicitly encodes to str using the same ASCII codec.

In Python 3, this has been solved by a) removing the bytes.encode() and str.decode() methods (remember that bytes is roughly the old str, and str is the new unicode), and b) moving the str -> str encodings to the codecs module only, via the codecs.encode() and codecs.decode() functions. Which codecs transform between values of the same type has also been clarified and updated; see the Python Specific Encodings section. Note that the 'text' encodings listed there, where available in Python 2, encode to str instead.
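As a minimal Python 3 sketch of point b), the same-type transforms can be reached through the codecs module. The sample values and variable names below are only illustrative, and the codec aliases 'base64' and 'rot_13' are assumed to be available, as they are in current CPython releases.

```python
import codecs

# bytes -> bytes transform: the Python 3 counterpart of
# 'binary data'.encode('base64') from Python 2.
encoded = codecs.encode(b'binary data', 'base64')
decoded = codecs.decode(encoded, 'base64')

# str -> str transform: rot13 works on text, not on bytes.
scrambled = codecs.encode('Hello', 'rot_13')
restored = codecs.decode(scrambled, 'rot_13')

# The methods that invited the implicit conversions are gone.
bytes_has_encode = hasattr(bytes, 'encode')
str_has_decode = hasattr(str, 'decode')

print(encoded, decoded)
print(scrambled, restored)
print(bytes_has_encode, str_has_decode)
```

Because the transform is requested through codecs.encode() rather than a method on the value, there is no point at which an implicit ASCII conversion could be triggered.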

Martijn Pieters
4

Python realizes that it can't do an encode on a str type, so it tries to decode it first! It uses the 'ascii' codec, which will fail if the string contains any byte above 0x7f.

This is why you sometimes see a decode error raised when you were trying to do an encode.
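The two-step behaviour can be imitated in Python 3 to make the source of the error visible. This is only a rough sketch: py2_style_encode is a hypothetical helper, not part of any library, and it assumes the implicit decode uses 'ascii'.

```python
def py2_style_encode(raw, encoding, default_encoding='ascii'):
    """Hypothetical helper imitating Python 2's str.encode() for text codecs."""
    # Step 1 (implicit in Python 2): decode the byte string to text.
    text = raw.decode(default_encoding)
    # Step 2 (what the caller actually asked for): encode the text.
    return text.encode(encoding)


# Pure ASCII input survives the hidden decode step.
print(py2_style_encode(b'hi', 'utf-8'))

# A byte above 0x7f fails in step 1, so a *decode* error is raised
# even though the caller only asked for an encode.
try:
    py2_style_encode(b'hi\xff', 'utf-8')
except UnicodeDecodeError as exc:
    print(type(exc).__name__, exc)
```

The exception comes from the decode in step 1, before the requested codec is ever used.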

Mark Ransom
  • nitpick: it uses `sys.getdefaultencoding()` (which is almost always `'ascii'`) – wim Feb 26 '16 at 21:58
  • @wim thanks for that, I didn't know it - I've never seen `sys.getdefaultencoding` return anything other than `'ascii'`. – Mark Ransom Feb 26 '16 at 22:00
  • @MarkRansom: that's because `sys.setdefaultencoding` is removed by `site.py`. `reload(sys)` will bring it back, but setting the default to anything but `ascii` is a [*very bad idea*](https://stackoverflow.com/questions/28657010/dangers-of-sys-setdefaultencodingutf-8). You often see `import sys; reload(sys); sys.setdefaultencoding(...)` cargo-culted on questions about Unicode problems. – Martijn Pieters Feb 26 '16 at 23:18
3

In Python 3, encoding a bytestring simply does not work.

>>> b'hi'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'bytes' object has no attribute 'encode'

Python 2 tries to be helpful when you call encode on a str: it first decodes the string with sys.getdefaultencoding() (usually ascii) and then encodes the result.

That's why you get the rather weird error message that decoding with ascii is not possible when you try to encode with utf-8.

>>> 'hi\xFF'.encode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 2: ordinal not in range(128)

Ned explains it better than I do; watch this from 16:20 onward.

timgeb