**TL;DR:** Using `unicode.decode` and `str.encode` means you aren't using the right types to represent your data. These methods don't even exist on the equivalent types in Python 3.
A `unicode` value is a sequence of Unicode code points: integers interpreted as particular characters. A `str`, on the other hand, is a sequence of bytes.
For example, `à` is Unicode code point U+00E0. The UTF-8 encoding represents it with a pair of bytes, 0xC3 and 0xA0.
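You can check both claims in a Python 2 interpreter (a quick REPL demonstration, nothing library-specific):

```
>>> hex(ord(u'\xe0'))                                 # the code point, as an integer
'0xe0'
>>> [hex(ord(b)) for b in u'\xe0'.encode('utf-8')]    # its UTF-8 bytes
['0xc3', '0xa0']
```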
The `unicode.encode` method takes a Unicode string (a sequence of code points) and returns the byte-level encoding of each code point, concatenated into a single byte string.
```
>>> ua = u'\xe0'        # u'à'
>>> ua.encode('utf-8')
'\xc3\xa0'
```
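Note the length difference: one code point on the `unicode` side becomes two bytes on the `str` side:

```
>>> len(ua), len(ua.encode('utf-8'))
(1, 2)
```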
`str.decode` takes a byte string and attempts to produce the equivalent `unicode` value:

```
>>> '\xc3\xa0'.decode('utf-8')
u'\xe0'
```
(`u'\xe0'` is equivalent to `u'à'`.)
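Encoding and decoding with the same codec are inverses, so a round trip returns the original value:

```
>>> ua.encode('utf-8').decode('utf-8') == ua
True
```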
As for your errors: Python 2 doesn't enforce a strict separation between how `unicode` and `str` are used. It doesn't really make sense to encode a `str` if it is already an encoded value, and it doesn't make sense to decode a `unicode` value because it's not encoded in the first place. Rather than pick apart exactly how the errors occur, I'll just point out that in Python 3 there are two types: `bytes` is a string of bytes (corresponding to Python 2 `str`), and `str` is a Unicode string (corresponding to Python 2 `unicode`). The "nonsensical" methods don't even exist in Python 3:
```
>>> bytes.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'bytes' has no attribute 'encode'
>>> str.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'str' has no attribute 'decode'
```
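The sensible directions still exist, of course: in Python 3, `str.encode` returns `bytes` and `bytes.decode` returns `str`:

```
>>> 'à'.encode('utf-8')          # Python 3: str (text) -> bytes
b'\xc3\xa0'
>>> b'\xc3\xa0'.decode('utf-8')  # bytes -> str (text)
'à'
```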
So the calls that raised `Unicode*Error` exceptions in Python 2 would simply raise an `AttributeError` in Python 3.
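For reference, here is what actually happens under Python 2: calling `encode` on a `str` first decodes it with the default ASCII codec, which is why an *encode* call can surface as a `UnicodeDecodeError`:

```
>>> '\xc3\xa0'.encode('utf-8')   # Python 2 implicitly tries .decode('ascii') first
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
```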
If you are stuck supporting Python 2, just follow these rules (a sketch putting them together follows the list):

- `unicode` is for text
- `str` is for binary data
- `unicode.encode` produces a `str` value
- `str.decode` produces a `unicode` value
- If you find yourself trying to call `str.encode`, you are using the wrong type.
- If you find yourself trying to call `unicode.decode`, you are using the wrong type.
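Putting the rules together, a typical Python 2 pattern is to decode bytes to `unicode` as soon as they enter your program and encode back to `str` only on the way out. A minimal sketch (the file names are hypothetical):

```
# Python 2: decode early, work in unicode, encode late.
with open('input.txt', 'rb') as f:        # hypothetical input file
    text = f.read().decode('utf-8')       # str -> unicode at the boundary
assert isinstance(text, unicode)

processed = text.upper()                  # all processing happens on unicode

with open('output.txt', 'wb') as f:       # hypothetical output file
    f.write(processed.encode('utf-8'))    # unicode -> str on the way out
```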