**TL;DR:** Using `unicode.decode` and `str.encode` means you aren't using the right types to represent your data. These methods don't even exist on the equivalent types in Python 3.
A `unicode` value is a sequence of Unicode code points: integers interpreted as particular characters. A `str`, on the other hand, is a sequence of bytes.
For example, `à` is Unicode code point U+00E0. The UTF-8 encoding represents it with a pair of bytes, 0xC3 and 0xA0.
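You can check both claims in a Python 2 interpreter (a quick REPL demonstration, nothing library-specific):

```
>>> hex(ord(u'\xe0'))                                 # the code point, as an integer
'0xe0'
>>> [hex(ord(b)) for b in u'\xe0'.encode('utf-8')]    # its UTF-8 bytes
['0xc3', '0xa0']
```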
The `unicode.encode` method takes a Unicode string (a sequence of code points) and returns the byte-level encoding of each code point, concatenated into a single byte string.
```
>>> ua = u'\xe0'        # u'à'
>>> ua.encode('utf-8')
'\xc3\xa0'
```
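Note the length difference: one code point on the `unicode` side becomes two bytes on the `str` side:

```
>>> len(ua), len(ua.encode('utf-8'))
(1, 2)
```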
`str.decode` takes a byte string and attempts to produce the equivalent `unicode` value:

```
>>> '\xc3\xa0'.decode('utf-8')
u'\xe0'
```
(`u'\xe0'` is equivalent to `u'à'`.)
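Encoding and decoding with the same codec are inverses, so a round trip returns the original value:

```
>>> ua.encode('utf-8').decode('utf-8') == ua
True
```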
As for your errors: Python 2 doesn't enforce a strict separation between how `unicode` and `str` are used. It doesn't really make sense to encode a `str` if it is already an encoded value, and it doesn't make sense to decode a `unicode` value because it's not encoded in the first place. Rather than pick apart exactly how the errors occur, I'll just point out that in Python 3 there are two types: `bytes` is a string of bytes (corresponding to Python 2 `str`), and `str` is a Unicode string (corresponding to Python 2 `unicode`). The "nonsensical" methods don't even exist in Python 3:
```
>>> bytes.encode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'bytes' has no attribute 'encode'
>>> str.decode
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: type object 'str' has no attribute 'decode'
```
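The sensible directions still exist, of course: in Python 3, `str.encode` returns `bytes` and `bytes.decode` returns `str`:

```
>>> 'à'.encode('utf-8')          # Python 3: str (text) -> bytes
b'\xc3\xa0'
>>> b'\xc3\xa0'.decode('utf-8')  # bytes -> str (text)
'à'
```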
So the calls that raised `Unicode*Error` exceptions in Python 2 would simply raise an `AttributeError` in Python 3.
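For reference, here is what actually happens under Python 2: calling `encode` on a `str` first decodes it with the default ASCII codec, which is why an *encode* call can surface as a `UnicodeDecodeError`:

```
>>> '\xc3\xa0'.encode('utf-8')   # Python 2 implicitly tries .decode('ascii') first
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)
```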
If you are stuck supporting Python 2, just follow these rules (a sketch putting them together follows the list):

- `unicode` is for text
- `str` is for binary data
- `unicode.encode` produces a `str` value
- `str.decode` produces a `unicode` value
- If you find yourself trying to call `str.encode`, you are using the wrong type.
- If you find yourself trying to call `unicode.decode`, you are using the wrong type.
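Putting the rules together, a typical Python 2 pattern is to decode bytes to `unicode` as soon as they enter your program and encode back to `str` only on the way out. A minimal sketch (the file names are hypothetical):

```
# Python 2: decode early, work in unicode, encode late.
with open('input.txt', 'rb') as f:        # hypothetical input file
    text = f.read().decode('utf-8')       # str -> unicode at the boundary
assert isinstance(text, unicode)

processed = text.upper()                  # all processing happens on unicode

with open('output.txt', 'wb') as f:       # hypothetical output file
    f.write(processed.encode('utf-8'))    # unicode -> str on the way out
```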