When a string literal is not prefixed with u (as in u'...') in Python 2.7.x, what the interpreter sees is a str: a byte string, without an explicit encoding.
If you do not tell the interpreter what to do with those bytes when executing unicode(), it will (as you saw) default to trying to decode the bytes it sees via the ascii encoding scheme.
It does so as a preliminary step in trying to turn the plain bytes of the str into a unicode object.
Using ascii to decode means: try to interpret each byte of the str using a hard-coded mapping that only covers byte values between 0 and 127.
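A minimal sketch of that mapping, written so it should run on both Python 2.7 and Python 3. The variable names are only illustrative.

```python
# Build a byte string holding every byte value from 0 to 127.
ascii_bytes = bytes(bytearray(range(128)))

# Each of those bytes has an entry in the ascii mapping, so this succeeds.
decoded = ascii_bytes.decode('ascii')

# Under ascii, the code point of each decoded character equals the byte value.
code_points = [ord(ch) for ch in decoded]
```

Every value in `code_points` stays within 0 to 127, which is the whole range ascii knows about.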
The error you encountered was like a dict KeyError: the interpreter encountered a byte for which the ascii encoding scheme does not have a specified mapping.
Since the interpreter doesn't know what to do with the byte, it raises an error.
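To make the KeyError analogy concrete, here is a rough sketch. The dict `toy_ascii` is a made-up stand-in for the ascii table, not how the codec is actually implemented, and the sample bytes are an assumed example (the UTF-8 encoding of "café").

```python
# A toy stand-in for the ascii table: only keys 0..127 exist.
toy_ascii = dict((n, chr(n)) for n in range(128))

# Assumed sample input: the last two bytes (0xC3, 0xA9) are above 127.
raw = bytearray(b'caf\xc3\xa9')

missing_key = None
try:
    text = ''.join(toy_ascii[b] for b in raw)
except KeyError as err:
    # The lookup fails on the first byte that has no entry.
    missing_key = err.args[0]

failed_at = None
try:
    bytes(raw).decode('ascii')
except UnicodeDecodeError as err:
    # The real codec fails at the same position, with its own error type.
    failed_at = err.start
```

Both attempts stop at the same byte: index 3, value 0xC3.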
You can change that preliminary step by telling the interpreter to decode the bytes using another set of encoding/decoding mappings instead, one that goes beyond ascii, such as UTF-8, as elaborated in other answers.
If the interpreter finds a mapping in the chosen scheme for each byte (or sequence of bytes) in the str, it will decode successfully and use the resulting mappings to produce a unicode object.
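For example, assuming the bytes really were produced with UTF-8 (the sample value here is made up), an explicit decode might look like this:

```python
# Assumed sample input: the UTF-8 bytes for "café".
raw = b'caf\xc3\xa9'

# Name the scheme explicitly instead of relying on the ascii default.
# In Python 2.7, unicode(raw, 'utf-8') should give the same result.
text = raw.decode('utf-8')

# Five bytes became four code points: 0xC3 0xA9 together map to U+00E9.
byte_count = len(raw)
code_point_count = len(text)
```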
A Python unicode object is a series of Unicode code points. The Unicode code space runs from U+0000 to U+10FFFF, which is 1,114,112 code points; excluding the 2,048 surrogates leaves 1,112,064 valid code points.
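A short sketch of both points: looking at the code points inside a decoded string, and where the 1,112,064 figure comes from. The sample text is arbitrary.

```python
# Arbitrary sample text: "café €" written with escapes so the source stays ascii.
text = u'caf\xe9 \u20ac'

# ord() gives the code point of each character.
code_points = [ord(ch) for ch in text]

# The code space runs from U+0000 to U+10FFFF.
total = 0x10FFFF + 1
# Surrogates (U+D800..U+DFFF) are reserved for UTF-16 and are not characters.
surrogates = 0xDFFF - 0xD800 + 1
non_surrogate = total - surrogates
```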
And if the scheme you choose for decoding is the one with which your text (or code points) was encoded, then the output when printing should be identical to the original text.
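A quick way to convince yourself of this is a round trip, compared against a decode with a mismatched scheme. This is only a sketch; latin-1 is picked here as an arbitrary "wrong" choice.

```python
original = u'caf\xe9'  # "café"

# Encode to bytes with UTF-8.
encoded = original.encode('utf-8')

# Decoding with the same scheme gives back the original text.
round_tripped = encoded.decode('utf-8')

# Decoding with a different scheme does not fail here, but the text is wrong:
# latin-1 maps each byte to one character, so 0xC3 0xA9 becomes two characters.
garbled = encoded.decode('latin-1')
```

Note that the mismatched decode raises no error at all, so a successful decode does not by itself prove you picked the right scheme.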
You can also consider trying Python 3. The relevant difference is explained in the first comment below.