I see that the Python manual mentions the .encode() and .decode() string methods. Playing around in the Python interactive interpreter, I see that I can create unicode strings such as u'hello', which have a different datatype than a 'regular' string 'hello', and that I can convert / cast between them with str(). But the real problems start when using characters outside the ASCII range (code points above 127), such as u'שלום', and I am having a hard time determining empirically exactly what is happening.
Stack Overflow is overflowing with examples of confusion regarding Python's unicode and string-encoding/decoding handling.
What exactly happens (how are the bytes changed, and how is the datatype changed) when encoding and decoding strings with the str() function, especially when the string includes characters that cannot be represented in 7 bits? Is it true, as it seems, that a Python variable with datatype <type 'str'> can be both encoded and decoded? If it is encoded, I understand that to mean that the string is represented in UTF-8, ISO-8859-1, or some other encoding; is this correct? If it is decoded, what does this mean? Are decoded strings unicode? If so, then why don't they have the datatype <type 'unicode'>?
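To make the question concrete, here is a minimal sketch of the round trip as I understand it in Python 3, where str holds Unicode text and bytes holds encoded data. The variable names are only illustrative, and this does not cover Python 2, where str is the byte type and unicode is the text type.

```python
# A minimal Python 3 sketch; variable names are illustrative.
text = 'שלום'                      # str: a sequence of Unicode code points
utf8_bytes = text.encode('utf-8')  # encode: str -> bytes

print(type(text).__name__, len(text))              # length counted in code points
print(type(utf8_bytes).__name__, len(utf8_bytes))  # length counted in bytes

# decode: bytes -> str, using the same encoding that produced the bytes
round_trip = utf8_bytes.decode('utf-8')

# The same text yields different bytes under a different encoding.
hebrew_bytes = text.encode('iso-8859-8')

# ASCII covers only code points 0-127, so these letters cannot be encoded.
try:
    text.encode('ascii')
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True
print(ascii_failed)
```

If this sketch is right, encoding always goes from text to bytes and decoding from bytes to text, and what I would like to understand is how that maps onto the Python 2 types.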
In the interest of those who will read this later, I think that both Python 2 and Python 3 should be addressed. Thank you!