From Dive into Python:
In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.
I don't understand what the author means by that.
When I say s = 'hello'
, how is s
encoded internally? Of course it must use some use some encoding. He says all strings are sequences of Unicode characters. But how many bytes is each character? Is this string UTF-8? Why does he say : "There is no such thing as a Python string encoded in UTF-8".
I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.