
From Dive into Python:

In Python 3, all strings are sequences of Unicode characters. There is no such thing as a Python string encoded in UTF-8, or a Python string encoded as CP-1252. “Is this string UTF-8?” is an invalid question. UTF-8 is a way of encoding characters as a sequence of bytes. If you want to take a string and turn it into a sequence of bytes in a particular character encoding, Python 3 can help you with that. If you want to take a sequence of bytes and turn it into a string, Python 3 can help you with that too. Bytes are not characters; bytes are bytes. Characters are an abstraction. A string is a sequence of those abstractions.

I don't understand what the author means by that.

When I say s = 'hello', how is s encoded internally? Of course it must use some encoding. He says all strings are sequences of Unicode characters. But how many bytes is each character? Is this string UTF-8? Why does he say: "There is no such thing as a Python string encoded in UTF-8"?

I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.

jabaldonedo
batman
  • `s` stores a reference to a String object.. and if you go all the way down, it's all just 0s and 1s.. in the middle, the String object somewhere contains some bytes and there is some code that understands the bytes, is it important what the bytes are, exactly? – Aprillion Sep 20 '13 at 09:45
  • @deathApril, no. But it wouldn't hurt to know the encoding. What made me ask is the author saying "there is no such thing as a Python string encoded in UTF-8". Why not? If Python uses UTF-8 internally, why not say that? – batman Sep 20 '13 at 09:48

3 Answers


The author is comparing strings in Python 2 and 3. In Python 2, strings were represented as byte arrays, which caused a lot of problems when dealing with non-ASCII characters. Programmers always had to keep track of the current encoding of the strings in their applications (e.g. the encoding of the text on an HTML page). Python 2.x attempted to solve this by introducing Unicode objects:

s  = 'text'    # string/byte array object 
un = u'text'   # unicode object

But many applications still used normal, old-style strings.

So, in Python 3 it was decided to separate strings (making them all Unicode) and byte arrays. Thus, in Python 3 we have:

s = 'text'                                 # string/unicode object
b = bytes([0xA2, 0x01, 0x02, 0x03, 0x04])  # bytes object (immutable sequence of bytes)
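
As a minimal sketch of that split (the variable names are illustrative and not from the original answer), encoding turns a string into bytes under a chosen encoding, and decoding goes the other way:

```python
# Python 3: str holds Unicode code points, bytes holds raw byte values.
text = 'caf\u00e9'                     # the four characters c, a, f, é

utf8_bytes = text.encode('utf-8')      # str -> bytes, using UTF-8
latin1_bytes = text.encode('latin-1')  # same str, different byte sequence

# The same characters produce different bytes under different encodings.
print(utf8_bytes)    # b'caf\xc3\xa9'
print(latin1_bytes)  # b'caf\xe9'

# Decoding with the matching encoding recovers the original string.
round_trip = utf8_bytes.decode('utf-8')
print(round_trip == text)          # True
print(len(text), len(utf8_bytes))  # 4 5
```

The string itself has no encoding attached; an encoding only comes into play at the moment of converting to or from bytes.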
ffriend
  • I keep thinking that this is one of the best improvements in Python 3. – Diego Herranz Sep 20 '13 at 10:01
  • But why does the author say "there is no such thing as a Python string encoded in UTF-8..."? It *is* in some Unicode format, isn't it? – batman Sep 20 '13 at 10:09
  • It's just a matter of terminology - programmer doesn't use encodings, he uses just strings. But yes, internally strings in Py3k are represented as some kind of Unicode. In fact, [Unicode objects internally use a variety of representations, in order to allow handling the complete range of Unicode characters while staying memory efficient](http://docs.python.org/3/c-api/unicode.html). – ffriend Sep 20 '13 at 10:35
  • @learner: UTF-8 is not Unicode! Strings in Python 3 are Unicode objects. If you *encode* that object to, say, UTF-8, then you get a *bytes* object. – Tim Pietzcker Sep 20 '13 at 11:25

When I say s = 'hello', how is s encoded internally? Of course it must use some encoding.

It depends. Frankly, it doesn't matter. CPython now uses the Flexible String Representation, a wonderful space and time optimisation. But you shouldn't care because it doesn't matter.

He says all strings are sequences of Unicode characters. But how many bytes is each character?

Dunno. It depends. In that particular case, CPython will probably store it as Latin-1 (1 byte per character).
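
If you are curious anyway, here is a rough sketch of how the storage width can be observed. It assumes CPython 3.3 or later; `sys.getsizeof` reports implementation-specific sizes, so other interpreters may behave differently or not support it at all. The variable names are illustrative.

```python
import sys

# Three strings of equal length whose widest code point differs.
n = 1000
ascii_s = 'a' * n             # every code point is below 128
bmp_s = '\u20ac' * n          # EURO SIGN, does not fit in one byte
astral_s = '\U0001F600' * n   # outside the Basic Multilingual Plane

for s in (ascii_s, bmp_s, astral_s):
    # Sizes include object overhead and vary by CPython version and platform.
    print(len(s), sys.getsizeof(s))
```

All three strings have the same length in characters, yet on CPython the reported sizes grow with the widest code point in the string. None of that is visible through the normal string API, which is the point.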

Is this string UTF-8?

No.

Why does he say: "There is no such thing as a Python string encoded in UTF-8"?

Because it's a series of Unicode code points. If you confuse encodings with strings (as other languages often force you to do), you might think that 'Jalape\xc3\xb1o' is 'Jalapeño', because in UTF-8 the byte sequence \xc3\xb1 represents 'ñ'. But it's not, because the string doesn't have an intrinsic encoding, just like the number 100 is the number 100, not 4, whether you represent it in binary, decimal or unary.
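
A small sketch of that distinction in Python 3 (the variable names are made up for illustration):

```python
# A str whose code points happen to match the UTF-8 bytes of 'ñ'.
lookalike = 'Jalape\xc3\xb1o'   # contains U+00C3 and U+00B1, two separate characters
real = 'Jalape\xf1o'            # contains U+00F1, the single character 'ñ'

print(lookalike == real)           # False
print(len(lookalike), len(real))   # 9 8

# Only a bytes object can be decoded as UTF-8.
raw = b'Jalape\xc3\xb1o'
print(raw.decode('utf-8') == real)  # True
```

The escape sequences inside a str literal name code points, not bytes, so the first string really is nine characters long.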

He says it because people come from languages where they only have bytes that represent strings and they think "but how is this encoded" as if they have to decode it themselves. It'd be like carrying a list of 1s and 0s instead of being able to use numbers, and you have to tell every function what endianness you're using.

I understand Python provides capabilities of converting a Python "string" into a series of bytes that can be read by another software that uses that encoding. It also supports conversion of a series of bytes into a Python "string". Now the internal representation of this "string" is what confuses me.

Hopefully it does not any more :).


If this confuses you, I recommend this question, partially 'cause someone called my answer "superbly comprehensive"¹ but also because Steven D'Aprano has had one of his Python Mailing List excellencies posted there - he and I answered from the list and had our text posted across.

If you're wondering why it's relevant, I'll quote:

So the person you are quoting is causing confusion when he talks about an "encoded string", he should either make it clear he means a string of bytes, or not mention the word string at all.

Isn't that exactly your confusion?

¹ Technically he called another answer "another superbly comprehensive answer", but that implies what I just said ;).

Veedrac

Python uses UCS-2 or UCS-4 internally for Unicode strings, depending on how the interpreter was built (at least in Python 2.x).