Returning a Unicode string vs. Returning a normal string encoded as UTF-8?

Question

On the tutorial page for the Django web framework, the author explains why adding a __unicode__() method is preferred over a __str__() method, giving the following reason:

Django models have a default __str__() method that calls __unicode__() and converts the result to a UTF-8 bytestring. This means that unicode(p) will return a Unicode string, and str(p) will return a normal string, with characters encoded as UTF-8.

I don't understand the difference between a Unicode string and a string with characters encoded as UTF-8. I thought UTF-8 was one of the encodings for Unicode?

http://utf8everywhere.org – Pavel Radzivilovsky Aug 01 '13 at 19:56

score 3 · Answer 1 · edited May 23 '17 at 12:14

Python Unicode objects are abstract - they represent a sequence of Unicode code points independent of any particular encoding. A UTF-8 encoded string, on the other hand, is a sequence of bytes that encodes a sequence of Unicode code points. They're different levels of abstraction.
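As a rough illustration of the two levels, here is a minimal sketch. It assumes Python 3, where `str` plays the role of Python 2's `unicode` and `bytes` plays the role of Python 2's `str`; the sample text is arbitrary.

```python
# Python 3: str is a sequence of code points, bytes is a sequence of bytes.
text = "caf\u00e9"                 # 4 code points: c, a, f, é
encoded = text.encode("utf-8")     # the same text as a UTF-8 byte string

code_points = [ord(ch) for ch in text]   # abstract code point numbers
byte_values = list(encoded)              # concrete byte values

print(len(text), code_points)      # counts code points
print(len(encoded), byte_values)   # counts bytes; é takes two bytes in UTF-8

# Decoding with the matching encoding recovers the same abstract text.
assert encoded.decode("utf-8") == text
```

The lengths differ because one object counts characters and the other counts bytes, even though both describe the same text.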

You can think of a code point as being like an abstract number, and an encoding as being like a particular binary representation of that number. A Unicode object represents the "number" (actually the code points), while a string represents the binary. This analogy is not exact, but if you're already used to the idea that, say, an object representing the integer "8" is different from an object representing the specific bit sequence "00001000", it may prove clarifying, especially if you've worked with systems like two's complement, where the bit sequence that represents the abstract integer "8" would be different.
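The analogy can be sketched in Python 3 (an assumption; in Python 2 the text half would use `unicode` objects). The byte widths, byte orders and encodings below are chosen only for illustration.

```python
# One abstract integer, several concrete byte layouts.
n = 8
as_one_byte = n.to_bytes(1, "big")        # b'\x08'
as_four_big = n.to_bytes(4, "big")        # b'\x00\x00\x00\x08'
as_four_little = n.to_bytes(4, "little")  # b'\x08\x00\x00\x00'

# One abstract text, several concrete encodings.
text = "\u00e9"                           # the single code point U+00E9 (é)
as_utf8 = text.encode("utf-8")            # b'\xc3\xa9'
as_utf16 = text.encode("utf-16-be")       # b'\x00\xe9'
as_latin1 = text.encode("latin-1")        # b'\xe9'

# Different bytes, same abstract value once interpreted correctly.
assert int.from_bytes(as_four_little, "little") == n
assert as_utf16.decode("utf-16-be") == as_latin1.decode("latin-1") == text
```

In both halves the bytes only mean something once you know which representation was used to produce them.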

This essay, while now almost ten years old, is still one of the clearest and most comprehensive explanations of the concepts I've ever run into.

This answer is pretty good on the Python-specific details.

Nitpick: the representation of positive numbers in two's complement is not different from their normal base-2 representation (except for the leading zeros). — R. Martinho Fernandes, Aug 01 '13 at 08:37
I knew I should have looked that up. But I think the concept still applies. — Peter DeGlopper, Aug 01 '13 at 08:47

score 0 · Answer 2 · Eric Urban · 2013-08-01T04:12:13.847

UTF-8 is an encoding of the entire Unicode character set. It is backwards compatible with ASCII. Characters outside the ASCII set are encoded as multibyte sequences.

All ASCII strings are Unicode strings, but not all Unicode strings are ASCII strings.
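A small sketch of that compatibility, assuming Python 3 (`str` for text, `bytes` for encoded data); the sample strings are made up for the example.

```python
ascii_text = "hello"
mixed_text = "h\u00e9llo"   # contains é (U+00E9), which is outside ASCII

# For pure ASCII text the two encodings produce identical bytes.
same_bytes = ascii_text.encode("ascii") == ascii_text.encode("utf-8")

# Non-ASCII text still encodes as UTF-8, using a multibyte sequence for é...
utf8_len = len(mixed_text.encode("utf-8"))

# ...but it cannot be encoded as ASCII at all.
try:
    mixed_text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_bytes, utf8_len, ascii_ok)
```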

Eric Urban · answered Aug 01 '13 at 04:06 · edited Aug 01 '13 at 04:12

I'm sorry, but this is wrong. UTF-8 is an encoding capable of representing the entire Unicode character set. ASCII is a subset of UTF-8, but they are not identical. – Peter DeGlopper Aug 01 '13 at 04:09
Ah yes, sadly true. One day the earth will be cleansed of the scourge that is multiple byte encodings. – Eric Urban Aug 01 '13 at 04:12
@EricUrban, so how do you propose we represent languages other than English, if multiple byte encodings are a scourge? Do you have a way of representing all the scripts of all the languages of the world inside 8 bits? – JoelFan Feb 10 '23 at 00:04

score 0 · Answer 3 · answered Aug 01 '13 at 04:07

In Python, Unicode strings are stored internally as UCS-2 or UCS-4/UTF-32, which use fixed-width 16-bit and 32-bit units, respectively. UTF-8, on the other hand, is a variable-length encoding: it uses a single byte (8 bits) for each ASCII character and two to four bytes (up to 32 bits) for code points beyond the basic ASCII table.
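To see the variable-length behaviour, here is a minimal sketch assuming Python 3. The sample characters and the dictionary names are illustrative only, and the sketch compares encoded sizes rather than inspecting the interpreter's internal storage.

```python
samples = {
    "A": "A",                 # U+0041, ASCII
    "e-acute": "\u00e9",      # U+00E9
    "euro": "\u20ac",         # U+20AC
    "emoji": "\U0001F600",    # U+1F600, outside the Basic Multilingual Plane
}

# Bytes needed per character in a variable-length and a fixed-length encoding.
utf8_lengths = {name: len(ch.encode("utf-8")) for name, ch in samples.items()}
utf32_lengths = {name: len(ch.encode("utf-32-be")) for name, ch in samples.items()}

print(utf8_lengths)    # grows from 1 to 4 bytes
print(utf32_lengths)   # always 4 bytes
```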

bossi · answered Aug 01 '13 at 04:07

The "stored internally as" part is no longer valid for Python 3.3. – Matthias Aug 01 '13 at 05:45

score 0 · Answer 4 · answered Aug 01 '13 at 04:09

You may want to dive into this sea of Unicode documentation.

http://www.utf-8.com/

You'll find that UTF-8 has become the dominant encoding for Unicode text.

KarTo · answered Aug 01 '13 at 04:09
