Unicode vs UTF-8 confusion in Python / Django?

Question

I stumbled over this passage in the Django tutorial:

Django models have a default `__str__()` method that calls `__unicode__()` and converts the result to a UTF-8 bytestring. This means that `unicode(p)` will return a Unicode string, and `str(p)` will return a normal string, with characters encoded as UTF-8.

Now, I'm confused because afaik Unicode is not any particular representation, so what is a "Unicode string" in Python? Does that mean UCS-2? Googling turned up this "Python Unicode Tutorial" which boldly states

Unicode is a two-byte encoding which covers all of the world's common writing systems.

which is plain wrong, or is it? I have been confused many times by character set and encoding issues, but here I'm quite sure that the documentation I'm reading is confused. Does anybody know what's going on in Python when it gives me a "Unicode string"?

score 54 · Accepted Answer · edited Oct 21 '09 at 10:11


what is a "Unicode string" in Python? Does that mean UCS-2?

Unicode strings in Python are stored internally either as UCS-2 (fixed-length 16-bit representation, almost the same as UTF-16) or UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16 whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python.

You are generally not supposed to care: you will see Unicode code-points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're in a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane you'll be Doing It Wrong, but that's still very rare, and users who really need the extra characters should be compiling wide builds.
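If you do want to know which kind of build you are on, `sys.maxunicode` reports the largest code point a single string element can hold. A minimal sketch; the helper name `describe_build` is made up for illustration. Note that from Python 3.3 onward the narrow/wide compile-time distinction is gone, so on those versions this always reports the full range.

```python
import sys


def describe_build():
    """Return 'narrow' for 16-bit code-unit builds, 'wide' otherwise."""
    # Narrow builds report 0xFFFF; wide builds (and Python 3.3+) report 0x10FFFF.
    return "narrow" if sys.maxunicode == 0xFFFF else "wide"


print(describe_build(), hex(sys.maxunicode))
```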

plain wrong, or is it?

Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term “Unicode” to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
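To make the distinction concrete: a Unicode string holds abstract code points, and UTF-8, UTF-16LE and UTF-32LE are just different byte serialisations of them. A small illustrative sketch, written for Python 3 (the sample character is my own choice, not from the tutorial):

```python
import binascii

text = u"\u20ac"  # one code point: U+20AC EURO SIGN

# Encode the same single code point with three different encodings.
encoded = {name: text.encode(name) for name in ("utf-8", "utf-16-le", "utf-32-le")}

for name in ("utf-8", "utf-16-le", "utf-32-le"):
    print(name, binascii.hexlify(encoded[name]).decode("ascii"))
# utf-8 e282ac
# utf-16-le ac20
# utf-32-le ac200000
```

The "Unicode" that Windows APIs talk about corresponds to the `utf-16-le` row here.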

edited Oct 21 '09 at 10:11

Hanno Fietz


answered Feb 07 '09 at 00:54

bobince


I think the difference between UCS-2 and UTF-16 is at least noteworthy, since one is fixed-length and the other isn't. If I care about the internal representation at all, I want to know that. – Hanno Fietz Oct 21 '09 at 10:00
Is it really UCS-2? Since Python may handle characters > `sys.maxunicode`, only that you may happen to slice characters in the middle. With UCS-2, how would it be possible to display/store/encode/decode characters above `sys.maxunicode`? (Tested with Python 3.1) – u0b34a0f6ae Dec 10 '09 at 16:00
It must be UTF-16, since UCS-2 does not support surrogate pairs. Demonstration on a narrow build of Python 3.1, breaking a character up into surrogates: `list(chr(sys.maxunicode + 1))`. The result is `['\ud800', '\udc00']`. Can someone confirm that on (narrow) Python 2 as well? – u0b34a0f6ae Dec 10 '09 at 16:08
Yes, Python 2 also allows a single non-BMP character to be created as two surrogate code units via a `\U00nnnnnn` string literal escape (though not via `unichr`, which rejects values above 0xFFFF on a narrow build). So technically it is using UTF-16 with UCS-2 semantics. I hate having to use the term ‘UTF-16’ though, since it can mean either a series of 16-bit code units, or the big-or-little-endian byte-based encoding of the same, which causes a whole load of confusion. In practice all ‘UCS-2’ is really ‘UTF-16’ since the latter is a more commonly used superset of the former. – bobince Dec 10 '09 at 16:46
Does this mean the length of a Python string is **not** guaranteed to be the number of constituent Unicode code points? Aren't you guys all saying that depends on how you've built Python and whether the code points fall in the BMP or up in the astral planes? How can you write a portable Python program, then, one that behaves the same way no matter the build options or code points? Doesn't this break the model of abstract characters? In Perl, `length(chr(0x10345))` is guaranteed to be `1`. Are you saying in Python, it might be or might not happen to work out that way? **Isn't this a problem?** – tchrist Nov 02 '10 at 12:57

The length of a Python Unicode string in a narrow-Unicode build is the number of UTF-16 *code units*, not actual Unicode code points. Truncation and other slicing operations that go by arbitrary index can indeed split a surrogate pair in half, resulting in missing or replaced characters. On a narrow build, `unichr(0x10345)` simply fails; `len(u'\U00010345')` is `2`. This is the price you pay for easy interaction with Win32 UTF-16LE APIs. Most other environments use UCS-4, which doesn't suffer from any such problem. – bobince Nov 02 '10 at 20:15
It tends not to be a problem in reality because an invalid surrogate sequence doesn't result in any security issues (unlike UTF-8 overlongs), and Windows users can expect to face problems if they use characters outside the BMP, Python or not. Usage of the astral planes on Windows is still very rare. – bobince Nov 02 '10 at 20:17
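For readers following the surrogate discussion above, here is a rough sketch of the arithmetic involved, plus two build-independent ways of measuring a string. The helper names (`to_surrogate_pair`, `utf16_code_units`, `code_points`) are hypothetical, and the counting helpers assume well-formed text with no lone surrogates.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary-plane code point into (high, low) UTF-16 surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000          # 20 significant bits remain
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


def utf16_code_units(text):
    """Number of 16-bit code units: what len() reports on a narrow build."""
    return len(text.encode("utf-16-le")) // 2


def code_points(text):
    """Number of Unicode code points, regardless of build."""
    return len(text.encode("utf-32-le")) // 4


print([hex(unit) for unit in to_surrogate_pair(0x10000)])  # ['0xd800', '0xdc00']
print(utf16_code_units(u"\U00010345"), code_points(u"\U00010345"))  # 2 1
```

The first printed pair matches the `['\ud800', '\udc00']` result reported in the comment above.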

score 9 · Answer 2 · edited Feb 09 '09 at 13:50

Meanwhile, I did some more careful research to verify what the internal representation in Python is, and also what its limits are. "The Truth About Unicode In Python" is a very good article which quotes the Python developers directly. Apparently, the internal representation is either UCS-2 or UCS-4, depending on a compile-time switch. So Jon, it's not UTF-16, but your answer put me on the right track anyway, thanks.

score 1 · Answer 3 · answered Aug 22 '08 at 12:03


Python stores Unicode as UTF-16. str() will return the UTF-8 representation of the UTF-16 string.

answered Aug 22 '08 at 12:03

Jonathan Works



Python stores Unicode strings as UTF-16 or UTF-32, depending on the platform and compile options. – tzot Feb 07 '09 at 19:18
On what platform does str(unicode_string) return UTF-8? Did you try it? e.g. str(u"\u0369") – tzot Feb 07 '09 at 19:20

Wrong on both counts. `str(unicode_val)` will encode according to `sys.getdefaultencoding()`. – Tobu Jan 31 '12 at 13:16
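A quick way to see the point made in these comments. In Python 2 the default encoding is normally `ascii`, so `str(u"\u0369")` raises `UnicodeEncodeError` rather than producing UTF-8. The sketch below reproduces that with explicit `encode` calls so it also runs on Python 3, where `str` is itself the Unicode type and the default encoding is `utf-8`. The helper `try_encode` is made up for illustration.

```python
import sys

text = u"\u0369"  # the non-ASCII character from the comment above


def try_encode(value, encoding):
    """Return the encoded bytes, or None if the encoding cannot represent value."""
    try:
        return value.encode(encoding)
    except UnicodeEncodeError:
        return None


print(sys.getdefaultencoding())
print(try_encode(text, "ascii"))   # None: ASCII cannot represent U+0369
print(try_encode(text, "utf-8"))   # two bytes, 0xCD 0xA9
```

If you want UTF-8 bytes, asking for them explicitly with `.encode("utf-8")` avoids depending on the default.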

score -1 · Answer 4 · answered Aug 22 '08 at 12:10

From Wikipedia on UTF-8:

UTF-8 (8-bit UCS/Unicode Transformation Format) is a variable-length character encoding for Unicode. It is able to represent any character in the Unicode standard, yet the initial encoding of byte codes and character assignments for UTF-8 is backwards compatible with ASCII. For these reasons, it is steadily becoming the preferred encoding for e-mail, web pages[1], and other places where characters are stored or streamed.

So, it's anywhere between one and four bytes depending on which character you wish to represent within the realm of Unicode.
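A short illustration of the one-to-four-byte range; the sample characters are my own picks, one from each length class:

```python
samples = [
    u"A",           # U+0041, ASCII
    u"\u00e9",      # U+00E9, Latin small e with acute
    u"\u20ac",      # U+20AC, euro sign
    u"\U00010345",  # U+10345, outside the Basic Multilingual Plane
]

# Number of bytes each character occupies once encoded as UTF-8.
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```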

From Wikipedia on Unicode:

In computing, Unicode is an industry standard allowing computers to consistently represent and manipulate text expressed in most of the world's writing systems.

So it's able to represent most (but not all) of the world's writing systems.

I hope this helps :)

Ravi Chhabra · Answer 5 · 2008-08-25T14:01:34.597

so what is a "Unicode string" in Python?

Python 'knows' that your string is Unicode. Hence if you run a regex on it, it will know what is a character and what is not, which is really helpful. If you take its length with `len`, you also get the correct result. As an example, if you count the characters in Hello, you will get 5 (even if it's Unicode). But if you count the characters of a foreign word and that string is not a Unicode string, then you will get a much larger result. Python uses the information from the Unicode Character Database to identify each character in the Unicode string. Hope that helps.
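A rough sketch of the difference described here, using a sample word of my own choosing. The Unicode string is counted in characters, the UTF-8 byte string in bytes, and the standard `unicodedata` module exposes the Unicode Character Database:

```python
import unicodedata

word = u"Gr\u00fc\u00dfe"        # "Grüße", five characters
as_bytes = word.encode("utf-8")  # the same word as a UTF-8 byte string

print(len(word))      # 5
print(len(as_bytes))  # 7, because ü and ß take two bytes each in UTF-8

# Character properties come from the Unicode Character Database.
print(unicodedata.name(u"\u00fc"))      # LATIN SMALL LETTER U WITH DIAERESIS
print(unicodedata.category(u"\u00df"))  # Ll (lowercase letter)
```

Bear in mind the caveat discussed in the accepted answer's comments: on a narrow build, characters outside the BMP count as two elements.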
