
If I understand correctly, UTF-32 can handle every character in the universe. So can UTF-16, through the use of surrogate pairs. So is there any good reason to use UTF-32 instead of UTF-16?

Alan Moore
zildjohn01
  • Another good question is why UTF-16 instead of UTF-8... – Mar 09 '09 at 04:53
  • UTF-16 is helpful if the majority of your characters are in the U+0800-U+FFFF range, for which UTF-8 needs one additional byte. UTF-32 doesn't make much sense. – Sedat Kapanoglu Mar 09 '09 at 05:14
  • Not "in the Universe", only "on Earth" (and not even that; see the Unicode FAQ). – PhiLho Feb 28 '11 at 14:17
  • By the way, while UTF-16 can represent each currently mapped character through the use of surrogate pairs, the range that UTF-8 and UTF-32 cover is bigger. So, when we run out of the 21 bits (about one million code points) that UTF-16 guarantees, we are in trouble. UTF-32 covers up to 32 bits, UTF-8 even more. – Andrea Jun 13 '12 at 13:06
  • Surrogate pairs are a nuisance. If you need to know the length without parsing, and be able to cut arbitrary sequences of codepoints into substrings - you are more comfortable on UTF-32 (it's pretty much idiot-proof). And UTF-16 is "kinda-sorta-fixed-width", but popularized through Windows and MSVC (their wchar_t, the only way to get decent i18n support). – Tomasz Gandor Oct 29 '14 at 10:50
  • @Andrea, from http://www.unicode.org/faq/utf_bom.html, quote: _"No. Both Unicode and ISO 10646 have policies in place that formally limit future code assignment to the integer range that can be expressed with current UTF-16 (0 to 1,114,111). Even if other encoding forms (i.e. other UTFs) can represent larger integers, these policies mean that all encoding forms will always represent the same set of characters."_ – Abel Nov 02 '14 at 01:17
  • Read this [Should UTF-16 be considered harmful?](http://programmers.stackexchange.com/q/102205/98103) and http://utf8everywhere.org/ – phuclv Aug 13 '15 at 11:38

7 Answers


In UTF-32 a Unicode character is always represented by 4 bytes, so parsing code is easier to write than for a UTF-16 string, where a character is represented by a varying number of bytes. On the downside, a UTF-32 character always requires 4 bytes, which can be wasteful if you are working mostly with, say, English characters. So whether to use UTF-16 or UTF-32 is a design choice that depends on your requirements.
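
A rough sketch in C of the difference this answer describes (the function names are illustrative, not any real API): counting code points in UTF-32 is just the number of 32-bit units, while UTF-16 needs a branch for surrogate pairs.

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: one 32-bit code unit per code point, so counting is trivial. */
    size_t utf32_codepoint_count(const uint32_t *s, size_t units)
    {
        (void)s;
        return units;
    }

    /* UTF-16: a lead surrogate in 0xD800..0xDBFF combines with the next
     * unit into a single code point, so the count depends on the data. */
    size_t utf16_codepoint_count(const uint16_t *s, size_t units)
    {
        size_t count = 0;
        for (size_t i = 0; i < units; count++)
            i += (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < units) ? 2 : 1;
        return count;
    }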

Raminder
  • Actually UTF-32 is wasteful for most texts, not just for English, because most living languages have all (or at least most) of their glyphs well within the range that doesn't require surrogate pairs in UTF-16. – Joachim Sauer Jul 19 '10 at 12:49
  • There was another reason for the Unicode Consortium to add the UTF-32 encoding: it helps to have a simple codepoint-to-string mapping that is one-to-one. With surrogate pairs (UTF-16) and the more complex UTF-8 there is no one-to-one mapping; a calculation is required. Using the Unicode tables and the mentioned codepoints, it is trivial, in fact a no-op, to get to the character representation. Of course, this is handy in theory and in documentation, but in practice the space waste is usually too big to resort to UTF-32. – Abel Nov 02 '14 at 01:07

Someone might prefer to deal with UTF-32 instead of UTF-16 because dealing with surrogate pairs is pretty much always handling special cases, and having to deal with those special cases means you have areas where bugs may creep in because you handle them incorrectly (or, more likely, forget to handle them at all).

If the increased memory usage of UTF-32 is not an issue, the reduced complexity might be enough of an advantage to choose it.
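
To make the special case concrete, here is a minimal, hedged sketch in C (hypothetical names, assuming well-formed input) of reading one code point from UTF-16; the surrogate branch is exactly the part that is easy to get wrong or forget:

    #include <stddef.h>
    #include <stdint.h>

    /* Decode the code point starting at s[*i] and advance *i.
     * Assumes well-formed UTF-16 (a lead surrogate is followed by a trail). */
    uint32_t utf16_next(const uint16_t *s, size_t *i)
    {
        uint16_t hi = s[(*i)++];
        if (hi >= 0xD800 && hi <= 0xDBFF) {            /* the special case */
            uint16_t lo = s[(*i)++];
            return 0x10000 + (((uint32_t)(hi - 0xD800) << 10) | (uint32_t)(lo - 0xDC00));
        }
        return hi;                                     /* common case: BMP code point */
    }

The UTF-32 equivalent is just return s[(*i)++]; with no branch at all.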

Michael Burr

Here is some good documentation from the Unicode Consortium too.

Comparison of the Advantages of UTF-32, UTF-16, and UTF-8

Copyright © 1991–2009 Unicode, Inc. The Unicode Standard, Version 5.2

On the face of it, UTF-32 would seem to be the obvious choice of Unicode encoding forms for an internal processing code because it is a fixed-width encoding form. It can be conformantly bound to the C and C++ wchar_t, which means that such programming languages may offer built-in support and ready-made string APIs that programmers can take advantage of. However, UTF-16 has many countervailing advantages that may lead implementers to choose it instead as an internal processing code. While all three encoding forms need at most 4 bytes (or 32 bits) of data for each character, in practice UTF-32 in almost all cases for real data sets occupies twice the storage that UTF-16 requires. Therefore, a common strategy is to have internal string storage use UTF-16 or UTF-8 but to use UTF-32 when manipulating individual characters.

UTF-32 Versus UTF-16. On average, more than 99 percent of all UTF-16 data is expressed using single code units. This includes nearly all of the typical characters that software needs to handle with special operations on text—for example, format control characters. As a consequence, most text scanning operations do not need to unpack UTF-16 surrogate pairs at all, but rather can safely treat them as an opaque part of a character string. For many operations, UTF-16 is as easy to handle as UTF-32, and the performance of UTF-16 as a processing code tends to be quite good. UTF-16 is the internal processing code of choice for a majority of implementations supporting Unicode. Other than for Unix platforms, UTF-16 provides the right mix of compact size with the ability to handle the occasional character outside the BMP.

UTF-32 has somewhat of an advantage when it comes to simplicity of software coding design and maintenance. Because the character handling is fixed width, UTF-32 processing does not require maintaining branches in the software to test and process the double code unit elements required for supplementary characters by UTF-16. Conversely, 32-bit indices into large tables are not particularly memory efficient. To avoid the large memory penalties of such indices, Unicode tables are often handled as multistage tables (see “Multistage Tables” in Section 5.1, Transcoding to Other Standards). In such cases, the 32-bit code point values are sliced into smaller ranges to permit segmented access to the tables. This is true even in typical UTF-32 implementations.

The performance of UTF-32 as a processing code may actually be worse than the performance of UTF-16 for the same data, because the additional memory overhead means that cache limits will be exceeded more often and memory paging will occur more frequently. For systems with processor designs that impose penalties for 16-bit aligned access but have very large memories, this effect may be less noticeable.

In any event, Unicode code points do not necessarily match user expectations for “characters.” For example, the following are not represented by a single code point: a combining character sequence such as ; a conjoining jamo sequence for Korean; or the Devanagari conjunct “ksha.” Because some Unicode text processing must be aware of and handle such sequences of characters as text elements, the fixed-width encoding form advantage of UTF-32 is somewhat offset by the inherently variable-width nature of processing text elements. See Unicode Technical Standard #18, “Unicode Regular Expressions,” for an example where commonly implemented processes deal with inherently variable-width text elements owing to user expectations of the identity of a “character.”

UTF-8. UTF-8 is reasonably compact in terms of the number of bytes used. It is really only at a significant size disadvantage when used for East Asian implementations such as Chinese, Japanese, and Korean, which use Han ideographs or Hangul syllables requiring three-byte code unit sequences in UTF-8. UTF-8 is also significantly less efficient in terms of processing than the other encoding forms.

Binary Sorting. A binary sort of UTF-8 strings gives the same ordering as a binary sort of Unicode code points. This is obviously the same order as for a binary sort of UTF-32 strings.

General Structure

All three encoding forms give the same results for binary string comparisons or string sorting when dealing only with BMP characters (in the range U+0000..U+FFFF). However, when dealing with supplementary characters (in the range U+10000..U+10FFFF), UTF-16 binary order does not match Unicode code point order. This can lead to complications when trying to interoperate with binary sorted lists—for example, between UTF-16 systems and UTF-8 or UTF-32 systems. However, for data that is sorted according to the conventions of a specific language or locale rather than using binary order, data will be ordered the same, regardless of the encoding form.
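
Not part of the quoted text, but a small hedged illustration of that ordering point in C: U+FFFD sorts before U+10000 by code point, yet in UTF-16 the lead surrogate 0xD800 that encodes U+10000 compares lower than the single unit 0xFFFD, so a naive 16-bit comparison reverses the order (a byte-wise comparison of UTF-8 or UTF-32 does not).

    #include <stdint.h>
    #include <stdio.h>

    int main(void)
    {
        uint16_t a[] = { 0xFFFD };             /* U+FFFD in UTF-16  */
        uint16_t b[] = { 0xD800, 0xDC00 };     /* U+10000 in UTF-16 */

        /* Code point order: U+FFFD < U+10000.
         * Unit-wise UTF-16 order: 0xFFFD > 0xD800, i.e. the opposite. */
        printf("UTF-16 binary order %s code point order\n",
               a[0] < b[0] ? "matches" : "does not match");
        return 0;
    }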

c4il

Short answer: no.

Longer answer: yes, for compatibility with other things that didn't get the memo.

Less sarcastic answer: When you care more about speed of indexing than about space usage, or as an intermediate format of some sort, or on machines where alignment issues were more important than cache issues, or...

MarkusQ

UTF-8 can also represent any Unicode character!

If your text is mostly English, you can save a lot of space by using UTF-8, but indexing characters is not O(1), because some characters take up more than one byte.

If space is not as important to your situation as speed is, UTF-32 would suit you better, because indexing is O(1).

UTF-16 can be better than UTF-8 for non-English text because in UTF-8 some characters take up 3 bytes, whereas in UTF-16 they'd take up only 2 bytes.
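
A minimal sketch of those size trade-offs in C (assuming valid Unicode scalar values up to U+10FFFF):

    #include <stdint.h>

    /* Bytes needed for one code point in each encoding form. */
    unsigned utf8_bytes(uint32_t cp)  { return cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4; }
    unsigned utf16_bytes(uint32_t cp) { return cp < 0x10000 ? 2 : 4; }
    unsigned utf32_bytes(uint32_t cp) { (void)cp; return 4; }

    /* e.g. 'A' (U+0041): 1 / 2 / 4 bytes; a Han ideograph such as U+4E2D:
     * 3 / 2 / 4 bytes - the 3-versus-2 byte case mentioned above. */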

hasen
  • Apparently UTF-32 is programmatically faster, even if you would save a lot of space using UTF-8, due to being able to process with a more efficient word size (i.e. 32 bits, rather than handling each 8-bit chunk at a time), though with a (substantially) more complex UTF-8 library that's a non-issue. – Arafangion Mar 09 '09 at 08:21

There are probably a few good reasons, but one would be to speed up indexing / searching, e.g. in databases and the like.

With UTF-32 you know that each code point is exactly 4 bytes. With UTF-16 you don't know how many bytes any particular character will take.

For example, you have a function that returns the nth char of a string:

char getChar(int index, String s);

If you are coding in a language that has direct memory access, say C, then in UTF-32 this function may be as simple as some pointer arithmetic (s + 4*index), which would be O(1).

If you are using UTF-16 though, you would have to walk the string, decoding as you went, which would be O(n).
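
A hedged sketch of both lookups in C (hypothetical function names, error handling omitted) that makes the O(1) versus O(n) difference concrete:

    #include <stddef.h>
    #include <stdint.h>

    /* UTF-32: the nth code point is a plain array access - O(1). */
    uint32_t nth_codepoint_utf32(const uint32_t *s, size_t n)
    {
        return s[n];    /* the pointer arithmetic s + 4*n in bytes */
    }

    /* UTF-16: walk the string and skip surrogate pairs - O(n). */
    uint32_t nth_codepoint_utf16(const uint16_t *s, size_t units, size_t n)
    {
        size_t i = 0;
        while (i < units) {
            int pair = (s[i] >= 0xD800 && s[i] <= 0xDBFF && i + 1 < units);
            if (n == 0)
                return pair ? 0x10000 + (((uint32_t)(s[i] - 0xD800) << 10)
                                         | (uint32_t)(s[i + 1] - 0xDC00))
                            : s[i];
            i += pair ? 2 : 1;
            n--;
        }
        return 0;       /* index out of range; a real API would report an error */
    }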

SCdF

In general, you just use the string datatype/encoding of the underlying platform, which is often (Windows, Java, Cocoa...) UTF-16 and sometimes UTF-8 or UTF-32. This is mostly for historical reasons; there is little difference between the three Unicode encodings: all three are well-defined, fast and robust, and all of them can encode every Unicode code point sequence. The unique feature of UTF-32, namely that it is a fixed-width encoding (meaning that each code point is represented by exactly one code unit), is of little use in practice: your memory management layer needs to know about the number and width of code units, and users are interested in abstract characters and graphemes. As mentioned by the Unicode standard, Unicode applications have to deal with combining characters, ligatures and so on anyway, and the handling of surrogate pairs, despite being conceptually different, can be done within the same technical framework.

If I were to reinvent the world, I'd probably go for UTF-32 because it is simply the least complex encoding, but as it stands the differences are too small to be of practical concern.
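
As a small illustration of that point (an added snippet, not from the answer): even in fixed-width UTF-32, the single user-perceived character "é" can occupy two code units when written as a combining sequence.

    #include <stddef.h>
    #include <stdint.h>

    /* "é" as a combining sequence: U+0065 LATIN SMALL LETTER E followed by
     * U+0301 COMBINING ACUTE ACCENT. One grapheme, two UTF-32 code units,
     * so "one unit == one character" still does not hold for users. */
    static const uint32_t e_acute[] = { 0x0065, 0x0301 };
    static const size_t   e_acute_units = sizeof e_acute / sizeof e_acute[0];  /* 2 */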

Philipp