55

Okay. I know this looks like the typical "Why didn't he just Google it or go to www.unicode.org and look it up?" question, but for such a simple question the answer still eludes me after checking both sources.

I am pretty sure that all three of these encoding systems support all of the Unicode characters, but I need to confirm it before I make that claim in a presentation.

Bonus question: Do these encodings differ in the number of characters they can be extended to support?

phuclv
JohnFx

6 Answers

71

There is no Unicode character that can be stored in one encoding but not another. This is simply because the valid Unicode characters have been restricted to what can be stored in UTF-16 (which has the smallest capacity of the three encodings). In other words, UTF-8 and UTF-32 could be used to represent a wider range of characters than UTF-16, but they aren't. Read on for more details.

UTF-8

UTF-8 is a variable-length code. Some characters require 1 byte, some require 2, some 3 and some 4. The bytes for each character are simply written one after another as a continuous stream of bytes.

While some UTF-8 characters can be 4 bytes long, UTF-8 cannot encode 2^32 characters. It's not even close. I'll try to explain the reasons for this.

The software that reads a UTF-8 stream just gets a sequence of bytes - how is it supposed to decide whether the next 4 bytes are a single 4-byte character, or two 2-byte characters, or four 1-byte characters (or some other combination)? Basically this is done by deciding that certain 1-byte sequences aren't valid characters, and certain 2-byte sequences aren't valid characters, and so on. When these invalid sequences appear, it is assumed that they form part of a longer sequence.

You've seen a rather different example of this, I'm sure: it's called escaping. In many programming languages it is decided that the \ character in a string's source code doesn't translate to any valid character in the string's "compiled" form. When a \ is found in the source, it is assumed to be part of a longer sequence, like \n or \xFF. Note that \x is an invalid 2-character sequence, and \xF is an invalid 3-character sequence, but \xFF is a valid 4-character sequence.

Basically, there's a trade-off between having many characters and having shorter characters. If you want 2^32 characters, they need to be on average 4 bytes long. If you want all your characters to be 2 bytes or less, then you can't have more than 2^16 characters. UTF-8 gives a reasonable compromise: all ASCII characters (ASCII 0 to 127) are given 1-byte representations, which is great for compatibility, but many more characters are allowed.

Like most variable-length encodings, including the kinds of escape sequences shown above, UTF-8 is an instantaneous code. This means that the decoder just reads byte by byte, and as soon as it reaches the last byte of a character it knows what the character is (and it knows that it isn't the beginning of a longer character).

For instance, the character 'A' is represented using the byte 65, and there are no two/three/four-byte characters whose first byte is 65. Otherwise the decoder wouldn't be able to tell those characters apart from an 'A' followed by something else.

But UTF-8 is restricted even further. It ensures that the encoding of a shorter character never appears anywhere within the encoding of a longer character. For instance, none of the bytes in a 4-byte character can be 65.

Since UTF-8 has 128 different 1-byte characters (whose byte values are 0-127), all 2, 3 and 4-byte characters must be composed solely of bytes in the range 128-255. That's a big restriction. However, it allows byte-oriented string functions to work with little or no modification. For instance, C's strstr() function always works as expected if its inputs are valid UTF-8 strings.
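
To make that structure concrete, here is a minimal sketch (my own illustration in Python, not part of the original answer) of a function that classifies a lead byte and reports how long the sequence must be; the byte ranges simply follow the rules described above.

```python
def utf8_sequence_length(lead: int) -> int:
    """How many bytes a UTF-8 sequence starting with `lead` occupies.

    0xxxxxxx -> 1 byte (ASCII, values 0-127)
    110xxxxx -> 2 bytes
    1110xxxx -> 3 bytes
    11110xxx -> 4 bytes
    10xxxxxx -> continuation byte; never starts a character
    """
    if lead < 0x80:
        return 1
    if 0xC0 <= lead < 0xE0:
        return 2
    if 0xE0 <= lead < 0xF0:
        return 3
    if 0xF0 <= lead < 0xF8:
        return 4
    # 0x80-0xBF (continuation bytes) and 0xF8-0xFF never begin a character.
    raise ValueError("not a valid UTF-8 lead byte")

# 'A' takes one byte, '€' (U+20AC) three, '😀' (U+1F600) four.
for ch in "A€😀":
    encoded = ch.encode("utf-8")
    assert utf8_sequence_length(encoded[0]) == len(encoded)
    print(ch, [hex(b) for b in encoded])
```

(Strict UTF-8 also rejects the lead bytes 0xC0, 0xC1 and 0xF5-0xF7, which can only produce overlong or out-of-range sequences; that check is omitted here for brevity.)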

UTF-16

UTF-16 is also a variable-length code; its characters consume either 2 or 4 bytes. 2-byte values in the range 0xD800-0xDFFF are reserved for constructing 4-byte characters, and all 4-byte characters consist of a 2-byte value in the range 0xD800-0xDBFF followed by a 2-byte value in the range 0xDC00-0xDFFF. For this reason, Unicode does not assign any characters in the range U+D800-U+DFFF.
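
As a concrete illustration (my own sketch, not part of the original answer), the surrogate-pair arithmetic works like this: a code point above U+FFFF is reduced by 0x10000, and the remaining 20 bits are split into two 10-bit halves that land in the two reserved ranges.

```python
def utf16_code_units(cp):
    """Encode one code point as a list of 16-bit UTF-16 code units."""
    if not (0 <= cp <= 0x10FFFF) or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp <= 0xFFFF:
        return [cp]                     # a single 2-byte code unit
    cp -= 0x10000                       # 20 bits remain
    high = 0xD800 + (cp >> 10)          # top 10 bits -> 0xD800-0xDBFF
    low = 0xDC00 + (cp & 0x3FF)         # low 10 bits -> 0xDC00-0xDFFF
    return [high, low]

print([hex(u) for u in utf16_code_units(0x0041)])    # ['0x41']
print([hex(u) for u in utf16_code_units(0x1F600)])   # ['0xd83d', '0xde00']
```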

UTF-32

UTF-32 is a fixed-length code, with each character being 4 bytes long. While this allows the encoding of 2^32 different characters, only values between 0 and 0x10FFFF are allowed in this scheme.

Capacity comparison:

  • UTF-8: 2,097,152 (the byte patterns actually allow 2,164,864 sequences, but by design some of them are overlong encodings of the same characters)
  • UTF-16: 1,112,064
  • UTF-32: 4,294,967,296 (but restricted to the first 1,114,112)
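
The UTF-16 and UTF-32 figures can be verified with a little arithmetic (my own sanity check, not part of the original answer):

```python
code_points = 0x10FFFF + 1          # U+0000..U+10FFFF -> 1,114,112
surrogates = 0xDFFF - 0xD800 + 1    # reserved, never assigned -> 2,048
print(code_points - surrogates)     # 1,112,064 (UTF-16 capacity)
print(2 ** 32)                      # 4,294,967,296 (raw 4-byte values)
```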

The most restricted is therefore UTF-16! The formal Unicode definition has limited the Unicode characters to those that can be encoded with UTF-16 (i.e. the range U+0000 to U+10FFFF excluding U+D800 to U+DFFF). UTF-8 and UTF-32 support all of these characters.

The UTF-8 system is in fact "artificially" limited to 4 bytes. It can be extended to 8 bytes without violating the restrictions I outlined earlier, and this would yield a capacity of 2^42. The original UTF-8 specification allowed up to 6 bytes, which gives a capacity of 2^31. But RFC 3629 limited it to 4 bytes, since that is all that is needed to cover everything UTF-16 can encode.
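
Those powers of two fall out of the bit budget: each continuation byte carries 6 payload bits, and the lead byte carries whatever is left over after its length marker. Here is a rough illustration (my own sketch; anything beyond 6 bytes is a hypothetical extension, not part of any spec):

```python
def payload_bits(n):
    """Payload bits in an n-byte sequence (1 <= n <= 6, per the original spec)."""
    if n == 1:
        return 7                      # 0xxxxxxx
    lead_free = 8 - (n + 1)           # the length marker uses n ones plus a zero
    return lead_free + 6 * (n - 1)    # plus 6 bits per continuation byte

for n in range(1, 7):
    print(n, "bytes ->", payload_bits(n), "bits")
# 4 bytes -> 21 bits (2^21); 6 bytes -> 31 bits (the original spec's 2^31).
# A hypothetical 8-byte form with a 0xFF lead byte and 7 continuation bytes
# would carry 7 * 6 = 42 bits, which is where the 2^42 figure comes from.
```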

There are other (mainly historical) Unicode encoding schemes, notably UCS-2 (which is only capable of encoding U+0000 to U+FFFF).

Artelius
  • What's the RFC for the original UTF-8? – Pacerier Feb 01 '12 at 09:16
  • 5
    The marked correct answer appears to be flat-out wrong. This answer actually gives numbers, and is really thorough in its explanation. Awesome answer +1 – abelito Dec 03 '12 at 01:58
  • _UTF-8 cannot encode 2^32 characters. It's not even close._ Note that the old encoding supported (as you show) about 2^31 which is not close if you consider that 2 billion is a lot, but just a x2 difference is fairly close in computer software terms... – Alexis Wilke Jan 09 '14 at 22:35
  • So does that mean that if we were to store using up to 8 bytes, we could represent about 4.4*10^12 (2^42) characters? – jeromej Apr 20 '14 at 20:10
  • UTF-8 was originally defined as allowing 2³¹ values as ISO 10646 at the time allowed for extension that far, but it was since brought in line with Unicode (which uses the same character set as that defined in ISO 10646) which does not allow for characters beyond U+10FFFF. – Jon Hanna May 26 '15 at 16:21
  • Thanks a lot for the detailed explanation! – ifyouseewendy Nov 21 '16 at 02:34
  • 1
    The second mention of UTF-16 in the answer should be UTF-32. I'd edit it myself, but the change is too short to be accepted. Thanks for the answer. – andypea May 30 '18 at 22:48
45

No, they're simply different encoding methods. They all support encoding the same set of characters.

UTF-8 uses anywhere from one to four bytes per character depending on what character you're encoding. Characters within the ASCII range take only one byte while very unusual characters take four.

UTF-32 uses four bytes per character regardless of what character it is, so it will always use more space than UTF-8 to encode the same string. The only advantage is that you can calculate the number of characters in a UTF-32 string by only counting bytes.

UTF-16 uses two bytes for most characters, four bytes for unusual ones.
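
A quick way to see these trade-offs in practice (my own example, using Python's standard codecs; the -le variants are used to avoid a byte-order mark):

```python
s = "Aé€😀"   # one code point from each UTF-8 length class (1, 2, 3 and 4 bytes)
for name in ("utf-8", "utf-16-le", "utf-32-le"):
    print(name, len(s.encode(name)), "bytes")
# utf-8     -> 1 + 2 + 3 + 4 = 10 bytes
# utf-16-le -> 2 + 2 + 2 + 4 = 10 bytes
# utf-32-le -> 4 * 4         = 16 bytes

# With UTF-32, the code point count falls straight out of the byte count:
print(len(s.encode("utf-32-le")) // 4)   # 4
```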

http://en.wikipedia.org/wiki/Comparison_of_Unicode_encodings

skoob
  • 4
    "so it will always use more space than UTF-8" -- you mean more or equal space. – chazomaticus Nov 15 '08 at 01:06
  • 1
    and space cheapens every day. so the extra space utf-32 uses is not important. also to find the n'th character in utf-8 you need O(n), but in utf-32 you need only O(1), which is much faster! – Joschua May 15 '10 at 15:20
  • 5
    Slightly incorrect - UTF-8 uses anywhere between one and *six* bytes per character depending on the character you're encoding. – Arafangion Jan 25 '11 at 15:32
  • 12
    @Joschua UTF-8 and UTF-16 are ***both*** O(N) to find the Nth character; only UTF-32 is O(1). This perilous, pervasive, and pernicious problem plagues all programming languages and operating systems that violate the *Envelope of Abstraction*, and so code-monkeys dirty their FumbleMitzen on UTF-16 code units instead of pure, abstract code points. People *constantly* screw this up; e.g. in Java **always use** `String.codePointCount`, `String.codePointAt`, etc., and **never use** `String.length`, `String.charAt`, etc. See how all of Java's defaults are FUBAR? Use UTF-8 or UTF-32, or go mad dealing with UTF-16's idiosyncrasies. – tchrist Jun 13 '11 at 02:35
  • 7
    You can count the number of _code points_ in a UTF-32 string just by counting how long the string is. That's not the same as _user perceived characters_, because some characters can use multiple code points. – bames53 Nov 10 '11 at 03:32
  • 12
    @Arafangion UTF-8 will not give 5 or 6 bytes for a Unicode code-point. The definitions for encoding code-points outside of Unicode are obsolete and not standard UTF-8. – Jon Hanna Jan 12 '12 at 01:06
  • 1
    @JonHanna: Looks like you're right, they are now limited to 4 bytes. – Arafangion Jan 12 '12 at 02:02
  • 1
    @tchrist Upvoting because yours has to be the _most elaborately formatted comment_ I've ever seen on SoF – chb Oct 23 '17 at 21:07
  • 1
    Just wanted to comment on the idea that "the extra space utf-32 uses is not important." I found the following in the Python 3.8.1 Unicode documentation, which makes a good point: "Increased RAM usage doesn’t matter too much (desktop computers have gigabytes of RAM, and strings aren’t usually that large), but expanding our usage of disk and network bandwidth by a factor of 4 is intolerable." – soporific312 Jan 26 '20 at 16:54
7

UTF-8, UTF-16, and UTF-32 all support the full set of unicode code points. There are no characters that are supported by one but not another.

As for the bonus question "Do these encodings differ in the number of characters they can be extended to support?" Yes and no. The way UTF-8 and UTF-16 are encoded limits the total number of code points they can support to less than 2^32. However, the Unicode Consortium will not assign code points that cannot be represented in UTF-8 or UTF-16, even though UTF-32 could encode them. Doing so would violate the spirit of the encoding standards and make it impossible to guarantee a one-to-one mapping between UTF-32 and UTF-8 (or UTF-16).

Derek Park
  • AFAIK, there are ways to extend UTF-8 to support 32 bits fully. With UTF-16, the limit of U+10FFFF is hard-wired and cannot be overcome without completely changing the way surrogate pairs work. – C. K. Young Sep 24 '08 at 23:15
  • It could originally cover 31 bits. That is the maximum that the encoding scheme can handle. (It has since been revised to cover only the Unicode code points, far less than 31 bits.) – Derek Park Sep 24 '08 at 23:27
  • More accurately, the original UTF-8 spec allowed for 31 bits, but was later restricted by RFC 3629 to 21 bits (with the highest codepoint restricted to U+10FFFF instead of U+1FFFFF) to maintain full compatibility with the UTF-16 encoding, not Unicode itself. – Remy Lebeau Sep 03 '13 at 00:23
5

I personally always check Joel's post about unicode, encodings and character sets when in doubt.

Atanas Korchev
  • 2
    Why not check unicode.org instead, which benefits from actually being correct about things. – Jon Hanna Jan 12 '12 at 01:13
  • Joel's post isn't intended as a reference for unicode, encodings, character sets, or any of that stuff. Rather, it is a posting stating what you must be *aware* of. – Arafangion Jan 12 '12 at 02:08
  • @JonHanna could you clarify what part of Joel's post is incorrect? – Yu Chen Mar 31 '20 at 16:13
4

All of the UTF-8/16/32 encodings can map all Unicode characters. See Wikipedia's Comparison of Unicode Encodings.

This IBM article Encode your XML documents in UTF-8 is very helpful, and indicates that if you have the choice, it's better to choose UTF-8. The main reasons are wide tool support, and that UTF-8 can usually pass through systems that are unaware of Unicode.

From the section What the specs say in the IBM article:

Both the W3C and the IETF have recently become more adamant about choosing UTF-8 first, last, and sometimes only. The W3C Character Model for the World Wide Web 1.0: Fundamentals states, "When a unique character encoding is required, the character encoding MUST be UTF-8, UTF-16 or UTF-32. US-ASCII is upwards-compatible with UTF-8 (an US-ASCII string is also a UTF-8 string, see [RFC 3629]), and UTF-8 is therefore appropriate if compatibility with US-ASCII is desired." In practice, compatibility with US-ASCII is so useful it's almost a requirement. The W3C wisely explains, "In other situations, such as for APIs, UTF-16 or UTF-32 may be more appropriate. Possible reasons for choosing one of these include efficiency of internal processing and interoperability with other processes."

Robert Paulson
  • The IBM urls are broken, I think it's meant to link to http://www.ibm.com/developerworks/xml/library/x-utf8/ – ninMonkey Oct 27 '16 at 14:02
2

As everyone has said, UTF-8, UTF-16, and UTF-32 can all encode all of the Unicode code points. However, the UCS-2 (sometimes mistakenly referred to as UCS-16) variant can't, and this is the one that you find e.g. in Windows XP/Vista.

See Wikipedia for more information.

Edit: I am wrong about Windows; NT was the only one to support UCS-2. However, many Windows applications will assume a single word per code point as in UCS-2, so you are likely to find bugs. See another Wikipedia article. (Thanks JasonTrue)

Mark Ransom
  • Actually Windows XP/Vista support UTF-16, but many apps assume unicode data is UCS2 in cases when they should be checking for surrogate pairs. This is usually not a problem for simple cases, but a mess for character iteration, caret placement, or truncating strings. – JasonTrue Sep 25 '08 at 02:23
  • Way back when I tested with Windows 2000 it looked like it used UCS-2. I'm wondering whether those fonts simply weren't installed in my version of W2k... – Alexis Wilke Jan 09 '14 at 22:40
  • @AlexisWilke, the only way to know is to display a valid character from one of the upper planes and hope it displays the unknown character replacement box - if it displays two boxes it's UCS-2, if only one then it's UTF-16. – Mark Ransom Jan 09 '14 at 22:54
  • Ah! Good point. I do not remember that detail and I cannot install W2k on my computers anymore... Too old for the new hardware. Plus Ubuntu is awesome. – Alexis Wilke Jan 10 '14 at 01:21