641

What are the differences between UTF-8, UTF-16, and UTF-32?

I understand that they will all store Unicode, and that each uses a different number of bytes to represent a character. Is there an advantage to choosing one over the other?

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
  • 69
    Watch this video if you are interested in how Unicode works http://www.youtube.com/watch?v=MijmeoH9LT4 –  Sep 30 '13 at 17:04
  • 1
    The video focuses on UTF-8, and yes it explains well how variable length encoding works and is mostly compatible with computers reading or writing only fixed length ASCII. Unicode guys were smart when designing UTF-8 encoding. – mins Jul 02 '14 at 19:33
  • 2
    UTF-8 is the de-facto standard in most modern software for **saved files**. More specifically, it's the most widely used encoding for HTML and configuration and translation files (Minecraft, for example, doesn't accept any other encoding for all its text information). UTF-32 is **fast for internal memory representation**, and UTF-16 is kind of **deprecated**, currently used only in Win32 for historical reasons (**UTF-16 was fixed-length** when Windows 95 was a thing) – Kotauskas May 29 '19 at 19:48
  • 1
    @VladislavToncharov UTF-16 was never a fixed length encoding. You're confusing it with UCS-2. –  Aug 27 '19 at 12:29
  • @Kotauskas Javascript still uses UTF-16 for almost everything – Radvylf Programs Oct 17 '20 at 14:53
  • @user60456 - I clicked the link, saw Tom Scott, and automatically upvoted your comment before even watching the video b/c Tom is freaking awesome and has a gift for conveying information. Thank you for the link. – GroggyOtter Feb 04 '23 at 00:43

14 Answers

497

UTF-8 has an advantage in the case where ASCII characters represent the majority of characters in a block of text, because UTF-8 encodes these into 8 bits (like ASCII). It is also advantageous in that a UTF-8 file containing only ASCII characters has the same encoding as an ASCII file.

UTF-16 is better where ASCII is not predominant, since it primarily uses 2 bytes per character. UTF-8 will start to use 3 or more bytes for the higher-order characters, where UTF-16 remains at just 2 bytes for most characters.

UTF-32 will cover all possible characters in 4 bytes. This makes it pretty bloated. I can't think of any advantage to using it.
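
To make the ASCII point concrete, here is a minimal C# sketch (using System.Text.Encoding; the sample string is arbitrary) showing that a pure-ASCII string produces exactly the same bytes in UTF-8 as in ASCII:

using System;
using System.Linq;
using System.Text;

class AsciiUtf8Check
{
    static void Main()
    {
        string text = "Hello, world!";                 // ASCII-only sample
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] utf8  = Encoding.UTF8.GetBytes(text);

        // Identical byte sequences: an ASCII file is already valid UTF-8.
        Console.WriteLine(ascii.SequenceEqual(utf8));  // True
    }
}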

Hong Ooi
  • 56,353
  • 13
  • 134
  • 187
AnthonyWJones
  • 187,081
  • 35
  • 232
  • 306
  • 219
    UTF-32 advantage: you don't need to decode stored data to the 32-bit Unicode code point for e.g. character by character handling. The code point is already available right there in your array/vector/string. – richq Jan 30 '09 at 17:48
  • @rq: You're quite right and Adam makes the same point. However, most character by character handling I've seen works with 16 bit short ints not with a vector of 32 bit integers. In terms of raw speed some operations will be quicker with 32 bits. – AnthonyWJones Jan 30 '09 at 18:07
  • 38
    It's also easier to parse if (heaven help you) you have to re-implement the wheel. – Paul McMillan Sep 29 '09 at 18:58
  • UTF32 advantage: When transferring over network especially in UDP, it's a good thing to know that 4 bytes is always one character, in scenarios where all characters are needed. – Mathias Lykkegaard Lorenzen Dec 19 '11 at 12:16
  • 39
    Well, UTF-8 has an advantage in network transfers - no need to worry about endianness since you're transfering data one byte at a time (as opposed to 4). – Tim Čas Dec 31 '11 at 14:20
  • 39
    @richq You can't do character-by-character handling in UTF-32, as code point does not always correspond to a character. – hamstergene Nov 13 '12 at 17:27
  • 1
    -1 just because there are a couple of advantages to UTF-32 even if they're not important enough to make many projects want to choose it. The only projects I know of that uses it is the word processor [AbiWord](http://www.abisource.com/). – hippietrail Mar 14 '13 at 21:30
  • 1
    @hippietrail could you name some advantages of utf-32, which are not already mentioned by other comments? – n611x007 Jun 06 '14 at 15:14
  • 1
    @naxa: The advantages of UTF-32 are already mentioned by other comments. AnthonyWJones said "I **can't think of any** advantage to use it." – hippietrail Jun 06 '14 at 22:42
  • 1
    Personally I use UTF-8 always except with Windows API code where I use UTF-16. Years ago when I was involved with AbiWord they chose to use UTF-32 internally and it is still the only project I know of to do this. I don't know if they stuck with it. Just because some people mix up "character" and "codepoint" doesn't mean that there's no advantages to knowing that codepoints are all a fixed size. – hippietrail Jun 06 '14 at 22:49
  • 11
    UTF-32 advantage: string manipulation is possibly faster compared to the utf-8 equivalent – Wes Jun 06 '15 at 16:36
  • @Wes: I doubt it; most of string handling is not done on a per-character basis (things like substring searches work equally well on ASCII, UTF-8, or arbitrary arrays of bytes, regardless of character data they [might] encode). If it is, then UTF-32 does *not* suffice (you can have multiple units per character even in fully-normalized UTF-32!). If anything, it makes it slower due to (roughly) 4x the data for copying. – Tim Čas Jun 24 '16 at 22:36
  • @Wes: I found a good source for this: [Figure 5 in this (official!) document](http://unicode.org/reports/tr15/#Multiple_Mark_Figure). Note how even the normalized characters are multiple code points (so, even in UTF-32). – Tim Čas Jun 24 '16 at 22:41
  • 5
    @TimČas Talking of code points, not graphemes. Locating a code point by offset is a very intensive operation in utf-8 as it requires full iteration with "jumps" of 2->4 bytes, while utf-32 has actual random access. Substring operations are faster consequently. Instead, as you said, locating graphemes requires full traversal in both encodings, but in utf-32 less jumps will be required. – Wes Jun 25 '16 at 10:02
  • 1
    @Wes: But what substring operations would need that? For example, finding a substring works just as well on UTF-8 as it does on UTF-32 (you're just finding a specific sequence of `uint8`s / `uint32`s). The index returned can *directly* be used for (say) slicing to the end of the string in both cases. – Tim Čas Jun 25 '16 at 16:45
  • 3
    works just as well, but that's not random access. for instance just knowing the length of a string (in code points) would require a full traversal of the byte array, while with utf-32 it's just sizeof(codepoints) – Wes Jun 26 '16 at 08:39
  • Another advantage of UTF8 is you don't need to duplicate your API. Like those nasty windows W versions of the API. Why didn't they adopt UTF8? – hookenz Dec 18 '17 at 01:59
  • 4
    Another way to describe UTF32's ability for random access is to say string slicing is O(1) in UTF32 and O(n) in UTF8 even in best cases. – Rich Remer Mar 28 '18 at 22:53
  • 2
    `utf-32` is not only more efficient for string operations (it supports random access; enough said!), but it's also simpler to manipulate by virtue of being a fixed-size array (I dare you work with `utf-8` in `C`...) – étale-cohomology Jun 09 '18 at 10:38
  • @Nawaz Note that some of the bits are used to identify what the size of the character is, so you don't get to use the entire 8 or 16 etc bits for your character. – Aaron Franke Jan 22 '19 at 10:19
  • @hippietrail The D language uses utf-32 in iteration, for example (even though it supports utf8/16/32 natively) – RandomB Oct 26 '20 at 05:53
  • @TimČas I am not sure about "slower" argument: in most cases blocks are transferring, not bytes (all external devices). In the case of DMA I have 2 ideas: 1) I dont know the difference b/w 16 bits vs 32 bits modes, but often it's better to use full register size then just a part of it *ALSO* performance of 32 bits may be the same as of 16 bits (everything happens on clock signal) 2) if we have 200Mb/s then we can accept that the times to copy of 1024 bytes and 64 bytes is super close – RandomB Oct 26 '20 at 06:01
  • 1
    @étale-cohomology Random access *to "characters"* (what Unicode technically calls "grapheme clusters") in UTF-32 is a myth. Even fully-normalized UTF-32 uses combining characters (consider emoji!). And like I said, you pretty much never need code point random access. – Tim Čas Oct 27 '20 at 22:47
  • @RandomB Memory-copy operations on byte streams typically *do* use word-sized moves. A common optimization of `memcpy` is to split the copy into unaligned copy (first 0-7 bytes), and then do 64-bit copies for the main part --- plus another partial at the end. All that matters in the end (for non-trivially-small sizes) is the amount of data, not what the base units are. – Tim Čas Oct 27 '20 at 22:49
  • I think the question is not random or not, but how often jumps happen, about statistics, in UTF-8 they happen more often, so all iterations will be slower. – RandomB Oct 28 '20 at 06:34
  • @TimČas I very much disagree with you there. Slicing and getting specific characters is extremely useful and common. I think you're getting characters confused with graphemes; UTF-32 gives you O(1) indexing into a string of characters (not graphemes), but I'd guess that a very large percentage of string manipulation doesn't actually care about graphemes. – Radvylf Programs Sep 08 '21 at 20:17
  • @RedwolfPrograms "Character" is ambigious in Unicode (https://unicode.org/glossary/#character), but people typically mean "grapheme cluster" or "code point" when they say it. I'm not sure which one *you* mean. But anyway, go on, name 1 scenario where you need to deal in anything but strings-as-whole-units *or* graphemes, other than rendering glyphs (where you need code points in order to reference TTF/OTF internal tables, making it a bit of a circular argument). – Tim Čas Sep 21 '21 at 10:08
  • @TimČas Any situation where how the string actually looks just...doesn't matter. Which is most. E.g., I write a lot of interpreters. They don't care if an emoji is part of a grapheme or on its own, it's just a part of an identifier (potentially along with a ZWJ and something else). There has literally never been a situation where I've had to handle grapheme clusters in my code, but I do stuff involving string manipulation basically every day. – Radvylf Programs Sep 21 '21 at 13:30
421

In short:

  • UTF-8: Variable-width encoding, backwards compatible with ASCII. ASCII characters (U+0000 to U+007F) take 1 byte, code points U+0080 to U+07FF take 2 bytes, code points U+0800 to U+FFFF take 3 bytes, code points U+10000 to U+10FFFF take 4 bytes. Good for English text, not so good for Asian text.
  • UTF-16: Variable-width encoding. Code points U+0000 to U+FFFF take 2 bytes, code points U+10000 to U+10FFFF take 4 bytes. Bad for English text, good for Asian text.
  • UTF-32: Fixed-width encoding. All code points take four bytes. An enormous memory hog, but fast to operate on. Rarely used.

In long: see Wikipedia: UTF-8, UTF-16, and UTF-32.
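
To make those sizes concrete, here is a minimal C# sketch (System.Text.Encoding; the sample characters are arbitrary picks from each range) that prints the byte count of single code points in each encoding:

using System;
using System.Text;

class EncodingSizes
{
    static void Main()
    {
        // One sample per UTF-8 length class: ASCII, Latin-1 supplement,
        // CJK (BMP), and a supplementary-plane emoji.
        foreach (string s in new[] { "A", "é", "語", "😀" })
        {
            Console.WriteLine($"{s}: UTF-8={Encoding.UTF8.GetByteCount(s)}  " +
                              $"UTF-16={Encoding.Unicode.GetByteCount(s)}  " +
                              $"UTF-32={Encoding.UTF32.GetByteCount(s)}");
        }
        // A: 1/2/4   é: 2/2/4   語: 3/2/4   😀: 4/4/4 bytes
    }
}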

Adam Rosenfield
  • 390,455
  • 97
  • 512
  • 589
  • 4
    The reason UTF-16 works is that U+D800–U+DFFF are left as a gap in the BMP for the surrogate pairs. Clever. – Douglas Leeder Jan 30 '09 at 17:19
  • 70
    @spurrymoses: I'm referring strictly to the amount of space taken up by the data bytes. UTF-8 requires 3 bytes per Asian character, while UTF-16 only requires 2 bytes per Asian character. This really isn't a major problem, since computers have tons of memory these days compared to the average amount of text stored in a program's memory. – Adam Rosenfield Aug 04 '09 at 14:19
  • 14
    UTF-32 isn't rarely used anymore... on osx and linux `wchar_t` defaults to 4 bytes. gcc has an option `-fshort-wchar` which reduces the size to 2 bytes, but breaks the binary compatibility with std libs. – vine'th Nov 14 '11 at 10:12
  • 9
    @PandaWood Of course UTF-8 can encode any character! But have you compared the memory requirement with that for UTF-16? You seem to be missing the point! – Ustaman Sangat Dec 15 '11 at 16:57
  • 17
    If someone were to say UTF-8 is "not so good for Asian text" in the context of All Encoding Formats Including Those That Cannot Encode Unicode, they would of course be wrong. But that is not the context. The context of memory requirements comes from the fact that the question (and answer) is comparing UTF-8, UTF-16 and UTF-32, which will all encode Asian text but use differing amounts of memory/storage. It follows that their relative goodness would naturally be entirely in the context of memory requirements. "Not so good" != "not good". – Paul Gregory Jan 23 '13 at 11:39
  • 1
    @vine'th: `wchar_t` has not gained much popularity that I've seen. The fact that it's 16 bits wide on Windows and 32 bits wide on *nix is probably a contributor to its lack of acceptance. In *nix most projects eschew `wchar_t` and just use `char` with UTF-8. – hippietrail Mar 14 '13 at 21:51
  • 3
    Wikipedia remarks that in real-world usage, UTF-8 turns out to be smaller than UTF-16 even when using non-English characters, because of the number of spaces and English words still used in text. – Didier A. Mar 20 '13 at 15:56
  • Is there no reference source available more trustworthy than Wikipedia? (Not that stackoverflow is any better in that regard...) – McGafter Oct 18 '13 at 09:18
  • 8
    @McGafter: Well of course there is. If you want trustworthiness, go straight to the horse's mouth at [The Unicode Consortium](http://www.unicode.org/versions/Unicode6.3.0/). See chapter 2.5 for a description of the UTF-* encodings. But for obtaining a simple, high-level understanding of the encodings, I find that the Wikipedia articles are a much more approachable source. – Adam Rosenfield Oct 18 '13 at 16:50
  • 2
    @PandaWood web pages contain a lot of ASCII characters that aren't part of the body text, so UTF-8 is a good choice for those no matter what language you're using. – Mark Ransom Dec 10 '15 at 04:28
  • 4
    While UTF-8 does take 3 bytes for most Asian characters vs 2 for UTF-16 (some Chinese characters in common use ended up outside the Basic Multilingual Plane, where they take 4 bytes in both UTF-8 and UTF-16), in practice this does not make much difference because real documents often have a large number of ASCII characters mixed in. See http://utf8everywhere.org/#asian for side-by-side size comparisons of one real document: UTF-8 actually took *50% fewer bytes* to encode a Japanese-language HTML page (the Wikipedia article on Japan, in Japanese) than UTF-16 did. – rmunn Aug 02 '17 at 03:31
  • 1
    Absolutely true about the Chinese language: I created 2 files with Notepad containing Chinese text; I saved one in UTF-16 and the other in UTF-8. The ratio is `utf8_size/utf16_size = 1.4` (about 4K vs 2K). The Cyrillic ratio is different: `utf16_size/utf8_size = 1.14` (about 13K vs 11K) – RandomB Oct 26 '20 at 06:12
156
  • UTF-8 is variable 1 to 4 bytes.

  • UTF-16 is variable 2 or 4 bytes.

  • UTF-32 is fixed 4 bytes.

beeselmane
  • 1,111
  • 8
  • 26
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • 49
    UTF8 is actually 1 to 6 bytes. – Urkle Feb 24 '14 at 21:17
  • 8
    @Urkle is technically correct because mapping the full range of UTF32/LE/BE includes U-00200000 - U-7FFFFFFF even though Unicode v6.3 ends at U-0010FFFF inclusive. Here's a nice breakdown of how to enc/dec 5 and 6 byte utf8: https://lists.gnu.org/archive/html/help-flex/2005-01/msg00030.html –  May 13 '14 at 23:08
  • 4
    backing up these with relevant references parts and their sources? – n611x007 Jun 06 '14 at 15:15
  • 31
    @Urkle No, UTF-8 can not be 5 or 6 bytes. Unicode code points are limited to 21 bits, which limits UTF-8 to 4 bytes. (You could of course extend the principle of UTF-8 to encode arbitrary large integers, but it would not be Unicode.) See RFC 3629. – rdb Sep 20 '15 at 14:17
  • Why are all you varying the size? UTF-8 1-4 bytes.. then 1-6. then why other UTFs? – Asif Mushtaq Apr 02 '16 at 19:27
  • 17
    Quoting Wikipedia: In November 2003, UTF-8 was restricted by RFC 3629 to match the constraints of the UTF-16 character encoding: explicitly prohibiting code points corresponding to the high and low surrogate characters removed more than 3% of the three-byte sequences, and ending at U+10FFFF removed more than 48% of the four-byte sequences and all five- and six-byte sequences. – Adam Calvet Bohl Jan 20 '17 at 11:48
  • Could the standard be extended in the future to allow any of these to use 5 bytes, or are they limited to 4 exactly in some technical way? – Aaron Franke Jan 22 '19 at 10:22
  • 2
    @AaronFranke: the first byte can define up to 7 continuation bytes, so it can technically be extended up to 8 bytes (36 payload bits ~ 68 billion codepoints) per sequence. – Quassnoi Jan 28 '19 at 23:20
  • @quas 7 continuation bytes at 6 payload bits makes 42. – Deduplicator Feb 06 '22 at 16:04
  • @Deduplicator: I meant 6 continuation bytes, of course, thanks for noticing – Quassnoi Feb 06 '22 at 17:01
  • @Quassnoi Well, 0xFF could be invalid, or signal 7 continuation bytes... – Deduplicator Feb 07 '22 at 16:11
  • @Deduplicator: you're right, this way it can be 42 bits indeed – Quassnoi Feb 07 '22 at 19:49
107

Unicode defines a single huge character set, assigning one unique integer value to every graphical symbol (that is a major simplification, and isn't actually true, but it's close enough for the purposes of this question). UTF-8/16/32 are simply different ways to encode this.

In brief, UTF-32 uses 32-bit values for each character. That allows it to use a fixed-width code for every character.

UTF-16 uses 16-bit values by default, but that only gives you 65k possible characters, which is nowhere near enough for the full Unicode set. So some characters use pairs of 16-bit values.

And UTF-8 uses 8-bit values by default, which means that the first 128 values are fixed-width single-byte characters (the most significant bit is used to signify that this is the start of a multi-byte sequence, leaving 7 bits for the actual character value). All other characters are encoded as sequences of up to 4 bytes (if memory serves).

And that leads us to the advantages. Any ASCII character is directly compatible with UTF-8, so for upgrading legacy apps, UTF-8 is a common and obvious choice. In almost all cases, it will also use the least memory. On the other hand, you can't make any guarantees about the width of a character. It may be 1, 2, 3 or 4 bytes wide, which makes string manipulation difficult.

UTF-32 is the opposite: it uses the most memory (each character is a fixed 4 bytes wide), but on the other hand, you know that every character has this precise length, so string manipulation becomes far simpler. You can compute the number of characters in a string simply from the length in bytes of the string. You can't do that with UTF-8.

UTF-16 is a compromise. It lets most characters fit into a fixed-width 16-bit value. So as long as you don't have Chinese symbols, musical notes or some others, you can assume that each character is 16 bits wide. It uses less memory than UTF-32. But it is in some ways "the worst of both worlds". It almost always uses more memory than UTF-8, and it still doesn't avoid the problem that plagues UTF-8 (variable-length characters).

Finally, it's often helpful to just go with what the platform supports. Windows uses UTF-16 internally, so on Windows, that is the obvious choice.

Linux varies a bit, but they generally use UTF-8 for everything that is Unicode-compliant.

So short answer: All three encodings can encode the same character set, but they represent each character as different byte sequences.
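
To see the "pairs of 16-bit values" point in practice, here is a minimal C# sketch (.NET strings are UTF-16 internally, so Length counts 16-bit code units rather than code points; the sample string is arbitrary):

using System;

class SurrogatePairs
{
    static void Main()
    {
        string s = "A𝄞";   // U+0041 plus U+1D11E (a musical symbol outside the BMP)

        // Length counts UTF-16 code units: 'A' is one unit, the clef is a surrogate pair.
        Console.WriteLine(s.Length);        // 3

        // Counting code points means walking the string and skipping over pairs.
        int codePoints = 0;
        for (int i = 0; i < s.Length; i += char.IsSurrogatePair(s, i) ? 2 : 1)
            codePoints++;
        Console.WriteLine(codePoints);      // 2
    }
}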

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
jalf
  • 243,077
  • 51
  • 345
  • 550
  • 17
    It is inaccurate to say that Unicode assigns a unique integer to each **graphical symbol**. It assigns such to each code point, but some code points are **invisible control characters**, and some graphical symbols require **multiple code points** to represent. – tchrist Mar 07 '12 at 01:58
  • 21
    @tchrist: yes, it's inaccurate. The problem is that to accurately explain Unicode, you need to write thousands of pages. I hoped to get the basic concept across to explain the difference between encodings – jalf Mar 07 '12 at 09:42
  • @jalf lol right so basically to explain Unicode you would have to write the [Unicode Core Specification](http://www.unicode.org/glossary/#core_specification) – Justin Ohms Jul 08 '16 at 18:07
  • @tchrist More specifically, you can construct Chinese symbols out of provided primitives (but they're in the same chart, so you'll just end up using unreal amount of space - either disk or RAM - to encode them) instead of using the built-in ones. – Kotauskas May 29 '19 at 19:54
  • 2
    Best answer by far – z33k Aug 19 '20 at 10:06
  • 2
    Note that the description of UTF-32 is incorrect. Each character is not 4 bytes wide. Each code point is 4 bytes wide, and some characters may require multiple code points. Computing string length is not just the number of bytes divided by 4, you have to walk the whole string and decode each code point to resolve these clusters. – CDahn Feb 01 '22 at 06:16
57

Unicode is a standard; you can think of UTF-x as a technical implementation for some practical purposes:

  • UTF-8 - "size optimized": best suited for Latin-character-based data (or ASCII). It takes only 1 byte per character, but the size grows with symbol variety (in the worst case a character can take up to 4 bytes)
  • UTF-16 - "balance": it takes a minimum of 2 bytes per character, which is enough for the existing set of mainstream languages and gives them a fixed size to ease character handling (but the size is still variable and can grow up to 4 bytes per character)
  • UTF-32 - "performance": allows the use of simple algorithms as a result of fixed-size characters (4 bytes), but at a memory disadvantage
rogerdpack
  • 62,887
  • 36
  • 269
  • 388
rook
  • 5,880
  • 4
  • 39
  • 51
47

I tried to give a simple explanation in my blogpost.

UTF-32

requires 32 bits (4 bytes) to encode any character. For example, in order to represent the "A" character's code point using this scheme, you'll need to write 65 as a 32-bit binary number:

00000000 00000000 00000000 01000001 (Big Endian)

If you take a closer look, you'll note that the rightmost seven bits are actually the same bits as in the ASCII scheme. But since UTF-32 is a fixed-width scheme, we must attach three additional bytes. Meaning that if we have two files that only contain the "A" character, one ASCII-encoded and the other UTF-32 encoded, their sizes will be 1 byte and 4 bytes respectively.

UTF-16

Many people think that as UTF-32 uses fixed width 32 bit to represent a code-point, UTF-16 is fixed width 16 bits. WRONG!

In UTF-16 the code point may be represented in either 16 bits or 32 bits, so this scheme is a variable-length encoding system. What is the advantage over UTF-32? At least for ASCII, the size of files won't be 4 times the original (but still twice), so we're still not backward compatible with ASCII.

Since 7 bits are enough to represent the "A" character, we can now use 2 bytes instead of 4 as in UTF-32. It'll look like:

00000000 01000001

UTF-8

You guessed right. In UTF-8 the code point may be represented using 8, 16, 24 or 32 bits, and like the UTF-16 system, this one is also a variable-length encoding system.

Finally we can represent "A" in the same way we represent it using ASCII encoding system:

01000001

A small example where UTF-16 is actually better than UTF-8:

Consider the Chinese character "語" - its UTF-8 encoding is:

11101000 10101010 10011110

While its UTF-16 encoding is shorter:

10001010 10011110

In order to understand the representation and how it's interpreted, visit the original post.
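
If you want to verify those bit patterns yourself, here is a minimal C# sketch (big-endian encoders are chosen so the printed bytes read left to right like the diagrams above):

using System;
using System.Text;

class BitPatterns
{
    static void Main()
    {
        var utf16be = new UnicodeEncoding(true, false);   // big endian, no BOM
        var utf32be = new UTF32Encoding(true, false);     // big endian, no BOM

        foreach (string s in new[] { "A", "語" })
        {
            Console.WriteLine($"{s} UTF-8 : {ToBits(Encoding.UTF8.GetBytes(s))}");
            Console.WriteLine($"{s} UTF-16: {ToBits(utf16be.GetBytes(s))}");
            Console.WriteLine($"{s} UTF-32: {ToBits(utf32be.GetBytes(s))}");
        }
    }

    // Render each byte as 8 binary digits, space separated.
    static string ToBits(byte[] bytes) =>
        string.Join(" ", Array.ConvertAll(bytes, b => Convert.ToString(b, 2).PadLeft(8, '0')));
}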

Maroun
  • 94,125
  • 30
  • 188
  • 241
  • https://stackoverflow.com/questions/3864842/should-i-change-from-utf-8-to-utf-16-to-accommodate-chinese-characters-in-my-htm#:~:text=It's%20not%20that%20UTF%2D8,represented%20still%20as%201%20byte. – Smart Manoj Sep 16 '20 at 07:34
  • How come computers don't 'drop' UTF-32 encoded numbers that contain a lot of zeros? Like, representing 'A' will contain 26-27 zeros... – Arik Jordan Graham Jan 05 '22 at 20:17
23

UTF-8

  • has no concept of byte-order
  • uses between 1 and 4 bytes per character
  • ASCII is a compatible subset of the encoding
  • completely self-synchronizing, e.g. a dropped byte anywhere in a stream will corrupt at most a single character
  • pretty much all European languages are encoded in two bytes or less per character

UTF-16

  • must be parsed with known byte-order or reading a byte-order-mark (BOM)
  • uses either 2 or 4 bytes per character

UTF-32

  • every character is 4 bytes
  • must be parsed with known byte-order or reading a byte-order-mark (BOM)

UTF-8 is going to be the most space efficient unless a majority of the characters are from the CJK (Chinese, Japanese, and Korean) character space.

UTF-32 is best for random access by character offset into a byte-array.
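
To see why byte order matters for UTF-16 (and not for UTF-8), here is a minimal C# sketch comparing the little-endian and big-endian encoders and the BOM each one writes; the sample character is arbitrary:

using System;
using System.Text;

class ByteOrder
{
    static void Main()
    {
        string s = "A";   // U+0041

        // Same character, two byte orders - a reader has to know which one it is.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetBytes(s)));           // 41-00 (little endian)
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetBytes(s)));  // 00-41 (big endian)

        // The byte-order marks each encoder would write at the start of a file.
        Console.WriteLine(BitConverter.ToString(Encoding.Unicode.GetPreamble()));          // FF-FE
        Console.WriteLine(BitConverter.ToString(Encoding.BigEndianUnicode.GetPreamble())); // FE-FF

        // UTF-8 is a plain byte stream, so there is no byte-order question at all.
        Console.WriteLine(BitConverter.ToString(Encoding.UTF8.GetBytes(s)));               // 41
    }
}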

Community
  • 1
  • 1
Jeff Adamson
  • 387
  • 2
  • 5
  • How does "self synchronizing" work in UTF-8? Can you give examples for 1 byte and 2 byte characters? – Koray Tugay Jul 08 '16 at 16:57
  • 2
    @KorayTugay Valid shorter byte strings are never used in longer characters. For instance, ASCII is in the range 0-127, meaning all one-byte characters have the form `0xxxxxxx` in binary. All two-byte characters begin with `110xxxxx` with a second byte of `10xxxxxx`. So let's say the first character of a two-byte character is lost. As soon as you see `10xxxxxx` without a preceding `110xxxxxx`, you can determine for sure that a byte was lost or corrupted, and discard that character (or re-request it from a server or whatever), and move on until you see a valid first byte again. – Chris Aug 01 '17 at 23:38
  • 1
    if you have the offset to a character, you have the offset to that character -- utf8, utf16 or utf32 will work just the same in that case; i.e. they are all equally good at random access by character offset into a byte array. The idea that utf32 is better at counting characters than utf8 is also completely false. A *codepoint* (which is *not* the same as a character which again, is not the same as a grapheme.. sigh), is 32 bits wide in utf32 and between 8 and 32 bits in utf8, but a character may span multiple codepoints, which destroys the major advantage that people claim utf32 has over utf8. – Clearer Nov 25 '17 at 22:19
  • @Clearer But how often do you need to work with characters/graphemes rather than just codepoints? I have worked on a number of projects involving heavy string manipulation, and being able to slice/index codepoints in O(1) really is very helpful. – Radvylf Programs Sep 08 '21 at 20:26
  • @RedwolfPrograms Today I don't, but I used to work in language anaylsis, where it was very important. – Clearer Sep 27 '21 at 06:48
15

I made some tests to compare database performance between UTF-8 and UTF-16 in MySQL.

Update Speeds

UTF-8

[benchmark chart]

UTF-16

[benchmark chart]

Insert Speeds

[benchmark charts]

Delete Speeds

[benchmark charts]

Community
  • 1
  • 1
Farid Movsumov
  • 12,350
  • 8
  • 71
  • 97
  • 3
    Just one short string doesn't mean anything, just one record even less, the time differences may have been due to other factors, Mysql's own internal mechanisms, if you want to do a reliable test, you would need to use at least 10,000 records with a 200 character string, and it would need to be a set of tests, with some scenarios, at least about 3, so it would isolate the encoding factor – danilo Oct 13 '20 at 16:29
15

In UTF-32 all characters are coded with 32 bits. The advantage is that you can easily calculate the length of the string. The disadvantage is that for each ASCII character you waste an extra three bytes.

In UTF-8 characters have variable length: ASCII characters are coded in one byte (eight bits), most western special characters are coded in either two or three bytes (for example € is three bytes), and more exotic characters can take up to four bytes. A clear disadvantage is that, a priori, you cannot calculate the string's length. But it takes a lot fewer bytes to encode Latin (English) alphabet text compared to UTF-32.

UTF-16 is also variable length. Characters are coded in either two or four bytes. I really don't see the point. It has the disadvantage of being variable length, but hasn't got the advantage of saving as much space as UTF-8.

Of the three, UTF-8 is clearly the most widespread.
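
As an illustration of the length point (counting code points, not visible characters), here is a minimal C# sketch with an arbitrary sample string: in UTF-32 the count is just the byte length divided by 4, while UTF-8 needs a scan that skips continuation bytes:

using System;
using System.Text;

class CodePointCount
{
    static void Main()
    {
        string s = "été";   // 3 code points (é written as precomposed U+00E9)

        byte[] utf32 = Encoding.UTF32.GetBytes(s);
        byte[] utf8  = Encoding.UTF8.GetBytes(s);

        // UTF-32: every code point is exactly 4 bytes.
        Console.WriteLine(utf32.Length / 4);   // 3

        // UTF-8: count the bytes that are not continuation bytes (10xxxxxx).
        int count = 0;
        foreach (byte b in utf8)
            if ((b & 0xC0) != 0x80) count++;
        Console.WriteLine(count);              // 3 (from 5 bytes)

        // Either way this is a code point count; combining marks can still make
        // the number of visible characters differ.
    }
}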

Ahmad F
  • 30,560
  • 17
  • 97
  • 143
vartec
  • 131,205
  • 36
  • 218
  • 244
  • Why would I want to calculate the length of the string while developing websites? Is there any advantage of choosing UTF-8/UTF-16 in web development? – Morfidon Sep 08 '17 at 08:07
  • 1
    "The advantage is that you can easily calculate the length of the string" If you define length by the # of codepoints, then yes, you can just divide the byte length by 4 to get it with UTF-32. That's not a very useful definition, however : it may not relate to the number of characters. Also, normalization may alter the number of codepoints in the string. For example, the french word "été" can be encoded in at least 4 different ways, with 3 distinct codepoint lengths. –  Aug 27 '19 at 12:37
  • 1
    UTF-16 is possibly faster than UTF-8 while also not wasting memory like UTF-32 does. – DexterHaxxor Jun 05 '20 at 11:58
  • @MichalŠtein But it also gives you the worst of both worlds; it uses up more space than UTF-8 for ASCII, but it also has all of the same issues caused by having multiple codepoints per character (in addition to potential endianness issues). – Radvylf Programs Sep 08 '21 at 20:38
11

I'm surprised this question is 11 years old and not one of the answers mentions the #1 advantage of utf-8.

utf-8 generally works even with programs that are not utf-8 aware. That's partly what it was designed for. Other answers mention that the first 128 code points are the same as ASCII. All other code points are encoded using 8-bit values with the high bit set (values from 128 to 255), so from the POV of a non-Unicode-aware program the strings just look like ASCII with some extra characters.

As an example let's say you wrote a program to add line numbers that effectively does this (and to keep it simple let's assume end of line is just ASCII 13)

// pseudo code

function readLine
  if end of file
     return null
  read bytes (8bit values) into string until you hit 13 or end of file
  return string

function main
  lineNo = 1
  do {
    s = readLine
    if (s == null) break;
    print lineNo++, s
  }  

Passing a utf-8 file to this program will continue to work. Similarly, splitting on tabs or commas, parsing for ASCII quotes, or other parsing for which only ASCII values are significant all just works with utf-8, because no ASCII value appears in utf-8 except when it is actually meant to be that ASCII value.

Some other answers or comments mentions that utf-32 has the advantage that you can treat each codepoint separately. This would suggest for example you could take a string like "ABCDEFGHI" and split it at every 3rd code point to make

ABC
DEF
GHI

This is false. Many code points affect other code points. For example, the color (skin tone) selector code points that let you choose between variants of an emoji. If you split at an arbitrary code point you'll break those.

Another example is the bidirectional code points. The following paragraph was not entered backward. It is just preceded by the 0x202E codepoint

  • ‮This line is not typed backward it is only displayed backward

So no, utf-32 will not let you just randomly manipulate unicode strings without a thought to their meanings. It will let you look at each codepoint with no extra code.

FYI though, utf-8 was designed so that looking at any individual byte you can find out the start of the current code point or the next code point.

Take an arbitrary byte in utf-8 data. If it is < 128 it's a complete code point by itself. If it's >= 128 and < 192 (the top 2 bits are 10), then to find the start of the code point you need to look at the preceding bytes until you find one with a value >= 192 (the top 2 bits are 11). At that byte you've found the start of a code point, and that byte encodes how many subsequent bytes make up the code point.

If you want to find the next code point, just scan forward until you hit a byte < 128 or >= 192; that's the start of the next code point.

Num bytes   First code point   Last code point   Byte 1     Byte 2     Byte 3     Byte 4
1           U+0000             U+007F            0xxxxxxx
2           U+0080             U+07FF            110xxxxx   10xxxxxx
3           U+0800             U+FFFF            1110xxxx   10xxxxxx   10xxxxxx
4           U+10000            U+10FFFF          11110xxx   10xxxxxx   10xxxxxx   10xxxxxx

where the x's are the bits of the code point. Concatenate the x bits from the bytes to get the code point.
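
Here is a minimal C# sketch of that rule (it assumes well-formed UTF-8 and an arbitrary sample string): given any index into a UTF-8 byte array, back up to the start of the code point and decode it:

using System;
using System.Text;

class Utf8Resync
{
    // Back up from an arbitrary index to the lead byte, then decode the code point.
    static int CodePointAt(byte[] utf8, int index, out int start)
    {
        // Continuation bytes look like 10xxxxxx; step backwards past them.
        while (index > 0 && (utf8[index] & 0xC0) == 0x80)
            index--;
        start = index;

        byte lead = utf8[index];
        int extra, cp;
        if (lead < 0x80)      { extra = 0; cp = lead; }          // 0xxxxxxx
        else if (lead < 0xE0) { extra = 1; cp = lead & 0x1F; }   // 110xxxxx
        else if (lead < 0xF0) { extra = 2; cp = lead & 0x0F; }   // 1110xxxx
        else                  { extra = 3; cp = lead & 0x07; }   // 11110xxx

        for (int i = 1; i <= extra; i++)
            cp = (cp << 6) | (utf8[index + i] & 0x3F);           // append 6 payload bits
        return cp;
    }

    static void Main()
    {
        byte[] data = Encoding.UTF8.GetBytes("a語b");            // 1 + 3 + 1 bytes
        int cp = CodePointAt(data, 2, out int start);            // index 2 lands mid-character
        Console.WriteLine($"start={start}, U+{cp:X4}");          // start=1, U+8A9E
    }
}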

samanthaj
  • 521
  • 3
  • 14
6

Depending on your development environment you may not even have the choice what encoding your string data type will use internally.

But for storing and exchanging data I would always use UTF-8, if you have the choice. If you have mostly ASCII data this will give you the smallest amount of data to transfer, while still being able to encode everything. Optimizing for the least I/O is the way to go on modern machines.

Peter Mortensen
  • 30,738
  • 21
  • 105
  • 131
mghie
  • 32,028
  • 6
  • 87
  • 129
  • 1
    Arguably, a lot more important than space requirements is the fact, that UTF-8 is immune to endianness. UTF-16 and UTF-32 will inevitably have to deal with endianness issues, where UTF-8 is simply a stream of octets. – IInspectable Sep 02 '18 at 18:26
2

As mentioned, the difference is primarily the size of the underlying variables, which in each case get larger to allow more characters to be represented.

However, fonts, encoding and things are wickedly complicated (unnecessarily?), so a big link is needed to fill in more detail:

http://www.cs.tut.fi/~jkorpela/chars.html#ascii

Don't expect to understand it all, but if you don't want to have problems later it's worth learning as much as you can, as early as you can (or just getting someone else to sort it out for you).

Paul.

Paul W Homer
  • 2,728
  • 1
  • 19
  • 25
1

After reading through the answers, UTF-32 needs some loving.

C#:

using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Text;

// 500,000,000 random bytes; each encoding decodes the same buffer
// (invalid sequences become replacement characters), so this measures raw decode speed.
byte[] Data1 = RandomNumberGenerator.GetBytes(500_000_000);

Stopwatch sw = Stopwatch.StartNew();
int l = Encoding.UTF8.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-8: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

sw = Stopwatch.StartNew();
l = Encoding.Unicode.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"Unicode: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

sw = Stopwatch.StartNew();
l = Encoding.UTF32.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"UTF-32: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

sw = Stopwatch.StartNew();
l = Encoding.ASCII.GetString(Data1).Length;
sw.Stop();
Console.WriteLine($"ASCII: Elapsed - {sw.ElapsedMilliseconds * .001:0.000s}   Size - {l:###,###,###}");

UTF-8 -- Elapsed 9.939s - Size 473,752,800

Unicode -- Elapsed 0.853s - Size 250,000,000

UTF-32 -- Elapsed 3.143s - Size 125,030,570

ASCII -- Elapsed 2.362s - Size 500,000,000

UTF-32 -- MIC DROP

-2

In short, the only reason to use UTF-16 or UTF-32 is to support non-English and ancient scripts respectively.

I was wondering why anyone would choose a non-UTF-8 encoding when it is obviously more efficient for web/programming purposes.

A common misconception - the suffixed number is NOT an indication of its capability. They all support the complete Unicode range; it's just that UTF-8 can handle ASCII with a single byte, so it is MORE efficient/less corruptible for the CPU and over the internet.

Some good reading: http://www.personal.psu.edu/ejp10/blogs/gotunicode/2007/10/which_utf_do_i_use.html and http://utf8everywhere.org

killjoy
  • 940
  • 1
  • 11
  • 16
  • I'm not sure why you suggest that UTF-16 or UTF-32 are there to support non-English text. UTF-8 can handle that just fine. And there are non-ASCII characters in English text, too. Like a zero-width non-joiner. Or an em dash. I'm afraid this answer doesn't add much value. – IInspectable Sep 02 '18 at 18:30
  • This question is liable to downvoting because UTF-8 is still commonly used in HTML files even if the majority of the characters are 3-byte characters in UTF-8, – Ṃųỻịgǻňạcểơửṩ Dec 05 '19 at 23:22
  • @IInspectable support is not the best wording, promote or better support would be more accurate – robotik Feb 28 '20 at 10:37
  • Sending a page like http://utf8everywhere.org is not what I would do in a SO answer. – DexterHaxxor Jun 05 '20 at 12:03