UTF-8 uses and alternatives

Question

Under what circumstances would you recommend using UTF-8? Is there an alternative to it that will serve the same purpose?

UTF-8 is being used for i18n?

Maybe because of the second question, "UTF-8 is being used for i18n?". It is not very clear what you mean by that. – Paweł Dyda, Nov 29 '10 at 20:15

Christoffer · Accepted Answer · 2010-11-29T19:41:35.637

Since you tagged this with web design, I assume you need to optimize the code size to be as small as possible to transfer files quickly.

The alternatives to UTF-8 would be the other Unicode encodings, since there is no alternative to using Unicode (for regular computer systems at least).

If you look at how UTF-8 is specified, you'll see that code points up to U+007F require one octet, code points up to U+07FF two octets, code points up to U+FFFF three octets, and code points up to U+10FFFF four octets. UTF-16 needs two octets for code points up to U+FFFF (apart from the reserved surrogate range) and four octets for code points above that, up to U+10FFFF. UTF-32 needs four octets for every Unicode code point.
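To check those boundaries yourself, here is a minimal Python sketch. The helper name `encoded_sizes` is made up for illustration, and the little-endian codec variants are used only so that no byte order mark is counted in the lengths.

```python
def encoded_sizes(ch: str) -> dict:
    """Return the number of octets one character needs in each encoding."""
    return {
        "UTF-8": len(ch.encode("utf-8")),
        # The "-le" variants do not prepend a byte order mark (BOM),
        # so the length reflects the character alone.
        "UTF-16": len(ch.encode("utf-16-le")),
        "UTF-32": len(ch.encode("utf-32-le")),
    }


# The last code point of each UTF-8 length class mentioned above.
boundaries = ["\u007f", "\u07ff", "\uffff", "\U0010ffff"]

for ch in boundaries:
    print(f"U+{ord(ch):04X}", encoded_sizes(ch))
```

Running it should show UTF-8 growing from one to four octets across the four boundaries, UTF-16 switching from two to four octets only at the last one, and UTF-32 staying at four throughout.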

In other words, scripts that lie under U+07FF will be no larger in UTF-8 than in UTF-16 (and smaller wherever ASCII characters such as spaces and punctuation are involved), while scripts between U+0800 and U+FFFF will have some size penalty. However, since the domain is web design, it is worth noting that all the characters used for markup and code syntax lie in the one-octet range of UTF-8, which shrinks that penalty for documents with lots of, say, HTML markup and JavaScript compared to the amount of actual "text".
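The markup effect is easy to see with a small experiment. This is only a sketch: the sample strings and the helper name `sizes` are invented for illustration, and real pages will have very different ratios of markup to text.

```python
def sizes(text: str) -> tuple:
    """Return (UTF-8 octets, UTF-16 octets) for the text, without a BOM."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Five characters, all between U+0800 and U+FFFF (three octets each in UTF-8).
plain = "你好，世界"

# The same text wrapped in ASCII-only HTML markup.
marked_up = '<p class="greeting"><strong>你好</strong>，世界</p>'

for label, text in [("plain", plain), ("marked up", marked_up)]:
    utf8, utf16 = sizes(text)
    print(f"{label}: UTF-8 = {utf8} octets, UTF-16 = {utf16} octets")
```

For the bare Chinese text UTF-16 comes out smaller, but once the markup is added UTF-8 wins, because every markup character costs one octet instead of two.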

Scripts under U+07FF include Latin (except some extensions such as tone marks), Greek, Cyrillic, Hebrew and probably some more. Wikipedia has pretty good coverage of Unicode issues, and the Unicode Consortium's website has even more details.

score 2 · Answer 2 · answered Nov 29 '10 at 20:07

Since you are asking for recommendations, I recommend using it in all circumstances. All the time, i.e. for HTML files and textual resources. For an English-only application it doesn't change a thing, but when you actually need to localize it, having used UTF-8 from the start is a benefit (you won't need to revisit your code and change it; one less source of defects).

As for the other Unicode encodings (UTF-16 especially), I would not recommend using them for a web application. Although bandwidth consumption might actually be higher with UTF-8 for, e.g., Chinese characters (at least three bytes each), you'll avoid problems with transmission and browser interpretation (yeah, I know that in theory it should all work the same; unfortunately, in practice it tends to break).

score 1 · Answer 3 · answered Nov 29 '10 at 19:23

Use UTF-8 all the way. No excuses.

BalusC

unicode all the way I'd agree with, not necessarily utf8 though. – dan_waterworth Nov 29 '10 at 19:28

score -6 · Answer 4 · answered Nov 29 '10 at 19:13

use utf-8 for latin languages. utf-16 for every other language.

dan_waterworth

But UTF-16 is not backwards compatible with ASCII. – DarthVader Nov 29 '10 at 19:20
UTF-8 supports every other language perfectly. You're probably confusing it with ISO-8859. The only difference is that UTF-16 is 4 bytes wide while UTF-8 has a variable byte width (and thus consumes fewer bytes). – BalusC Nov 29 '10 at 19:21
@user177883, then you should have said that that was a constraint in the question. – dan_waterworth Nov 29 '10 at 19:23
@BalusC, consumes less space for latin languages – dan_waterworth Nov 29 '10 at 19:24
@BalusC: I would be very happy if your statement "supports every other language perfectly" were true. There are some problems caused by Supplementary Chinese Characters (the ones defined by GB18030:2005); although it is not UTF-xx, it is actually related to the Unicode Standard. Well, arguably version 6.0 supports them, but we've yet to see this standard implemented... For now "perfectly" is an overstatement... – Paweł Dyda Nov 29 '10 at 20:13
@Dan: Space is not always the biggest concern. You probably wouldn't want to fight some stupid browser idea to interpret Cyrillic as Chinese (ignoring charset declaration); At the same time, deploying different Content Transfer Encodings and charsets would probably complicate your servlet (for example) making it hard to maintain. Everything has pros and cons... – Paweł Dyda Nov 29 '10 at 20:20
sure, it's easiest to use utf8 for everything, but depending on your demographic and size, using utf16 may be worth it. – dan_waterworth Nov 29 '10 at 20:23
@dan: also for other languages. Basic Latin (the ASCII part) is only 1 byte and most of the rest are only 2-3 bytes. UTF-16 is 4 bytes for every character. See also [this answer](http://stackoverflow.com/questions/3569718/difficulties-inherent-in-ascii-and-extended-ascii-and-unicode-compatibility/3574856#3574856) for a summary. @Pawel: true, there are inconsistencies; I was just referring to the sole answer posted by dan. – BalusC Nov 29 '10 at 22:36
Errr... UTF-16 is 2 bytes per character. You're thinking of UTF-32. – dan_waterworth Nov 30 '10 at 07:34
UTF-16 is 2 bytes per *code unit*. A character may require 1 or 2 UTF-16 code units. – dan04 Dec 10 '10 at 04:21
Correct, but most (in fact, the vast majority of) characters require only one code unit. You definitely don't need 4 bytes for every character. – dan_waterworth Dec 10 '10 at 08:22
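
The code unit point from the last few comments can be checked directly. A rough sketch follows; `utf16_code_units` is a made-up helper name, and the sample characters are arbitrary picks, written as escapes so the code points are visible.

```python
def utf16_code_units(ch: str) -> int:
    """Count the 16-bit code units UTF-16 needs for a single character."""
    # Each UTF-16 code unit is two octets; "-le" avoids counting a BOM.
    return len(ch.encode("utf-16-le")) // 2


samples = {
    "Latin A": "\u0041",
    "Cyrillic ya": "\u044f",
    "CJK (BMP)": "\u4e2d",
    "CJK Extension B": "\U00020000",  # outside the BMP
    "Emoji": "\U0001f600",            # outside the BMP
}

for name, ch in samples.items():
    units = utf16_code_units(ch)
    print(f"{name}: U+{ord(ch):04X} -> {units} code unit(s), {units * 2} bytes")
```

Characters in the Basic Multilingual Plane take one code unit (2 bytes); anything above U+FFFF, including the supplementary Chinese characters mentioned earlier, takes a surrogate pair (4 bytes).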
