
What should be used, and when? Is it always better to use UTF-8, or does ISO-8859-1 still have importance in specific conditions?

Is the character set related to geographic region?


Is there a benefit to putting @charset "utf-8"; at the top of the CSS file?

Or to declaring the charset on the link tag instead, like this?

<link type="text/css; charset=utf-8" rel="stylesheet" href=".." />

I found this on the subject:

If Dreamweaver adds the tag when you add embedded style to the document, that is a bug in Dreamweaver. From the W3C FAQ:

"For style declarations embedded in a document, @charset rules are not needed and must not be used."

The charset specification has been part of CSS since version 2.0 (May 1998), so if you have a charset specification in a CSS file and Safari can't handle it, that's a bug in Safari.

And should I add accept-charset to the form, like this?

<form action="/action" method="post" accept-charset="utf-8">

And what should be used if I use the XHTML doctype?

<?xml version="1.0" encoding="UTF-8"?>

or

<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />

5 Answers


Unicode is taking over and has already surpassed all others. I suggest you hop on the train right now.

Note that there are several flavors of Unicode. Joel Spolsky gives an overview.
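To make "flavors" concrete: UTF-8, UTF-16 and UTF-32 are just different byte serializations of the same code points. A quick sketch in Python (any recent version) shows one string coming out as three different byte sequences:

    s = "héllo"
    # One sequence of code points, three encoded forms of it
    for enc in ("utf-8", "utf-16-le", "utf-32-le"):
        print(enc, len(s.encode(enc)), "bytes:", s.encode(enc))

UTF-8 yields 6 bytes here, UTF-16 yields 10 and UTF-32 yields 20; which one "wins" on size depends entirely on the text.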

[Graph: "Unicode is winning" - usage share of character encodings on the web over time. Current as of Feb. 2012; see comment below for more exact values.]

  • The majority of the Web is UTF-8 now: http://w3techs.com/technologies/overview/character_encoding/all – dan04 Aug 03 '10 at 02:40
  • Just to be absolutely crystal clear, what is meant by "flavors of Unicode" is that there are different ways to encode Unicode. – Peter Sep 23 '11 at 05:56
  • Thanks for the link to the most concise and appropriately named article I have seen in a while. – atw Sep 12 '16 at 13:27

UTF-8 is supported everywhere on the web; it is only specific applications that don't support it. You should always use UTF-8 if you can.

The downside is that for languages such as Chinese, UTF-8 takes more space than, say, UTF-16: a typical CJK character is three bytes in UTF-8 versus two in UTF-16. But if you don't plan on going Chinese, or even if you do go Chinese, UTF-8 is fine.

Really, the only con of using UTF-8 is that it takes more space than various other encodings, and for Western languages it takes almost no extra space at all: ASCII is exactly one byte per character, and only the occasional special character costs more. Those extra bytes you can live with. We are in 2009, after all. ;)
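If you want to check the space claim yourself, here is a minimal Python sketch (the example strings are mine; exact counts vary with the text):

    english = "character encoding"
    chinese = "字符编码"  # "character encoding" in Chinese

    for label, text in (("English", english), ("Chinese", chinese)):
        print(label,
              "utf-8:", len(text.encode("utf-8")), "bytes,",
              "utf-16:", len(text.encode("utf-16-le")), "bytes")

For the English string, UTF-8 needs half the bytes of UTF-16; for the Chinese string, it needs half again as many.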

  • Strictly speaking that's not the only con. Another con is that it's a variable-length encoding and some old code still stumbles across that fact. – Joachim Sauer Dec 12 '09 at 18:01
  • Yes, but as I said, I'm speaking about utf-8 on the web, and not in programming. ;) – Tor Valamo Dec 12 '09 at 18:23
  • @Joachim Sauer, either you support the encoding or you don't. Yes, all ASCII is valid UTF-8, but why would one expect to successfully decode UTF-8 using an ASCII decoder? – Peter Sep 23 '11 at 06:11
  • UTF-8 is widely supported on the web, but does UTF-8 support all characters (e.g. across different languages)? I was just wondering. – kta Feb 11 '14 at 03:15
  • @kta - yes, UTF-8 can encode the entire Unicode repertoire, and new (albeit obscure) scripts are added annually. A UTF-8 sequence originally ran up to 6 bytes per character; the bits left over after the per-byte positional metadata allowed for up to 2^31 code points. The current standard caps it at 4 bytes, which still covers every code point up to U+10FFFF. – Tor Valamo Mar 21 '14 at 18:50
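The lead-byte/continuation-byte layout mentioned in that comment is easy to inspect yourself; a small Python sketch (`𐍈` is just an arbitrary character outside the Basic Multilingual Plane):

    # Lead bytes announce the length of the sequence (0xxxxxxx, 110xxxxx,
    # 1110xxxx or 11110xxx); continuation bytes always look like 10xxxxxx.
    for ch in ("A", "é", "中", "𐍈"):
        print(ch, [format(b, "08b") for b in ch.encode("utf-8")])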

If you want world domination, use UTF-8 all the way: it covers every human character available in the world, including Asian, Cyrillic, Hebrew, Arabic, Greek and so on, while ISO-8859-1 is restricted to a handful of Latin-based languages. You don't want to have mojibake.
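Mojibake is exactly what happens when bytes written in one encoding are read back in another. A minimal Python sketch of the failure mode:

    text = "naïve café"
    garbled = text.encode("utf-8").decode("iso-8859-1")
    print(garbled)  # naÃ¯ve cafÃ© - the classic symptom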

  • But if some character doesn't show up in UTF-8 on the website, should I change the charset from UTF-8 to ISO-8859 just for that character, or is there another solution? – Jitendra Vyas Dec 12 '09 at 16:50
  • @BalusC, actually you have to go to UTF-16 to be able to cover "every human character available in the world." – Rob Wells Dec 12 '09 at 16:50
  • @Rob Wells - So should we use UTF-16? – Jitendra Vyas Dec 12 '09 at 16:52
  • @Rob: No, UTF-8 has every human character. The only difference is that UTF-16 saves space on languages such as Chinese, because of where their code points sit. UTF-16 is a very fragile charset, because it doesn't notice when there is an error in the stream. – Tor Valamo Dec 12 '09 at 16:53
  • This is a very rare case, as UTF-8 covers the same codepoints as the characters of ISO-8859-1 (but NOT all of the other ISO-8859-x sets!). Just use UTF-8 all the way and convert the "bad" characters if necessary. In terms of web development you need to ensure at least the following: 1) save source code files in UTF-8; 2) set the HTTP response header to UTF-8; 3) set the HTTP request header to UTF-8 (if not set by the client yet); 4) set database tables to UTF-8. – BalusC Dec 12 '09 at 16:53
  • Oh, and 5) read/write local text files using UTF-8. I am not sure what your target language is, but if it is Java, you can find more background information, practical examples and detailed solutions here: http://balusc.blogspot.com/2009/05/unicode-how-to-get-characters-right.html – BalusC Dec 12 '09 at 16:58
  • @BalusC: No, ALL ISO-8859-x characters, for any value of x, are also Unicode characters. All Unicode characters have a number/codepoint, and UTF-8 is just a variable-length encoding of that number. Therefore it follows that all of the +/- 800 characters in the different ISO-8859-x encodings have a UTF-8 encoding. – MSalters Jan 26 '10 at 16:17
  • @MSalters: Uh, that wasn't the point. I was talking about the **character** which is represented by the codepoint. – BalusC Jan 26 '10 at 16:52
  • In that case UTF-8 is irrelevant; it's merely an encoding. It encodes all Unicode characters. Each ISO-8859-x character set is a 256 character subset of Unicode; therefore each character from any ISO-8859-x has a Unicode codepoint, and therefore a UTF-8 encoding. This directly contradicts your "UTF-8 covers the same codepoints as the characters of ISO-8859-1 (but NOT all of the other ISO-8859-x sets!)" statement. If you still doubt it, please name 1 character from any ISO 8859 that is "not covered by UTF-8". – MSalters Jan 27 '10 at 08:35
  • @MSalters: This is a misunderstanding. The characters which are represented by the codepoints in ISO-8859-1 are exactly the same as in UTF-8. In for example ISO-8859-15 however, eight codepoints got a different character. E.g. codepoint `0xA4` got the euro sign `€` instead of the generic currency sign `¤`. – BalusC Jan 27 '10 at 11:24
  • Ohoh. You're sorely mistaken then about UTF-8. To use your same example, 0xA4 is NOT a valid UTF-8 character. It can be the second, third or fourth byte of a UTF-8 character. For instance U+20A4 `₤` is the three-byte UTF-8 sequence 0xE2,0x82,0xA4, and the currency sign U+00A4 `¤` is the two-byte UTF-8 sequence 0xC2,0xA4. (It's a coincidence that the 0xA4 repeats; U+00E4 is NOT 0xC2,0xE4, for instance.) – MSalters Jan 27 '10 at 11:52
  • Sigh. Yes, I know that UTF-8 is multibyte, but I wasn't talking about that at all. – BalusC Jan 27 '10 at 12:04
  • Well, that's the defining characteristic of the UTF-8 encoding. Similarly, UTF-16 is a multi-word encoding of Unicode. And UTF-8 being **multi** byte is precisely why it's possible that it covers all ISO-8859-x single-byte characters sets, not just -1 - see your comment on Dec 12th. – MSalters Jan 28 '10 at 09:03
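For reference, the 0xA4 exchange above can be verified in a few lines of Python (both codecs ship with the standard library):

    raw = b"\xa4"
    print(raw.decode("iso-8859-1"))   # ¤ - generic currency sign
    print(raw.decode("iso-8859-15"))  # € - same byte, different character
    print("¤".encode("utf-8"))        # b'\xc2\xa4' - two bytes in UTF-8
    print("€".encode("utf-8"))        # b'\xe2\x82\xac' - three bytes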

I find ISO 8859-1 very useful on a couple of sites where clients send me text files created in Word or Publisher: I can insert the text into the middle of PHP code and not worry about it, especially where quotes are concerned.

These are local, U.S. companies, and there is literally no other difference in the text on those pages, so I see no disadvantage in using that character set there. All other pages are UTF-8.
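One caveat worth knowing: text saved from Word is usually Windows-1252 rather than strict ISO-8859-1; the two agree everywhere except the 0x80-0x9F range, which is exactly where Word's smart quotes live. A Python sketch of the difference (the byte values are what Word typically emits):

    raw = b"\x93quoted\x94"  # smart quotes as Word usually saves them
    print(raw.decode("windows-1252"))  # "quoted" with curly quotes
    print(raw.decode("iso-8859-1"))    # C1 control characters, not quotes
    # raw.decode("utf-8") would raise UnicodeDecodeError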

  • ISO 8859-1 is a great encoding to use when space is at a premium and you are only ever going to encode characters from the basic Latin languages it supports. And you are never, ever, ever going to have to contemplate upgrading your application to support non-Latin languages.

  • UTF-8 is a fantastic way to (a) reuse the large code base of 8-bits-per-character libraries that already exists, or (b) be a euro snob. UTF-8 encodes standard ASCII in one byte per character and the rest of Latin-1 (along with Greek, Cyrillic and most other alphabets) in two bytes, while CJK scripts get three bytes per character. It goes up to four bytes per character when you encode ancient or exotic scripts that don't exist in the Basic Multilingual Plane. (Byte counts for each case are compared in the sketch after this list.)

  • UTF-16 is a great way to start a new codebase from scratch. It's completely culture neutral - everyone gets a fair-handed two bytes per character. It does need four bytes per character for ancient/exotic scripts, which means that in the worst case it's as bad as its big brother:

  • UTF-32 is a waste of space.
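A sketch comparing the three on the cases named above (Python; the byte counts apply to these example strings only):

    samples = {"ASCII": "hello", "Latin": "héllo",
               "CJK": "統一碼", "Gothic": "𐍈𐍈"}
    for label, s in samples.items():
        for enc in ("utf-8", "utf-16-le", "utf-32-le"):
            print(label, enc, len(s.encode(enc)), "bytes")

For the Gothic sample (outside the Basic Multilingual Plane) all three encodings need 8 bytes, which is the "as bad as its big brother" case.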

  • utf16 is *culture neutral*? Everyone gets a *fair-handed 2 bytes*? Rather than overlaying cultural value judgments onto the discussion, why not keep it to a concise cost/benefit analysis? To wit: if the characters being encoded are primarily ASCII or Latin, then UTF-16 is a waste of space. If not, then not. Whether it is a "new codebase" is irrelevant. – Cheeso Dec 12 '09 at 17:34
  • UTF-16 has the advantage that you can move a cursor backwards in it. That shouldn't be neglected. – nes1983 Dec 12 '09 at 17:42
  • UTF-16 is a very bad web encoding, because it is extremely incompatible with any other encoding, and if there is an error in the byte stream it will not register it; it keeps going as if nothing happened, causing every subsequent character to be plain wrong. Even one missing byte does this. – Tor Valamo Dec 12 '09 at 17:46
  • UTF-16 is "completely culture neutral - everyone gets a fair handed 2 bytes per character", except those cultures for which you need 4 bytes per character? Is this a parody of Orwell? :-) – Ken Dec 12 '09 at 17:47
  • UTF-32 (and related schemes) takes more space, but less time: random access is O(1), which is why many languages that support full Unicode characters tend to use this internally. – Ken Dec 12 '09 at 17:58
  • Niko: advantage over what? Can't you move a cursor backwards in UTF-8 and UTF-32, also? – Ken Dec 12 '09 at 18:01
  • @Ken: do they? Both Java and .NET use UTF-16. They don't use UTF-32! – Joachim Sauer Dec 12 '09 at 18:01
  • By the way: ISO-8859-1 isn't even enough when you only need latin languages. It doesn't support the Euro sign €, which is pretty darn important. For that you'd need to go to ISO-8859-15 (or better yet: an encoding that can represent all Unicode codepoints such as the UTF-* family) – Joachim Sauer Dec 12 '09 at 18:02
  • @Ken: in UTF-32 you can, but not in UTF-8, because UTF-8 is a variable-length code: http://en.wikipedia.org/wiki/Variable-length_code. You can only go forward in UTF-8. – nes1983 Dec 12 '09 at 18:35
  • @Niko: You can go backwards in UTF-8 as well, you just need to go back so many bytes until the most significant bit is again 0 (which means it's the last byte of the character). Slightly modified if the endianness is different. And moving forward (not appending) has that same problem if the endianness is different. But it's very solvable. – Tor Valamo Dec 12 '09 at 21:26
  • I should probably have said culture agnostic. I just mean that, with a web site, it's very easy for English speakers especially to assume that all users will be happy being restricted to Latin-1. – Chris Becke Dec 12 '09 at 22:48
  • @Tor, actually you need to check the 2 most significant bits. if `b & 0xC0 == 0x80` then it's a continuation byte, all others are either a lead byte or invalid. Also, UTF-8 encodes/decodes exactly the same regardless of endianness. – Brian Reichle Feb 22 '14 at 05:30
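Brian Reichle's test (`b & 0xC0 == 0x80` marks a continuation byte) is all you need to step a cursor backwards through UTF-8. A minimal Python sketch (the function name is mine):

    def prev_char_start(buf: bytes, i: int) -> int:
        """Index of the first byte of the character before position i."""
        i -= 1
        while i > 0 and buf[i] & 0xC0 == 0x80:  # skip continuation bytes
            i -= 1
        return i

    buf = "a中b".encode("utf-8")               # b'a\xe4\xb8\xadb'
    print(prev_char_start(buf, len(buf)))      # 4 -> start of 'b'
    print(prev_char_start(buf, len(buf) - 1))  # 1 -> start of '中'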