1649

To declare the character encoding of an HTML5 document, which notation should I use?

  1. Short:

    <meta charset="utf-8" /> 
    
  2. Long:

    <meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
    
CuriousMind
  • 110
    Using a tag for something like content-type and encoding is highly ironic, since without knowing those things, you couldn't parse the file to get the value of the meta tag. – Mark Jan 14 '11 at 22:24
  • 334
    You can parse it as ASCII until you reach it. The HTML5 parsing algorithm takes this into account. – Quentin Jan 14 '11 at 22:25
  • 46
    It should be noted that neither is used for parsing when the page is served over the web. Instead, the one in the HTTP `Content-Type` response header will be used. The meta tag is only used when the page is loaded from the local disk file system. – BalusC Jan 14 '11 at 22:31
  • 39
    The meta element is used over HTTP under certain conditions (including an absence of the data being in the HTTP header) – Quentin Jan 14 '11 at 23:16
  • 5
    If your HTML files are destined for Kindle e-books, you'll need to use the `http-equiv` version. –  Nov 05 '12 at 03:36
  • 82
    It is also ironic that it is named charset, when it really is for specifying an encoding. (the charset is Unicode, the encoding is UTF-8) – Ryan Mar 20 '13 at 15:02
  • 3
    Although its not required for HTML5, it's more an XHTML thing, Consider closing the elements, ie . Avoids lots of warnings in certain editors for elements that are not Void elements (
    etc).
    – Rob Von Nesselrode May 17 '13 at 04:36
  • 6
    @Quentin: And if, for some strange reason, you want to encode your page in UTF-16 or UTF-32? I agree with Mark, the concept of using encoded data to describe its own encoding is silly, though we can usually get away with it here. But I think it's there partially because the server ultimately will have the same problem, unless the server has some other means of identifying/enforcing encoding. – lyngvi Oct 16 '13 at 03:57
  • Using the long declaration for XHTML 1.0 strict works as expected. – RealDeal_EE'18 Nov 25 '13 at 23:40
  • 11
    Best practice is for the meta charset tag to be the first tag in the head per http://www.joelonsoftware.com/articles/Unicode.html and https://code.google.com/p/doctype-mirror/wiki/MetaCharsetAttribute. Basically, it needs to appear in the first 512 bytes, as early as possible, then the document will be parsed with the correct encoding. – BF4 Dec 06 '13 at 21:42
  • 1
    @Quentin Exactly. That's why the content-type element is required to be within the first 100 bytes of the document. – jackvsworld Jan 31 '14 at 20:57
  • as of php 5.4.22 `DOMDocument` does not get the long one :( – Timo Huovinen Mar 13 '14 at 07:34
  • 1
    Is there any harm in specifying both Content-Type: text/html; charset="utf-8" as an HTTP Header and having a meta tag on page (ie: )? I don't know if my hosting company adds the HTTP Header to specify UTF-8 and I have the meta tag on my pages. Didn't know if both was any issue –  Aug 12 '15 at 18:18
  • 3
    The **very best thing** to do, would actually be to ignore all this headers, meta-tags nonsense and **use the Unicode BOM**. The unicode BOM is standardized at the lowest possible level, the Unicode spec itself and *should* therefore work *everywhere* instead of just in (X)HTML or over HTTP. It would work for scripts, stylesheets, text/plain documents, over HTTP, TCP, mail, you name it. The only problem is that some legacy software chokes on the BOM... But... If we all just start to use it we force the vendors to fix it. – Stijn de Witt Dec 18 '15 at 16:23
  • 5
    @StijndeWitt: And how, exactly, will the Unicode BOM help you if you need to support other encodings, such as ISO-XXX, or Japanese encodings? Also, while the BOM is standardized, the standard actually advises against using a BOM with UTF-8; see e.g. the answer to [What's different between UTF-8 and UTF-8 without BOM?](http://stackoverflow.com/questions/2223882/whats-different-between-utf-8-and-utf-8-without-bom/2223926). – sleske Aug 30 '16 at 11:26
  • 1
    @sleske I think the standard's authors felt, at the time they wrote that faq, that using UTF-8 without BOM would give the best interoperability with old software, because it would match ASCII. But we are over a decade further now and UTF8 support is virtually ubiquitous. I stand by my comment that the BOM is the best place to store the encoding, because it survives over network, file systems and even databases. I still add HTTP headers and even a meta tag though. – Stijn de Witt Sep 01 '16 at 08:39
  • UTF-8 does not have a BOM, as there is only one byte order (no big/little endian), and because ASCII is UTF-8 and the BOM is not ASCII. This will break pages that are just ASCII. (Some systems use ASCII/UTF-8, and adding a BOM will break some old software.) These systems have built on the old to produce a very good and robust system, with no need to throw out the old every time a new feature is added. – ctrl-alt-delor Nov 27 '16 at 14:33
  • 2
    UTF-8 **does** have a BOM. Its purpose is not to determine byte order; it serves to establish that the encoding used is UTF-8. *"UTF-8 can contain a BOM. However, it makes no difference as to the endianness of the byte stream. UTF-8 always has the same byte order. An initial BOM is only used as a signature — an indication that an otherwise unmarked text file is in UTF-8."* http://unicode.org/faq/utf_bom.html#bom5 – Stijn de Witt Nov 02 '17 at 09:36
  • 1
    Also, do note that ASCII is a subset of UTF-8, but the reverse is obviously not true. So if your text only contains ASCII, leave out the BOM (making it effectively ASCII). As soon as your text may contain non-ASCII characters, backward compatibility is broken anyway and you should add a BOM. – Stijn de Witt Nov 02 '17 at 09:39
  • 2
    One reason HTML files have an encoding even though supposedly http is supposed to specify the encoding is the majority of users don't have control of their servers. Rather than the boil the ocean solution of requiring every server to somehow allow users to specify an encoding for every file served it became clear that users needed a way to specify the encoding in the file itself. As for bom in utf-8 tons of software fails with it even in 2019. Whether or not there is some engineering ideal the pragmatic solutions are charset in HTML file, no bom ever for utf-8 for any file ever. – gman Feb 25 '19 at 06:52

8 Answers

1140

In HTML5, they are equivalent. Use the shorter one, as it is easier to remember and type. Browser support is fine since it was designed for backwards compatibility.
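
To illustrate the equivalence, here is a minimal sketch in Python that reads the declared charset from either notation. It uses the standard library's `html.parser`; the names `MetaCharsetFinder` and `declared_charset` are made up for this example, and the logic is a simplification, not the HTML5 parsing algorithm itself.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Collects the charset declared by either <meta> notation."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        # Only the first declaration counts in this sketch.
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short form: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
        elif (attributes.get("http-equiv") or "").lower() == "content-type":
            # Long form: the charset is a parameter inside content="..."
            content = attributes.get("content") or ""
            for part in content.split(";"):
                name, _, value = part.strip().partition("=")
                if name.lower() == "charset":
                    self.charset = value.strip().lower()


def declared_charset(markup):
    """Return the charset declared in the markup, or None."""
    finder = MetaCharsetFinder()
    finder.feed(markup)
    return finder.charset
```

Both `declared_charset('<meta charset="utf-8" />')` and the same call with the long `http-equiv` form return the same value, which is the sense in which the two notations are interchangeable.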

Peter Mortensen
Quentin
  • 30
    What about browser support? Does `` work in IE6? – Šime Vidas Jan 14 '11 at 22:13
  • 4
    Here is an updated link for the [Google Code page](http://code.google.com/p/doctype-mirror/wiki/MetaCharsetAttribute) that @Šime Vidas mentioned. It says, regarding IE 6, 7, and 8, "In non-IE browsers, you can use document.characterSet. In IE, you might think you could document.getElementsByTagName('meta')[0].charset, but this only returns the character encoding you specified, not the encoding that IE is actually using." – hotshot309 Jun 05 '12 at 13:51
  • 8
    I know this thread is old, but http://gtmetrix.com/specify-a-character-set-early.html indicates using `` to set the character encoding disables the lookahead downloader in IE8, which can impact your page load times. Yeah, yeah, I know... drop IE8. @MészárosLajos can come back here in a couple of years and bust our balls for still supporting IE8. ;-) – erturne Mar 05 '14 at 02:38
  • 2
    https://developer.mozilla.org/en-US/docs/Web/Guide/HTML/Obsolete_things_to_avoid was a nice confirmation of this answer for me. – Brendan Feb 05 '15 at 15:32
  • 5
    Today I had an issue where Korean symbols weren't appearing in IE11. Dropping the short syntax in favour of the longer syntax fixed the issue. I don't know if this is due to some kind of server config though or if it is an issue with IE11 and the charset. The exact symbol combination it was failing on was 베라. – James Donnelly Mar 05 '15 at 22:39
  • Out with the old, in with the new. Demand change for the better. It is easier, it does the same thing, and if you're living in a cave with old technology... TOO BAD! Demand change for the better. – Chef_Code Aug 03 '15 at 21:16
  • 1
    I have found that Chrome prefers the "Long" form and Firefox prefers the "Short" form, and their preferences are mutually exclusive. I found this with UTF-8 inside SVG. The "Long" form on an HTML5 doctype didn't work in Firefox, and the "Short" form on an HTML5 doctype didn't work in Chrome, I had to use both to get both browsers to work. – derekm Sep 15 '15 at 20:41
  • 1
    And today I stumbled upon an Excel spreadsheet, generated from a template with the short syntax, that was broken when generated on a Linux server; a local Windows machine did well. Changing to the long syntax fixed the encoding in the output file – zakius Oct 16 '15 at 12:17
  • Why is `charset` in the meta tag important? Where is it used, and what is the advantage of `charset` in HTML? – 151291 Mar 12 '16 at 06:45
  • When in doubt, I would generally go for the simpler option. But since people are reporting problems with each option, why not just have both? – Rolf Oct 12 '16 at 11:33
  • Use the longer one when parsing the doc on the server side or serving it for servers as these are often outdated. – Timo Huovinen Apr 19 '18 at 13:36
258

Both forms of the meta charset declaration are equivalent and should work the same across browsers. But there are a few things you need to remember when declaring your web files' character set as UTF-8:

  1. Save your file(s) in UTF-8 encoding without the byte-order mark (BOM).
  2. Declare the encoding in your HTML files using meta charset (like above).
  3. Your web server must serve your files, declaring the UTF-8 encoding in the Content-Type HTTP header.

Apache servers are often configured to serve files in ISO-8859-1 by default, so you may need to add the following line to your .htaccess file:

    AddDefaultCharset UTF-8

This will configure Apache to serve your files declaring UTF-8 encoding in the Content-Type response header, but your files must be saved in UTF-8 (without BOM) to begin with.
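
To check what a server actually declares, you can read the charset parameter out of the Content-Type header value. A minimal sketch in Python, assuming the header value has already been obtained somehow; `charset_from_content_type` is a hypothetical helper name, while `email.message.Message` comes from the standard library:

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type header value, or None."""
    message = Message()
    message["Content-Type"] = header_value
    # get_content_charset() lower-cases the value and returns None if absent.
    return message.get_content_charset()
```

With the directive above in place, the header value should look like `text/html; charset=UTF-8`, for which this helper returns `utf-8`. A bare `text/html` yields `None`, meaning the browser has to rely on the meta tag or on guessing.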

Older versions of Notepad cannot save your files in UTF-8 without the BOM. A free editor that can is Notepad++. On the program menu bar, select "Encoding > Encode in UTF-8 without BOM". You can also open files and re-save them in UTF-8 using "Encoding > Convert to UTF-8 without BOM".
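
If you are not sure whether an editor added a BOM, you can check the first three bytes of the file. A small sketch in Python (the helper names `has_utf8_bom` and `strip_utf8_bom` are invented for illustration; `codecs.BOM_UTF8` and the `utf-8-sig` codec are part of the standard library):

```python
import codecs


def has_utf8_bom(data):
    """True if the byte string starts with the UTF-8 byte-order mark."""
    return data.startswith(codecs.BOM_UTF8)


def strip_utf8_bom(data):
    """Return the bytes without a leading UTF-8 BOM, if one is present."""
    return data[len(codecs.BOM_UTF8):] if has_utf8_bom(data) else data


with_bom = "<p>é</p>".encode("utf-8-sig")   # "utf-8-sig" prepends EF BB BF
without_bom = "<p>é</p>".encode("utf-8")    # plain "utf-8" never adds a BOM
```

In practice you would read the file with `open(path, "rb")` and pass its contents to these helpers.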

More on the Byte Order Mark (BOM) at Wikipedia.

Honest Abe
CodeBoy
  • 2,613
  • 1
  • 14
  • 2
  • 21
    @CodeBoy I would amend your answer to say "You **should** save...without BOM." The following page says "...it is usually best for interoperability to omit the BOM..." indicating a best practice, but not a requirement: http://www.w3.org/International/questions/qa-byte-order-mark – Johann Jun 04 '12 at 18:49
  • 3
    In IIS you can set the charset in HTTP headers with in Web.Config - add it to – Chris Moschini Apr 20 '13 at 03:37
  • I just spent 30 mins trying to figure out why your charset tip was not working for me. You may have to rename default.html to index.html (or another file name). It seems Apache is hard set on certain defaults when it comes to default.html! – Ivan Dossev Apr 30 '13 at 08:17
  • 3
    as I understand things, it doesn't matter AT ALL if you save with or without BOM. – David 天宇 Wong Jun 23 '13 at 16:02
  • 1
    Honestly, I'd always prefer an easy to configure webserver over something like apache. @Dabbu – dom0 Aug 14 '13 at 12:03
  • Thanks! This info is helpful to me as I'm developing my live html/css/js code editor (http://liveditor.com). The last time I tried, the php parser (.dll) had issues processing files in UTF-8 with BOM - it outputs the BOM bytes! I don't understand why it can't detect the BOM... – Edwin Yip Oct 18 '13 at 07:46
  • 1
    The BOM *does* make a difference in certain contexts. It is needed when dealing with UTF-16 because [RFC 2781, section 4.3](http://tools.ietf.org/html/rfc2781), says the default encoding is big-endian, but since Windows uses little-endian by default, most software will use LE as well. To avoid any wrong interpretation of the content, the BOM comes very handy. It can be harmful in certain conditions as well, as when using PHP, the interpreter sometimes outputs the BOM and gives you errors when you try to output some HTTP Header. Summing up: don't use BOM for UTF-8; don't forget BOM for UTF-16. – diego nunes Oct 20 '13 at 07:08
  • 3
    Why do you say UTF-8 HTML should be without a BOM. Having a BOM should work fine. Also, you don't need `meta` and an HTTP header. You just need one of BOM, `meta` or HTTP header. – hsivonen Nov 28 '13 at 09:29
  • How do you get Visual Studio to stop being evil and always adding a UTF-8 BOM? For Tomcat you must also add `URIEncoding="utf-8"` to each connector. – Brett Ryan Mar 03 '14 at 19:00
  • You specify the HTML content-type, why do you use meta charset for that? I think it's redundant., right? – Chao Mar 26 '15 at 06:18
  • @Richard one issue with using the header only is that the encoding will be lost if a user saves the html file to disk. Using the meta tag only is okay, but it makes the browser do a bit of extra parsing. So I think using both should be considered a best practice, despite the redundancy. – Daniel Lubarov May 16 '15 at 23:04
  • 2
    `Why do you say UTF-8 HTML should be without a BOM` Indeed, the absence of the BOM is the very reason you would need an HTTP header or meta tag in the first place. – Stijn de Witt Aug 18 '15 at 23:45
  • 5
    `Summing up: don't use BOM for UTF-8` I can't agree with this. The BOM in UTF-8 is very useful for signaling the encoding type. Otherwise we have to guess, or use things like the meta tags this question refers to. The cool thing about the BOM is that it is part of the Unicode spec and thus can be used for all data encoded in Unicode, not just HTML. What we *should* do is use BOMs everywhere, let legacy software blow up on it, report those bugs and get them fixed. – Stijn de Witt Aug 18 '15 at 23:58
  • @StijndeWitt Not to reignite a holy war over BOM, but just a word of caution: in many cases, the BOM is often invisible to, or otherwise often overlooked by other developers. This can lead to issues if you don't explicitly bring it up to your team. One example is when serving files (such as via PHP and Apache), the BOM in a file may immediately begin the data stream, overriding any server script config/include/header lines that you mean to be parsed before transmitting any data. – Beejor May 25 '19 at 15:55
88

Another reason to go with the short one is that it matches other instances where you might specify a character set in markup. For example:

    <script type="text/javascript" charset="UTF-8" src="/script.js"></script>

    <p><a charset="UTF-8" href="http://example.com/">Example Site</a></p>

Consistency helps to reduce errors and make code more readable.

Note that the charset attribute is case-insensitive. You can use UTF-8 or utf-8, however UTF-8 is clearer, more readable, more accurate.
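
As an analogy for that case-insensitivity, Python's codec registry also resolves differently cased labels to the same encoding. This is only an illustration under that assumption: browsers use their own label-matching rules, and `same_encoding` is a hypothetical helper.

```python
import codecs


def same_encoding(label_a, label_b):
    """True if two encoding labels resolve to the same Python codec."""
    return codecs.lookup(label_a).name == codecs.lookup(label_b).name
```

Here `same_encoding("UTF-8", "utf-8")` is true, while `same_encoding("utf-8", "iso-8859-1")` is false.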

Also, there is no practical reason to use any value other than UTF-8 in the meta charset attribute or page header. Unicode has been the document character set of HTML at least since HTML4 in 1999, and UTF-8 is the only practical way to make modern Web pages.

Also, you should not use HTML entities in UTF-8 documents. Characters like the copyright symbol should be typed directly. The only entities you should use are for the five reserved markup characters: less than, greater than, ampersand, apostrophe (single quote), and quotation mark (double quote).

Entities need an HTML parser, which you may not always want to use going forward. They introduce errors, make your code less readable, increase your file sizes, and sometimes decode incorrectly in various browsers depending on which entities you used. Learn how to type/insert copyright, trademark, open quote, close quote, apostrophe, em dash, en dash, bullet, Euro, and any other characters you encounter in your content, and use those actual characters in your code.
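
To make the point about the five reserved characters concrete, here is a short sketch using Python's standard `html` module. The sample strings are invented; `html.unescape` and `html.escape` are the real library functions.

```python
import html

# Decoding: entities only become characters once something unescapes them.
decoded = html.unescape("&copy; 2011 &euro;5 &amp; up")

# Encoding: html.escape touches only the five reserved markup characters
# and leaves characters such as the copyright sign exactly as typed.
escaped = html.escape("© <b>\"Tom\" & 'Jerry'</b>")
```

The copyright sign passes through `html.escape` unchanged, which is all a UTF-8 document needs; only `<`, `>`, `&`, `"` and `'` are rewritten.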

The Mac has a Character Viewer that you can turn on in the Keyboard System Preference, and you can find and then drag and drop the characters you need, or use the matching Keyboard Viewer to see which keys to type. For example, trademark is Option + 2. UTF-8 can encode every character in Unicode, which covers the characters and symbols of virtually every written human language.

So there is no excuse for using -- instead of an em dash. It is not a bad idea to learn the rules of punctuation and typography also ... for example, knowing that a period goes inside a close quote, not outside.

"Using a <meta> tag for something like content-type and encoding is highly ironic, since without knowing those things, you couldn't parse the file to get the value of the meta tag."

No, that is not true. The browser starts out parsing the file in its default encoding, for example UTF-8 or ISO-8859-1. Since US-ASCII is a subset of both ISO-8859-1 and UTF-8, the browser can read <html><head> just fine either way ... it is the same. When the browser encounters the meta charset tag, if the encoding is different from what the browser is already using, the browser re-parses the page in the specified encoding.

That is why we put the meta charset tag at the top, right after the head tag, before anything else, even the title. That way you can use UTF-8 characters in your title.
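
The reasoning above can be checked at the byte level. This is a rough sketch in Python with an invented sample page; it only shows that the bytes up to the meta tag read the same in all three encodings, not how a real browser implements its scan.

```python
page = (
    '<!DOCTYPE html><html><head><meta charset="utf-8">'
    "<title>привет</title></head><body>привет</body></html>"
).encode("utf-8")

# Everything up to and including the meta tag is pure ASCII, so those bytes
# decode to the same text under ASCII, ISO-8859-1, and UTF-8.
meta_end = page.index(b">", page.index(b"<meta")) + 1
prefix = page[:meta_end]
as_ascii = prefix.decode("ascii")
as_latin1 = prefix.decode("iso-8859-1")
as_utf8 = prefix.decode("utf-8")

# Only the bytes after the declaration depend on the real encoding.
rest_wrong = page[meta_end:].decode("iso-8859-1")  # garbled title text
rest_right = page[meta_end:].decode("utf-8")       # intended title text
```

If the meta tag came after the title instead, the non-ASCII title bytes would already have been read in the wrong encoding before the declaration was seen.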

"You must save your file(s) in UTF-8 encoding without BOM"

That is not strictly true. If you only have US-ASCII characters in your document, you can save it as US-ASCII and serve it as UTF-8, because US-ASCII is a subset of UTF-8. But if there are non-ASCII characters, you are correct: you must save as UTF-8 without BOM.
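
A quick way to see why this works for US-ASCII but not for ISO-8859-1 is to compare the bytes. A minimal sketch in Python (`decodes_as_utf8` is a hypothetical helper name, and the sample strings are made up):

```python
text = "café"

latin1_bytes = text.encode("iso-8859-1")  # é is the single byte E9
utf8_bytes = text.encode("utf-8")         # é is the two bytes C3 A9
ascii_only = "cafe".encode("ascii")       # no byte above 7F


def decodes_as_utf8(data):
    """True if the bytes are valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

The ASCII-only bytes are valid UTF-8 as they stand, while the ISO-8859-1 bytes for "café" are not, so a file saved that way must be converted before it is served as UTF-8.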

If you want a good text editor that will save your files in UTF-8, I recommend Notepad++.

On the Mac, use Bare Bones TextWrangler (free) from Mac App Store, or Bare Bones BBEdit which is at Mac App Store for $39.99 ... very cheap for such a great tool.

In either app, there is a menu at the bottom of the document window where you specify the document encoding and you can easily choose "UTF-8 no BOM". And of course you can set that as the default for new documents in Preferences.

"But if your Webserver serves the encoding in the HTTP header, which is recommended, both [meta tags] are needless."

That is incorrect. You should of course set the encoding in the HTTP header, but you should also set it in the meta charset attribute so that the page can be saved by the user, out of the browser onto local storage and then opened again later, in which case the only indication of the encoding that will be present is the meta charset attribute.

You should also set a base tag for the same reason ... on the server, the base tag is unnecessary, but when opened from local storage, the base tag enables the page to work as if it is on the server, with all the assets in place and so on, no broken links.

    AddDefaultCharset UTF-8

Or you can just change the encoding of particular file types like so:

    AddType text/html;charset=utf-8 html

A tip for serving both UTF-8 and Latin-1 (ISO-8859-1) files is to give the UTF-8 files a "text" extension and Latin-1 files "txt."

    AddType text/plain;charset=iso-8859-1 txt
    AddType text/plain;charset=utf-8 text

Finally, consider saving your documents with Unix line endings, not legacy DOS or (classic) Mac line endings, which don't help and may hurt, especially down the line as we get further and further from those legacy systems.
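
If your editor cannot convert line endings for you, the conversion is simple enough to script. A minimal sketch in Python, where `to_unix_line_endings` is a hypothetical helper that works on raw bytes so the text encoding is left untouched:

```python
def to_unix_line_endings(data):
    """Convert CRLF (DOS) and lone CR (classic Mac) line endings to LF."""
    # Replace CRLF first so its CR is not converted a second time.
    return data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
```

You would read the file in binary mode, pass the bytes through this function, and write the result back in binary mode.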

An HTML document with valid HTML5, UTF-8 encoding, and Unix line endings is a job well done. You can share and edit and store and read and recover and rely on that document in many contexts. It's lingua franca. It's digital paper.

Peter Mortensen
Simon White
  • 905
  • 6
  • 3
  • 21
    "If you only have ISO-8859-1 characters in your document, you can Save it as ISO-8859-1 and serve it as UTF-8, because it is a subset" - incorrect. It would be correct if you change "ISO-8859-1" to "US-ASCII". US-ASCII is compatible with UTF-8 because it is a subset, ISO-8859-1 is not. To convert ISO-8859-1 (containing non-ASCII characters) to UTF-8, you would need to encode the non-ASCII characters. The code points for ISO-8859-1 do exist in Unicode, but UTF-8 encodes the ones outside of US-ASCII differently to ISO-8859-1. – thomasrutter Jun 21 '12 at 05:29
  • 2
    Your point about HTML entities is good. In the past, I've used entities only to find that they were converted to their UTF-8 characters after being saved on different systems and/or opened in different editors. It's worth noting, however, that non-breaking spaces (`&nbsp;`) can produce confusing results since you typically won't see them in your editor so are usually best to keep as entities for clarity's sake (in my experience). – squidbe Dec 07 '12 at 23:11
  • `"You should also set a base tag..."` should come with the caveats described [here](http://stackoverflow.com/questions/1889076/is-it-recommended-to-use-the-base-html-tag). – Mafuba Mar 18 '13 at 23:39
  • Another reason you might prefer HTML entities is if you're using something like [ionicons](http://ionicons.com/). I'd rather see `` than the default glyph, or some strange character I don't recognize. – Daniel Lubarov May 16 '15 at 23:17
33

<meta charset="utf-8"> was introduced with/for HTML5.

As mentioned in the documentation, both are valid. However, <meta charset="utf-8"> is only for HTML5 (and easier to type/remember).

The old style is bound to become deprecated in due time. I'd stick to the new <meta charset="utf-8">.

There's only one way, and that's up. In tech's case, that means phasing out the old (really, REALLY fast).

Documentation: HTML meta charset Attribute—W3Schools

Omar
  • 3
    Regarding the link, please see http://meta.stackoverflow.com/questions/280478/why-not-w3schools-com – tripleee Dec 17 '15 at 10:45
21

While not contesting the other answers, I think the following is worthy of mentioning.

  1. The “long” (http-equiv) notation and the “short” one are equal. Whichever comes first wins;
  2. Web server headers will override all the <meta> tags;
  3. BOM (byte order mark) will override everything, and in many cases it will affect HTML 4 (and probably other stuff, too);
  4. If you don't declare any encoding, you will probably get your text in the “fallback text encoding” that is defined by your browser. In neither Firefox nor Chrome is it UTF-8;
  5. In absence of other clues the browser will attempt to read your document as if it was in ASCII to get the encoding, so you can't use any weird encodings (UTF-16 with BOM should do, though);
  6. While the specifications say that the encoding declaration must be within the first 512 bytes of the document, most browsers will try reading more than that.

You can test by running

    echo 'HTTP/1.1 200 OK\r\nContent-type: text/html; charset=windows-1251\r\n\r\n\xef\xbb\xbf<!DOCTYPE html><html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8"><meta charset="windows-1251"><title>привет</title></head><body>привет</body></html>' | nc -lp 4500

and pointing your browser at localhost:4500. (Of course you will want to change or remove parts. The BOM part is \xef\xbb\xbf. In bash, use echo -e so that the backslash escapes are interpreted. Be wary of the encoding of your shell.)
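
The precedence described in the list above can also be written down as a small function. This is only a sketch of the order as stated in this answer (BOM, then HTTP header, then meta tag, then a fallback), not the full browser algorithm. The name `effective_encoding` and the `windows-1252` fallback are assumptions made for the example.

```python
import codecs


def effective_encoding(body, http_charset=None, meta_charset=None,
                       fallback="windows-1252"):
    """Pick an encoding using the precedence listed in this answer.

    Assumed order: BOM, then HTTP header, then <meta>, then a fallback.
    The fallback value is an assumption; browsers choose it per locale.
    """
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    # Simplification: UTF-32 BOMs are not distinguished from UTF-16 here.
    if body.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if http_charset:
        return http_charset.lower()
    if meta_charset:
        return meta_charset.lower()
    return fallback
```

For the test response above, which carries a UTF-8 BOM, a windows-1251 header and two conflicting meta tags, this sketch returns `utf-8`; remove the BOM and it returns `windows-1251`.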

Please mind that it's very important that you explicitly declare the encoding. Letting browsers guess can lead to security issues.

Peter Mortensen
squirrel
  • 1
    Good points, but can you detail which security issues are you referring to? – Armfoot Feb 04 '16 at 14:33
  • 1
    The long notation shouldn't override the short one—simply the first one in the document should win. – gsnedders Aug 18 '16 at 00:53
  • 1
    @Armfoot In the past there used to be problems with `UTF-7` from what I remember. Also sniffing on the web is generally bad, e.g. when you upload an image something which is sniffed as script content. – phk Sep 23 '16 at 16:43
  • @gsnedders tested in chrome and firefox, you're right. edited the answer accordingly. Armfoot: it was something about some 7 bit encoding, don't remember what exactly. – squirrel Oct 14 '16 at 17:20
  • "Neither in Firefox nor in Chrome it's utf-8" — What do you mean? If not utf-8, what is it then? – Craig McQueen Aug 21 '17 at 02:00
  • 1
    @CraigMcQueen pretty sure the browser fallback still (in 2018) defaults to Western European in Western Europe, so I imagine it defaults to whatever pre-unicode encoding has been dominant in each region. Users can set the fallback to utf-8 but this just exposes all the crappy encoding thousands of sites still use as glitchy high byte ascii characters all over, so it is still not common. More's the pity. Can't see how this is going to change without a little coercion from the browser vendors, and they're not keen on breaking legacy stuff. – brennanyoung Aug 13 '18 at 09:12
16

Use <meta charset="utf-8" /> for web browsers when using HTML5.

Use <meta http-equiv="Content-Type" content="text/html; charset=utf-8" /> when using HTML4 or XHTML, or for outdated DOM parsers, like DOMDocument in PHP 5.3.

Peter Mortensen
Timo Huovinen
6

To embed a signature in an email, I would use the long version:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

The reason is that not many email readers support HTML5, so it's always better to use old HTML styles. Actually, it's better to use tables than divs + CSS as well.

Peter Mortensen
chelder
2

Here is some more recent guidance from the Mozilla Foundation and SitePoint:

Do not use this value (http-equiv=content-type) as it is obsolete. Prefer the charset attribute on the <meta> element.


Peter Mortensen
user10089632
  • 3
    oh finally, something a bit more recent – Ayyash Mar 31 '20 at 05:57
  • 1
    The [XHTML1 standard](https://www.w3.org/TR/xhtml1/#C_9) says otherwise, don't fall for the Mozilla-only view on things - they don't even explain their note. Whoever dealt with XML (to understand XHTML) should know it comes with an [`encoding` parameter](https://www.w3.org/TR/xml/#NT-EncodingDecl) right away. – AmigoJack May 15 '22 at 08:53
  • Turns out the warning is not on the Mozilla Foundation website anymore? – Marcosaurios Jul 06 '23 at 12:48