In Dreamweaver I have the option "Include Unicode Signature (BOM)".

If I check this box and save the HTML file, it looks good when viewed in the web browser. If I don't, I get strange symbols in place of Swedish letters like åäö.

If I serve the HTML file that shows strange letters with the response header "Content-Type: text/html; charset=utf-8", it still gives me strange symbols.

Q1) Does that mean that it's not a UTF-8 encoded file (the one without BOM that shows strange symbols)?

Q2) What makes a file UTF-8 encoded, is it just the Unicode signature (BOM)?

Q3) Should I or should I not enable "Include Unicode Signature (BOM)" for my files (HTML, JavaScript, CSS, PHP)?

I know that I can add <meta charset="UTF-8"> to the HTML code or put AddDefaultCharset UTF-8 in my .htaccess. I just figure the optimal solution would be to send a response header that says "it's a UTF-8 encoded file" and then also actually serve a UTF-8 encoded file. Nothing else.

Q4) I thought HTML files were plain text-files. What other information is hidden in those files and how can I read this information?

MarkoHiel
user1087110
  • You need to get an understanding of the difference between ASCII and Unicode--that will probably answer all of your questions. http://stackoverflow.com/questions/19212306/difference-between-ascii-and-unicode. Just Google "difference between ASCII and Unicode" and start reading... – Tony Hinkle May 20 '15 at 12:30
  • Have a look at [_The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)_](http://www.joelonsoftware.com/articles/Unicode.html). – matt May 20 '15 at 12:33
  • [What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text](http://kunststube.net/encoding/) – deceze May 20 '15 at 12:33
  • From the article: "But that meta tag really has to be the very first thing in the <head> section because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified." Wouldn't it be better to serve a Content-Type HTTP header that says it's a UTF-8 encoded file and also serve a correctly UTF-8 encoded file? Then the browser doesn't have to start over... How can I determine that the file is UTF-8 encoded (without checking the HTTP response header from the server or looking for the inline meta tag)? – user1087110 May 20 '15 at 13:53

1 Answer

The BOM is entirely optional for UTF-8. The Unicode Consortium points out that it can create problems while offering no real advantage; the W3C says that it can be a substitute for other forms of declaring the encoding and should work in all modern browsers.

The BOM exists mainly to indicate the byte order (endianness) of an encoding. UTF-8 is a sequence of single bytes and has no byte-order ambiguity, so the BOM is superfluous there; it is really only useful for UTF-16 and UTF-32. A UTF-8 encoded file is UTF-8 encoded regardless of whether a BOM is present.
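To make this concrete, here is a minimal Python sketch (not something from the question; Python is used only for illustration) that encodes the same text with and without the signature. It relies on the standard `utf-8` and `utf-8-sig` codecs and on `codecs.BOM_UTF8`; the variable names are made up for the example.

```python
import codecs

text = "\u00e5\u00e4\u00f6"  # the Swedish letters "åäö"

# Plain UTF-8: two bytes per letter, nothing in front.
without_bom = text.encode("utf-8")

# "utf-8-sig" writes the three-byte signature EF BB BF first.
with_bom = text.encode("utf-8-sig")

print(without_bom.hex(" "))  # c3 a5 c3 a4 c3 b6
print(with_bom.hex(" "))     # ef bb bf c3 a5 c3 a4 c3 b6

# The only difference between the two files is the leading signature.
same_payload = with_bom == codecs.BOM_UTF8 + without_bom

# Both variants decode to the same text; "utf-8-sig" skips the BOM if present.
decoded_plain = without_bom.decode("utf-8-sig")
decoded_signed = with_bom.decode("utf-8-sig")
```

The letters themselves are encoded identically in both cases, which is why the presence or absence of the BOM does not decide whether a file "is" UTF-8.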

HTML files do not "hide" any other information, they're plain text.

My recommendation would be:

  • encode as UTF-8 without BOM
  • add the HTTP Content-Type header to denote the encoding of the file
  • also add the <meta> tag into the HTML itself as a fallback, should the file be interpreted outside of an HTTP context (meaning where no HTTP header exists because the file is not read over HTTP)

This gives you the best compatibility with the least potential for issues. If your characters are still appearing funny, then your file is not actually UTF-8 encoded or the HTTP header is not being set correctly.
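The two failure modes can be reproduced in a few lines. This is only a sketch under assumptions: it supposes the editor saved the file as ISO-8859-1 (Latin-1) in the first case, and that the browser fell back to Windows-1252 in the second; the actual encodings involved in the question are not known.

```python
text = "\u00e5\u00e4\u00f6"  # "åäö"

# Case 1: the file is NOT UTF-8 (assumed here: Latin-1), but the header
# or meta tag claims UTF-8. The bytes are invalid UTF-8, so the browser
# shows replacement characters (U+FFFD).
latin1_bytes = text.encode("latin-1")
declared_utf8 = latin1_bytes.decode("utf-8", errors="replace")

# Case 2: the file IS UTF-8, but no charset is declared and the browser
# assumes a legacy encoding (assumed here: Windows-1252). Each letter
# turns into two unrelated characters.
utf8_bytes = text.encode("utf-8")
assumed_legacy = utf8_bytes.decode("cp1252")

print(declared_utf8)
print(assumed_legacy)  # Ã¥Ã¤Ã¶
```

Seeing replacement characters hints at the first problem (wrong file encoding), while pairs such as "Ã¥" hint at the second (missing or wrong declaration).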

deceze
  • BOM can help the server-side software, though - the PHP/ASP.NET/... case. It has to figure out what's the correct file encoding to parse whatever inlined characters you have. Ideally, you wouldn't have any such ambiguities in plain source-code, but... Once I figured out how much BOM can help, I've never turned back... – Luaan May 20 '15 at 12:45
  • That greatly depends on the server-side software. PHP doesn't care one bit about BOMs or encodings in general, Python has a special in-file annotation for it... If a BOM is useful to you, great. But in the given context of this question I don't see any. – deceze May 20 '15 at 12:48
  • Thanks for your answer. My cache settings fooled me: the header was still "text/html" instead of "text/html;charset=utf-8", which I thought it was. Just a final clarification: how can I determine that the file is UTF-8 encoded (without checking the HTTP response header from the server or looking for the inline meta tag)? – user1087110 May 20 '15 at 14:02
  • @user It's not possible to know what encoding a piece of text is in without any accompanying metadata. If all you have is a plain text file, then the best you can do is *guess*. That pretty much means: try to open the file in some encoding and see if all characters look valid. – deceze May 20 '15 at 14:05
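
The "guess" described in the last comment can be sketched in Python as follows. This is a heuristic under stated assumptions, not a detector: the helper names `looks_like_utf8` and `has_utf8_bom` are hypothetical, and a successful decode only means the bytes are valid UTF-8, not that the author intended UTF-8.

```python
import codecs


def has_utf8_bom(data: bytes) -> bool:
    """Return True if the data starts with the UTF-8 signature EF BB BF."""
    return data.startswith(codecs.BOM_UTF8)


def looks_like_utf8(data: bytes) -> bool:
    """Return True if the bytes decode as UTF-8 without errors.

    This is a guess only: pure ASCII, for example, is valid in UTF-8
    and in most legacy encodings at the same time.
    """
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


sample = "\u00e5\u00e4\u00f6"  # "åäö"
utf8_guess = looks_like_utf8(sample.encode("utf-8"))
latin1_guess = looks_like_utf8(sample.encode("latin-1"))
ascii_guess = looks_like_utf8(b"<html></html>")
bom_present = has_utf8_bom(sample.encode("utf-8-sig"))
```

For a real file, the bytes would be read in binary mode first, for example `open(path, "rb").read()`, and then passed to these helpers.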