Why do we need metadata information specifying the encoding?

Question

This feels like a chicken-and-egg problem: if I write an HTML meta tag specifying the charset as, say, UTF-16, how do we decode the HTTP response in the first place if we don't already know it's UTF-16 data? I believe the HTTP header needs to handle this, so that by the time we read metadata such as the HTML tag charset="utf-16" we already know it's UTF-16. And thinking one level higher: is header information, such as the request headers, passed in ASCII as a standard?

At some level we need to agree on something: you can't put the data needed for decoding into metadata that itself has to be decoded. Can anyone clarify this? I am confused by the idea of specifying the information needed to interpret the whole data as metadata inside that same data.

In general, how can any form of encoding work if we don't have a standard, agreed-upon language/encoding to convey the data about the data itself?

For example, I am told that Apache uses ISO-8859-1 as its default. So would all clients need to use that for the HTTP headers and interpret the actual content as UTF-8 if we want UTF-8 for the Content-Type?

"What character encoding should I use for a HTTP header?" is a closely related question.

score 1 · Accepted Answer · answered Sep 30 '14 at 12:38

UTF-16 (and some other) encodings can use a BOM (byte order mark) that is read at the start of the file and signals which encoding is being used. The encoded content of the file begins only after it.

For example, for UTF-16, you'll have the bytes FE FF if big-endian and FF FE if little-endian words are being used.

You also often see UTF-8 BOMs, although they don't need to be used (and may confuse some XML parsers).
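
As a rough illustration of the BOM check described above, here is a minimal Python sketch. The function name `sniff_bom` is made up for this example, and it only looks for the UTF-8 and UTF-16 BOMs mentioned in this answer, so it is not a complete detector.

```python
import codecs

# BOMs mentioned in the answer, paired with the encoding they signal.
# This list is deliberately limited to UTF-8 and UTF-16 (an assumption of the sketch).
BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


# Usage: strip the BOM, then decode the rest with the encoding it announced.
raw = codecs.BOM_UTF16_LE + "héllo".encode("utf-16-le")
encoding, skip = sniff_bom(raw)
text = raw[skip:].decode(encoding)
```

The point is that the reader never needs to know the encoding in advance: it compares raw bytes, not characters, and only starts decoding once the BOM has told it how.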

Tim Pietzcker

So what is the standard these days? Do browsers and Apache agree to use UTF-8 for everything, or do headers continue to use ISO-8859-1 while the actual content is expected to be UTF-8 when that is declared, or even by default? – Nishant Sep 30 '14 at 12:40

@Nishant: UTF-8 uses single-byte code units and includes ASCII as a subset, so the line where the encoding is declared is effectively plain ASCII. Save your file as UTF-8, and make sure to declare the encoding at its start. – Tim Pietzcker Sep 30 '14 at 12:42
When you say file, which file do you mean? Static HTML is hardly used these days; almost everything we deal with is dynamic, right? And to get the file, the browser first has to make a request, which I guess is in Latin-1 as per the web server configuration. I am assuming there is a standard that browsers adhere to, otherwise some browsers might not open some sites. Can you clarify? It would also help if you could answer this as part of the original answer, because it is one of the questions I have. – Nishant Sep 30 '14 at 12:59

The HTTP header has priority. If it says a particular charset, that is the charset used. The HTTP 1.1 spec requires that. If there is no HTTP header charset, the HTML header charset is used, if specified. Without a BOM or an HTTP header charset, a single-byte charset must be assumed so the HTML header can be parsed to discover its charset (as long as the HTML really is single byte encoded, otherwise a BOM or an HTTP header charset MUST be used). – Remy Lebeau Sep 30 '14 at 21:44
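
The lookup order described in this thread (HTTP header first, then BOM, then the HTML declaration) could be sketched in Python roughly as follows. All function names are hypothetical, the regular expressions are simplified stand-ins for real header and HTML parsing, and actual browsers follow a more detailed algorithm than this.

```python
import codecs
import re


def charset_from_content_type(header):
    """Extract charset from a Content-Type header value, if present."""
    if not header:
        return None
    m = re.search(r'charset\s*=\s*"?([^\s";]+)', header, re.IGNORECASE)
    return m.group(1).lower() if m else None


def charset_from_bom(body):
    """Map a leading UTF-8 or UTF-16 BOM to an encoding name."""
    if body.startswith(codecs.BOM_UTF8):
        return "utf-8"
    if body.startswith(codecs.BOM_UTF16_BE):
        return "utf-16-be"
    if body.startswith(codecs.BOM_UTF16_LE):
        return "utf-16-le"
    return None


def charset_from_meta(body):
    """Scan the first bytes as a single-byte charset to find a meta declaration."""
    # Latin-1 maps every byte to a character, so this decode cannot fail.
    head = body[:1024].decode("latin-1")
    m = re.search(r'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)',
                  head, re.IGNORECASE)
    return m.group(1).lower() if m else None


def pick_charset(content_type, body):
    """Apply the sources in the priority order given in the comment above."""
    return (charset_from_content_type(content_type)
            or charset_from_bom(body)
            or charset_from_meta(body))
```

The meta scan only works because the declaration itself is written in ASCII-range characters, which is why a UTF-16 page without a BOM or an HTTP header charset cannot be discovered this way.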
