
Say I have an XML file stored on a remote computer. I don't know how the file was saved (i.e., what encoding was used).

I want to read this file and perform some operations on that XML.

But then I thought: OK, how would I be able to read the encoding part of <?xml version="1.0" encoding="xxxxx"?> if I don't know how to interpret the bytes on the hard drive in the first place?

After a small discussion with Jon I was told that the encoding can be automatically inferred between UTF-8 and UTF-16, and that those are the only ones the XML specification says are okay to leave out.

Which led me to ask: what about other encodings? If that XML was saved in encoding-lala, how would I be able to know it?

Jon referred me to the W3C article, and there I did find an answer:

The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use—which is what the internal label is trying to indicate.

It does this as follows:

Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be <?xml, any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, < is #x0000003C and ? is #x0000003F, and the Byte Order Mark required of UTF-16 data streams is #xFEFF.

So it uses heuristics to detect the encoding, by trying to find the byte patterns of the <?xml string.
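
To make that concrete, below is a minimal sketch of that guessing step in Python. It is not a conforming parser: it only covers the BOM and 8/16-bit cases from the spec's table (the UCS-4 cases are omitted), and the function name is mine.

    def sniff_declaration_codec(data: bytes) -> str:
        """Return a provisional codec, good enough to read the XML declaration."""
        # With a byte order mark, the family is unambiguous.
        if data.startswith(b"\xef\xbb\xbf"):
            return "utf-8"      # UTF-8 BOM
        if data.startswith(b"\xff\xfe"):
            return "utf-16-le"  # UTF-16 little-endian BOM
        if data.startswith(b"\xfe\xff"):
            return "utf-16-be"  # UTF-16 big-endian BOM
        # Without a BOM, look for the '<?' / '<?xm' byte patterns.
        if data.startswith(b"\x00\x3c\x00\x3f"):
            return "utf-16-be"  # 16-bit big-endian code units
        if data.startswith(b"\x3c\x00\x3f\x00"):
            return "utf-16-le"  # 16-bit little-endian code units
        if data.startswith(b"\x3c\x3f\x78\x6d"):
            return "ascii"      # any ASCII-compatible 8-bit family
        # No declaration readable: the spec default is UTF-8.
        return "utf-8"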

Another piece of information that helps is the structure of the encoding declaration:

Notice the grammar: the whole declaration consists of basic ASCII (0..127) characters, and the encoding name itself is restricted to Latin letters, digits and a few punctuation characters:

    [80] EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
    [81] EncName      ::= [A-Za-z] ([A-Za-z0-9._] | '-')*   /* only Latin characters */
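
For illustration, once the first bytes have been provisionally decoded to characters, pulling the name out is simple. Here is a loose transcription of EncName as a Python regex; ENC_DECL and declared_encoding are my own (hypothetical) names:

    import re

    # The character class mirrors the EncName production above.
    ENC_DECL = re.compile(r"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']""")

    def declared_encoding(declaration):
        """Return the declared encoding name, or None if there is none."""
        match = ENC_DECL.search(declaration)
        return match.group(1) if match else None

    print(declared_encoding('<?xml version="1.0" encoding="ISO-8859-1"?>'))
    # -> ISO-8859-1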

So here is my question:

Even if the file is saved as UTF-8, UTF-16 or anything else, the parser DOES SUCCEED in recognizing the encoding from the first bytes (heuristics or not).

If so, why is <?xml version="1.0" encoding="xxxxx"?> still needed?

Royi Namir
  • @Tomalak And those ASCII chars, in which encoding were they saved? See the table here: http://www.w3.org/TR/xml/#sec-guessing (also, please read my discussion with Jon). – Royi Namir Jan 26 '14 at 08:14
  • 1
    I understand your question now, forget my comment. – Tomalak Jan 26 '14 at 08:43

2 Answers


It is needed because the heuristic cannot always fully determine what the encoding is. For instance, for a sequence without a byte order mark that starts 00 3C 00 3F, the spec says the encoding is:

UTF-16BE or big-endian ISO-10646-UCS-2 or other encoding with a 16-bit code unit in big-endian order and ASCII characters encoded as ASCII values (the encoding declaration must be read to determine which)

(Emphasis added.)

Actually, without a byte order mark, it looks like the encoding declaration must be read in all cases except the Other one; it's just not made very prominent in the text of the spec.

In cases where the heuristic is not enough for a complete determination, it is nevertheless enough for the parser to adjust its decoding just enough to be able to read the encoding declaration and make a final determination of the encoding. (The spec actually says as much.)
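
A rough demonstration of the quoted case (the document text here is made up): a UTF-16BE file without a byte order mark begins with exactly those four octets, and provisionally decoding 16-bit big-endian code units is already enough to read the ASCII-only declaration.

    doc = '<?xml version="1.0" encoding="UTF-16BE"?><root/>'.encode("utf-16-be")
    print(doc[:4].hex(" "))  # -> 00 3c 00 3f

    # Decode up to and including the first '>' (00 3E in this family).
    declaration = doc[: doc.index(b"\x00>") + 2].decode("utf-16-be")
    print(declaration)  # -> <?xml version="1.0" encoding="UTF-16BE"?>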

Louis

You need two encodings to read an XML file (I will not mention the BOM, which is just another hint that simplifies things):

1) The first encoding is used to read the XML declaration. It's more of a byte-oriented encoding, because you only need to read US-ASCII characters: you have a bunch of bytes, and you need to read a bunch of ASCII characters.

Note that this works because encoding names can only contain US-ASCII characters (IANA Character Sets). For example, at that stage you don't really need to differentiate between UTF-8 and US-ASCII, because they encode ASCII characters the same way.

So the number of encodings to test here is limited, because you focus on byte -> ASCII character conversion (1 byte -> 1 char, 2 bytes -> 1 char, 4 bytes -> 1 char, etc.), not on the whole Unicode set. The encoding you use here may not be the one used for the rest of the file.

At that point for example, you will not be able to differentiate a file using the Windows-1252 encoding from a file using the ISO-8859-1 encoding. For this you need to read the encoding name.

2) The second encoding is used to read the rest of the file.
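
To put the two phases together, here is a rough, self-contained sketch of this two-encoding approach in Python (decode_xml is a hypothetical helper, the UCS-4 cases are omitted, and real parsers are much stricter about errors):

    import codecs
    import re

    def decode_xml(xml_bytes):
        """Two-phase decode: a provisional codec for the declaration,
        then the declared codec for the whole file."""
        # Phase 1: pick a provisional, byte-oriented codec from a BOM
        # or from the '<?xml' byte pattern.
        if xml_bytes.startswith(codecs.BOM_UTF16_LE) or xml_bytes.startswith(b"<\x00"):
            provisional = "utf-16-le"
        elif xml_bytes.startswith(codecs.BOM_UTF16_BE) or xml_bytes.startswith(b"\x00<"):
            provisional = "utf-16-be"
        else:
            provisional = "utf-8"  # ASCII-compatible, and the spec default
        # Decode just enough text to see the declaration.
        prefix = xml_bytes[:200].decode(provisional, errors="replace")
        match = re.search(r"""encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._\-]*)["']""", prefix)
        # Phase 2: the declared name wins; otherwise keep the guess.
        final = match.group(1) if match else provisional
        return xml_bytes.decode(final).lstrip("\ufeff")  # drop a leading BOM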

Simon Mourier
  • I asked this in a comment 2 min ago, I'll paste it here again (:-)) --- Still, a small question: I'm the parser. I've read the first 3 bytes. I'm doing heuristics until I reach ` – Royi Namir Jan 29 '14 at 07:40
  • You can have an encoding name that is not consistent with the first bytes of the file (for example, if the first bytes were encoded one byte per char and the encoding says UTF-16). In this case, the XML file is incorrect. – Simon Mourier Jan 29 '14 at 07:42
  • This key has limited power; it's only good for reading the encoding name, if you will. – Simon Mourier Jan 29 '14 at 07:53