Say I have an XML file which is stored on a remote computer. I don't know how this file was saved (what encoding was used).
I want to read this file and do some operations on that XML.
But then I thought: OK, how would I be able to read the encoding
part from `<?xml version="1.0" encoding="xxxxx"?>`
if I don't know how to analyze the bytes on the hard drive in the first place?
After a small discussion with Jon, I was told that the encoding can be inferred automatically only between UTF-8 and UTF-16, and those are the only encodings the XML specification says are okay to leave out of the declaration.
Which led me to ask: what about other encodings? If that XML was saved in encoding-lala,
how would I be able to know it?
When Jon referred me to the W3C article, I did find an answer:
> The XML encoding declaration functions as an internal label on each entity, indicating which character encoding is in use. Before an XML processor can read the internal label, however, it apparently has to know what character encoding is in use—which is what the internal label is trying to indicate.
It does this as follows:
> Because each XML entity not accompanied by external encoding information and not in UTF-8 or UTF-16 encoding must begin with an XML encoding declaration, in which the first characters must be `<?xml`, any conforming processor can detect, after two to four octets of input, which of the following cases apply. In reading this list, it may help to know that in UCS-4, `<` is `#x0000003C` and `?` is `#x0000003F`, and the Byte Order Mark required of UTF-16 data streams is `#xFEFF`.
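To make that concrete, here is a minimal Python sketch of the detection table from Appendix F of the spec. The function name and the mapping to Python codec names are my own choices; the byte patterns are the ones the spec lists:

```python
def sniff_encoding_family(head: bytes) -> str:
    """Guess the encoding *family* from the first four octets, following
    the autodetection table in Appendix F of the XML 1.0 spec. The result
    only needs to be good enough to decode the '<?xml ... ?>' declaration."""
    # Byte Order Mark cases. Check the 4-octet UCS-4 marks before the
    # 2-octet UTF-16 marks, since FF FE 00 00 also starts with FF FE.
    if head.startswith(b"\x00\x00\xfe\xff"):
        return "utf-32-be"        # UCS-4, big-endian BOM
    if head.startswith(b"\xff\xfe\x00\x00"):
        return "utf-32-le"        # UCS-4, little-endian BOM
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"            # UTF-8 with BOM
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    if head.startswith(b"\xff\xfe"):
        return "utf-16-le"
    # No BOM: recognize '<?xml' as it would look in each candidate family.
    if head.startswith(b"\x00\x00\x00\x3c"):
        return "utf-32-be"        # '<' is #x0000003C in UCS-4
    if head.startswith(b"\x3c\x00\x00\x00"):
        return "utf-32-le"
    if head.startswith(b"\x00\x3c\x00\x3f"):
        return "utf-16-be"        # '<?' without a BOM
    if head.startswith(b"\x3c\x00\x3f\x00"):
        return "utf-16-le"
    if head.startswith(b"\x3c\x3f\x78\x6d"):
        return "ascii"            # '<?xm': some ASCII-compatible encoding;
                                  # the declaration itself names which one
    return "utf-8"                # no declaration possible, so must be UTF-8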
So it does use heuristics to detect the encoding, by trying to recognize the `<?xml`
string in the first few octets.
Another helpful piece of information is the structure of the encoding
declaration itself, given in the spec's grammar:

    EncodingDecl ::= S 'encoding' Eq ('"' EncName '"' | "'" EncName "'")
    EncName      ::= [A-Za-z] ([A-Za-z0-9._] | '-')*

Notice the regex: only basic ASCII (0..127) characters, introduced by the `encoding`
keyword. A sketch of this second stage follows below.
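Because the encoding name may only use those ASCII characters, a reader can decode the prolog with the sniffed family and then pull the exact name out of the declaration. A rough sketch building on `sniff_encoding_family` above; the helper name, the 1024-byte read, and the simplified regex are my own, and it does not insist that the opening and closing quotes match:

```python
import re

# Simplified EncodingDecl matcher: the encoding name is restricted to
# [A-Za-z] ([A-Za-z0-9._] | '-')*, so plain ASCII is enough to find it.
ENC_DECL = re.compile(r'encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._\-]*)["\']')

def read_declared_encoding(path: str) -> str | None:
    """Hypothetical helper: decode just enough of the file with the
    sniffed family encoding and extract the declared encoding name."""
    with open(path, "rb") as f:
        head = f.read(1024)                  # the declaration must come first
    family = sniff_encoding_family(head)
    prolog = head.decode(family, errors="replace")
    m = ENC_DECL.search(prolog)
    return m.group(1) if m else None         # None: no declaration present
```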
So here is my question:
even if the file was saved as UTF-8/16/blabla, the processor DOES SUCCEED in recognizing the encoding from the first bytes (heuristics or not).
If so, why is `<?xml version="1.0" encoding="xxxxx"?>`
still needed?