What I know from XMLReader: even though it has the XMLReader::XML_DECLARATION constant, I have never seen that value turn up in the XMLReader::$nodeType property while traversing a document with XMLReader::read(). It looks like the declaration simply gets skipped. I have wondered why that is as well, and I have not yet found any flag or option to change this behavior.
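A quick way to see this for yourself (a minimal sketch, the document content is just made up for the demonstration): traverse a small document that does have a declaration and dump every node type that comes by.

$reader = new XMLReader();
$reader->XML('<?xml version="1.0" encoding="UTF-8"?><root><child/></root>');
while ($reader->read()) {
    // prints ELEMENT (1), END_ELEMENT (15) and so on, but never XML_DECLARATION (17)
    printf("%d %s\n", $reader->nodeType, $reader->name);
}
$reader->close();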
For the output, XMLReader always returns UTF-8 encoded strings. That is the same as with the other libxml-based parts of PHP, so from that side all is clear. But I assume that is not the part you are interested in; you care about the concrete string input in the file you open with XMLReader::open().
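If you want to verify that, here is a small sketch (assuming a hypothetical file latin1.xml that is declared and encoded as ISO-8859-1, and the mbstring extension for the check): every text value the reader hands back validates as UTF-8.

$reader = new XMLReader();
$reader->open('latin1.xml'); // hypothetical ISO-8859-1 encoded file
while ($reader->read()) {
    if ($reader->nodeType === XMLReader::TEXT) {
        var_dump(mb_check_encoding($reader->value, 'UTF-8')); // bool(true)
    }
}
$reader->close();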
Not specifically for XMLReader, I once created a utility class I named XMLRecoder which can detect the encoding of an XML string based on the XML declaration and also based on the BOM. I think you should do both. That is the one part where I think you still need regular expressions, but since the XML declaration must be the very first thing in the document and is a strictly defined processing instruction (PI), you should be able to peek in there reliably.
This is the related part from the XMLRecoder code:
### excerpt from https://gist.github.com/hakre/5194634
/**
 * pcre pattern to access EncodingDecl, see <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>
 */
const DECL_PATTERN = '(^<\?xml\s+version\s*=\s*(["\'])(1\.\d+)\1\s+encoding\s*=\s*(["\'])(((?!\3).)*)\3)';
const DECL_ENC_GROUP = 4;
const ENC_PATTERN = '(^[A-Za-z][A-Za-z0-9._-]*$)';
...
($result = preg_match(self::DECL_PATTERN, $buffer, $matches, PREG_OFFSET_CAPTURE))
&& $result = $matches[self::DECL_ENC_GROUP];
As this shows, the pattern only goes up to the encoding declaration, so it is not complete. However, for the need to extract the encoding (and, in your case, the version as well), it should do the job. I ran it against tons (thousands) of random XML documents for testing.
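For completeness, a rough usage sketch that is not part of the gist ($file stands for whatever you would pass to XMLReader::open()): peek at the first bytes of the file and run the pattern against them. Keep in mind that with PREG_OFFSET_CAPTURE each match entry is an array of [text, offset], so the matched text itself sits in index 0 of the group.

$buffer = file_get_contents($file, false, null, 0, 128); // the declaration fits well within the first bytes
if (preg_match(XMLRecoder::DECL_PATTERN, $buffer, $matches, PREG_OFFSET_CAPTURE)) {
    $version  = $matches[2][0];                          // e.g. "1.0"
    $encoding = $matches[XMLRecoder::DECL_ENC_GROUP][0]; // e.g. "UTF-8"
}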
Another part is the BOM detection:
### excerpt from https://gist.github.com/hakre/5194634
const BOM_UTF_8 = "\xEF\xBB\xBF";
const BOM_UTF_32LE = "\xFF\xFE\x00\x00";
const BOM_UTF_16LE = "\xFF\xFE";
const BOM_UTF_32BE = "\x00\x00\xFE\xFF";
const BOM_UTF_16BE = "\xFE\xFF";
...
/**
 * @param string $string string (recommended length 4 characters/octets)
 * @param string $default (optional) if none detected what to return
 * @return string Encoding, if it can not be detected defaults $default (NULL)
 * @throws InvalidArgumentException
 */
public function detectEncodingViaBom($string, $default = NULL)
{
    $len = strlen($string);
    if ($len > 4) {
        $string = substr($string, 0, 4);
    } elseif ($len < 4) {
        throw new InvalidArgumentException(sprintf("Need at least four characters, %d given.", $len));
    }

    switch (true) {
        case $string === self::BOM_UTF_16BE . $string[2] . $string[3]:
            return "UTF-16BE";
        case $string === self::BOM_UTF_8 . $string[3]:
            return "UTF-8";
        case $string === self::BOM_UTF_32LE:
            return "UTF-32LE";
        case $string === self::BOM_UTF_16LE . $string[2] . $string[3]:
            return "UTF-16LE";
        case $string === self::BOM_UTF_32BE:
            return "UTF-32BE";
    }

    return $default;
}
With the BOM detection I also ran the code against the same set of XML documents; however, not many of them had a BOM. As you can see, the detection order is optimized for the more common scenarios while taking care of the overlapping byte patterns between the different BOMs (for example, the UTF-32LE BOM starts with the two bytes of the UTF-16LE BOM, so it has to be checked first). Most documents I encountered are without a BOM, and you mainly need it to find out whether the document is UTF-32 encoded.
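To tie both parts together, a rough sketch of how I would combine them for a file (assuming the XMLRecoder class from the gist can be constructed without arguments, and $file is again your input file): check the first four octets for a BOM first and only fall back to the declaration pattern when there is none.

$recoder  = new XMLRecoder();
$head     = file_get_contents($file, false, null, 0, 4);
$encoding = $recoder->detectEncodingViaBom($head);
if ($encoding === null) {
    // no BOM found - fall back to peeking at the XML declaration as shown above
    $buffer   = file_get_contents($file, false, null, 0, 128);
    $encoding = preg_match(XMLRecoder::DECL_PATTERN, $buffer, $matches, PREG_OFFSET_CAPTURE)
        ? $matches[XMLRecoder::DECL_ENC_GROUP][0]
        : 'UTF-8'; // UTF-8 is what an XML parser assumes when nothing else is declared
}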
Hope this at least gives some insights.