I'm currently rewriting a PHP class that splits an XML file into smaller chunks, so that it uses XMLReader and XMLWriter instead of its current basic filesystem-and-regex approach.

However, I can't figure out how to get the version, encoding and standalone flags from the XML declaration (the <?xml ... ?> preamble).

The start of my test XML file looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE fakedoctype SYSTEM "fake_doc_type.dtd">

 <!--
 This is a comment, it's here to try and get the parser to break in some way
 --> 

<root attribute="value" otherattribute="othervalue">

I can open it okay with the reader and move through the document with read(), next() etc, but I just can't seem to get whatever's in <?xml ... ?>. The first thing I'm able to access is the fake DOCTYPE.

My testing code is as follows:

$a = new XMLReader ();
var_dump ($a -> open ('/path/to/test/file.xml')); // true
var_dump ($a -> nodeType); // 0
var_dump ($a -> name); // ""
var_dump ($a -> readOuterXML ()); // ''
var_dump ($a -> read ()); // true
var_dump ($a -> nodeType); // 10
var_dump ($a -> readOuterXML ()); // <!DOCTYPE fakedoctype SYSTEM "fake_doc_type.dtd">

Of course I could just always assume XML 1.0, encoding UTF-8 and standalone = yes, but for the sake of correctness I'd really rather read the values from my source feed and use them when generating the split files.

The documentation on XMLReader and XMLWriter seems to be very poor, so there's every chance I've just missed something in the docs. Does anyone know what to do in this case?

GordonM
  • Yes, the documentation is quite poor. I only find the very general info, _“It is important to note that internally, libxml uses the UTF-8 encoding and as such, the encoding of the retrieved contents will always be in UTF-8 encoding.”_ - but no way to retrieve info about the original document. If no other solution comes up, I’d maybe read the first line of the document separately and use a RegExp to parse that info manually if it’s of importance. – CBroe Mar 18 '13 at 13:37

1 Answer

As far as I know, even though XMLReader has the XMLReader::XML_DECLARATION constant, I have never seen that value turn up in the XMLReader::$nodeType property while traversing a document with XMLReader::read().

It looks like the declaration gets skipped. I have also wondered why that is, and I have not yet found any flag or option to change this behavior.
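A quick way to check this observation yourself is to collect every node type that read() reports and look for the declaration constant. This is only a minimal sketch: the inline sample document and the variable names are made up for illustration, and it assumes the xmlreader extension is loaded.

```php
<?php
// Sample document (illustrative only): declaration, comment, root element.
$xml = '<?xml version="1.0" encoding="UTF-8"?>' . "\n<!-- comment -->\n<root/>";

$reader = new XMLReader();
$reader->XML($xml);

// Record the node type of every node that read() stops on.
$seen = array();
while ($reader->read()) {
    $seen[] = $reader->nodeType;
}
$reader->close();

// Is XMLReader::XML_DECLARATION among the reported node types?
var_dump(in_array(XMLReader::XML_DECLARATION, $seen, true));
```

If the behavior described above holds on your PHP/libxml build, the declaration constant should not appear among the collected types.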

For the output, XMLReader always returns UTF-8 encoded strings, the same as the other libxml-based extensions in PHP, so that side is clear. But I assume you are interested in something else: the declaration as it is literally written in the file you open with XMLReader::open().

I once created a utility class named XMLRecoder (not specific to XMLReader) that detects the encoding of an XML string from the XML declaration and also from the BOM. I think you should check both. This is one part where I think you still need regular expressions, but since the XML declaration must be the very first thing in the document, looks like a processing instruction (PI), and has a strictly defined syntax, you should be able to peek in there.

This is some related part from the XMLRecoder code:

### excerpt from https://gist.github.com/hakre/5194634 

/**
 * pcre pattern to access EncodingDecl, see <http://www.w3.org/TR/REC-xml/#sec-prolog-dtd>
 */
const DECL_PATTERN = '(^<\?xml\s+version\s*=\s*(["\'])(1\.\d+)\1\s+encoding\s*=\s*(["\'])(((?!\3).)*)\3)';
const DECL_ENC_GROUP = 4;
const ENC_PATTERN = '(^[A-Za-z][A-Za-z0-9._-]*$)';

...

($result = preg_match(self::DECL_PATTERN, $buffer, $matches, PREG_OFFSET_CAPTURE))
    && $result = $matches[self::DECL_ENC_GROUP];

As this shows, the pattern only matches up to the encoding, so it is not complete (it does not cover standalone). However, for extracting the encoding (and, for your needs, the version), it should do the job. I ran it against thousands of random XML documents for testing.
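For the asker's case, the same idea can be extended to cover all three values. The following is only a sketch, not part of XMLRecoder: the function name parseXmlDeclaration and the pattern are assumptions made up for illustration, and it assumes the input is in an ASCII-compatible encoding such as UTF-8 (UTF-16 or UTF-32 input would have to be converted first).

```php
<?php
/**
 * Hypothetical helper (not part of XMLRecoder): extract version, encoding and
 * standalone from the XML declaration at the start of $buffer.
 *
 * @param string $buffer the first few hundred bytes of the document
 * @return array|null keys 'version', 'encoding', 'standalone' (null when absent),
 *                    or null if no XML declaration is found
 */
function parseXmlDeclaration($buffer)
{
    // Skip a UTF-8 BOM if present; the declaration must follow it directly.
    if (strncmp($buffer, "\xEF\xBB\xBF", 3) === 0) {
        $buffer = substr($buffer, 3);
    }

    // Groups: 2 = version, 4 = encoding (optional), 6 = standalone (optional).
    $pattern = '~^<\?xml\s+version\s*=\s*(["\'])(1\.\d+)\1'
        . '(?:\s+encoding\s*=\s*(["\'])([A-Za-z][A-Za-z0-9._-]*)\3)?'
        . '(?:\s+standalone\s*=\s*(["\'])(yes|no)\5)?'
        . '\s*\?>~';

    if (!preg_match($pattern, $buffer, $m)) {
        return null;
    }

    return array(
        'version'    => $m[2],
        'encoding'   => isset($m[4]) && $m[4] !== '' ? $m[4] : null,
        'standalone' => isset($m[6]) && $m[6] !== '' ? $m[6] : null,
    );
}

// Usage: feed the values into XMLWriter when generating a split file.
$decl = parseXmlDeclaration('<?xml version="1.0" encoding="UTF-8"?>' . "\n<root/>");

$writer = new XMLWriter();
$writer->openMemory();
$writer->startDocument($decl['version'], $decl['encoding'], $decl['standalone']);
$writer->writeElement('root');
$writer->endDocument();
echo $writer->outputMemory();
```

In real use you would pass the first bytes of the source file instead of a string literal, and fall back to defaults of your choice when the function returns null.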

Another part is the BOM detection:

### excerpt from https://gist.github.com/hakre/5194634 

const BOM_UTF_8 = "\xEF\xBB\xBF";
const BOM_UTF_32LE = "\xFF\xFE\x00\x00";
const BOM_UTF_16LE = "\xFF\xFE";
const BOM_UTF_32BE = "\x00\x00\xFE\xFF";
const BOM_UTF_16BE = "\xFE\xFF";

...

/**
 * @param string $string string (recommended length 4 characters/octets)
 * @param string $default (optional) if none detected what to return
 * @return string|null Encoding; if it cannot be detected, returns $default (NULL)
 * @throws InvalidArgumentException
 */
public function detectEncodingViaBom($string, $default = NULL)
{
    $len = strlen($string);

    if ($len > 4) {
        $string = substr($string, 0, 4);
    } elseif ($len < 4) {
        throw new InvalidArgumentException(sprintf("Need at least four characters, %d given.", $len));
    }

    switch (true) {
        case $string === self::BOM_UTF_16BE . $string[2] . $string[3]:
            return "UTF-16BE";

        case $string === self::BOM_UTF_8 . $string[3]:
            return "UTF-8";

        case $string === self::BOM_UTF_32LE:
            return "UTF-32LE";

        case $string === self::BOM_UTF_16LE . $string[2] . $string[3]:
            return "UTF-16LE";

        case $string === self::BOM_UTF_32BE:
            return "UTF-32BE";
    }

    return $default;
}

I also ran the BOM detection against the same set of XML documents, although not many of them had a BOM. As you can see, the checks are ordered so that the more common cases come first, while taking care of the byte patterns that different BOMs share (UTF-32LE begins with the same two bytes as the UTF-16LE BOM). Most documents I encountered have no BOM, and you mainly need the detection to find out whether a document is UTF-32 encoded.
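Once the BOM tells you the encoding, the start of the document can be converted to UTF-8 so that an ASCII-oriented pattern such as DECL_PATTERN can be applied to it. The sketch below is an assumption of how that could look, not code from XMLRecoder: the function name xmlHeadAsUtf8 is hypothetical, and it relies on the mbstring extension for the conversion.

```php
<?php
/**
 * Hypothetical helper: return the start of an XML string as UTF-8 with any
 * BOM removed. Requires the mbstring extension.
 *
 * @param string $head the first bytes of the document
 * @return string
 */
function xmlHeadAsUtf8($head)
{
    // Longer BOMs first: UTF-32LE starts with the same two bytes as UTF-16LE.
    $boms = array(
        'UTF-32LE' => "\xFF\xFE\x00\x00",
        'UTF-32BE' => "\x00\x00\xFE\xFF",
        'UTF-8'    => "\xEF\xBB\xBF",
        'UTF-16LE' => "\xFF\xFE",
        'UTF-16BE' => "\xFE\xFF",
    );

    foreach ($boms as $encoding => $bom) {
        if (strncmp($head, $bom, strlen($bom)) === 0) {
            $body = substr($head, strlen($bom));
            return $encoding === 'UTF-8'
                ? $body
                : mb_convert_encoding($body, 'UTF-8', $encoding);
        }
    }

    // No BOM: assume an ASCII-compatible encoding and return the bytes as-is.
    return $head;
}

// UTF-16LE sample: BOM followed by the declaration, two bytes per character.
$sample = "\xFF\xFE"
    . mb_convert_encoding('<?xml version="1.0" encoding="UTF-16"?>', 'UTF-16LE', 'UTF-8');

var_dump(xmlHeadAsUtf8($sample));
```

Documents in UTF-16 or UTF-32 without a BOM are not handled here; they would need the byte-pattern auto-detection described in the XML specification.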

Hope this at least gives some insights.

hakre
  • Given the work involved versus the payoff, I think it's probably best to just assume UTF8. I'll be sure to come back to this answer if that turns out to not be adequate though. Honestly I've got bigger problems with XMLreader and XMLwriter just now than this one. :) Working with them is not pleasant. – GordonM Mar 19 '13 at 09:45
  • Well, if you're working with `XMLReader`, I can suggest a project I've got running called [*XMLReaderIterator*](http://git.io/xmlreaderiterator), which offers nice interfaces around XMLReader and solves problems with generic programming (iterators): [`XMLReaderIterator` GitHub repo](https://github.com/hakre/XMLReaderIterator), and there is also a single-file ongoing [`XMLReaderIterator` gist release](https://gist.github.com/hakre/5147685) - maybe it's helpful. Also, it would be great if you could turn your problems into more generic questions here on SO; we need more XMLReader-based Q&A :). – hakre Mar 19 '13 at 09:56
  • I'd say what we really could do with is proper documentation for XMLReader and XMLWriter on php.net. :) It's nowhere near the standard of the rest of the docs. – GordonM Mar 19 '13 at 10:02
  • Well, I was not missing that much. The XML declaration you point to here was one of the things I'm not really sure about, namely why it is never returned as `nodeType`. Another thing that could be better documented is the meaning of [`XMLReader::$value`](http://www.php.net/manual/en/class.xmlreader.php#xmlreader.props.value), because it's not always the text value. I think XMLReader differs here with significant whitespace. And XMLReader is more concrete on XML than the other libs; I just linked this on my blog yesterday: http://www.xml.com/axml/testaxml.htm – hakre Mar 19 '13 at 10:06