Parsing XML/XHTML data with Regex

Question

I've read the famous post. I've seen the attempts, some with limited success and some outright failures. Oh, the flame wars, both here and elsewhere.

But it can be done.

While I'm aware that the actual argument (read: fact) is that regular expressions are unfit to parse structured data trees, due to their inability to monitor and change state, I feel that some blindly discard the possibility. Application logic is necessary to keep state, but as this working example shows, it can be done.

Relevant snippet follows:

const PARSE_MODE_NEXT = 0;
const PARSE_MODE_ELEMENT = 1;
const PARSE_MODE_ENTITY = 3;
const PARSE_MODE_COMMENT = 4;
const PARSE_MODE_CDATA = 5;
const PARSE_MODE_PROC = 6;

protected $_parseModes = array(
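        // The comment/CDATA alternation is optional so that "<!" followed by anything else (e.g. <!ENTITY ...>) selects entity mode.
        self::PARSE_MODE_NEXT     => '% < (?: (?: (?<entity>!) (?: (?<comment>--) | (?<cdata>\[CDATA\[) )? ) | (?<proc>\?) )? %six',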
        self::PARSE_MODE_ELEMENT  => '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six',
        self::PARSE_MODE_ENTITY   => '% (?<entity> .*? ) > (?<text> [^<]* ) %six',
        self::PARSE_MODE_COMMENT  => '% (?<comment> .*? ) --> (?<text> [^<]* ) %six',
        self::PARSE_MODE_CDATA    => '% (?<cdata> .*? ) \]\]> (?<text> [^<]* ) %six',
        self::PARSE_MODE_PROC     => '% (?<proc> .*? ) \?> (?<text> [^<]* ) %six',
    );

public function load($string){
    $parseMode = self::PARSE_MODE_NEXT;
    $parseOffset = 0;
    $context = $this;
    while(preg_match($this->_parseModes[$parseMode], $string, $match, PREG_OFFSET_CAPTURE, $parseOffset)){
        if($parseMode == self::PARSE_MODE_NEXT){
            switch(true){
                // Unmatched trailing groups are absent from $match, so test with empty() instead of reading them directly.
                case (empty($match['entity'][0]) && empty($match['comment'][0]) && empty($match['cdata'][0]) && empty($match['proc'][0])):
                    $parseMode = self::PARSE_MODE_ELEMENT;
                    break;
                case (!empty($match['proc'][0])):
                    $parseMode = self::PARSE_MODE_PROC;
                    break;
                case (!empty($match['cdata'][0])):
                    $parseMode = self::PARSE_MODE_CDATA;
                    break;
                case (!empty($match['comment'][0])):
                    $parseMode = self::PARSE_MODE_COMMENT;
                    break;
                case (!empty($match['entity'][0])):
                    $parseMode = self::PARSE_MODE_ENTITY;
                    break;
            }
        }else{
            switch($parseMode){
                case (self::PARSE_MODE_ELEMENT):
                    switch(true){
                        case (!($match['close'][0] || $match['empty'][0])):
                            $context = $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case ($match['empty'][0]):
                            $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case ($match['close'][0]):
                            $context = $context->_parent;
                            break;
                    }
                    break;
                case (self::PARSE_MODE_ENTITY):
                    $context->addChild(new ZuqMLEntity($match['entity'][0]));
                    break;
                case (self::PARSE_MODE_COMMENT):
                    $context->addChild(new ZuqMLComment($match['comment'][0]));
                    break;
                case (self::PARSE_MODE_CDATA):
                    $context->addChild(new ZuqMLCharacterData($match['cdata'][0]));
                    break;
                case (self::PARSE_MODE_PROC):
                    $context->addChild(new ZuqMLProcessingInstruction($match['proc'][0]));
                    break;
            }
            $parseMode = self::PARSE_MODE_NEXT;
        }
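        // The PARSE_MODE_NEXT pattern has no 'text' group; compare against '' so that a text node of "0" is kept.
        if(isset($match['text']) && trim($match['text'][0]) !== ''){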
            $context->addChild(new ZuqMLText($match['text'][0]));
        }
        $parseOffset = $match[0][1] + strlen($match[0][0]);
    }

}

Is it complete? Nope.

Is it unbreakable? Certainly not.

Is it fast? Haven't benchmarked, but I cannot imagine it's as fast as DOM.

Does it support XPath/XQuery? Obviously not.

Does it validate or perform any other auxiliary tasks? Sure doesn't.

Will it supersede DOM? Hell no.

However, will it parse this?

<?xml version="1.0" encoding="utf-8"?>
<!ENTITY name="value">
<root>
    <node>
        <node />
        Foo
        <node name="value">
            <node>Bar</node>
        </node>
        <!-- Comment -->
    </node>
    <node>
        <![CDATA[ Character Data ]]>
    </node>
</root>

Yes. Yes it will.

While I would welcome this thread becoming a Community Wiki given it meets the requirements, I'll turn this statement into a question.

Focusing on the regex, can anyone foresee a situation under which this would fail horribly when used against well-formed markup? I think I've covered all my bases.

I have no intention of "stirring the pot"; however, I'd like some insight from both sides of the argument.

Note also that the purpose for having written this was that SimpleXML was too simple, and DOM was too strict for one of my applications.

score 1 · Accepted Answer · answered Jan 25 '11 at 09:54

> Focusing on the regex, can anyone foresee a situation under which this would fail horribly when used against well-formed markup?

When run against the XML conformance test suite, how many well-formed XML documents does it reject, and how many ill-formed XML documents does it accept?
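As one data point for the first half of that question, here is a minimal sketch that runs the question's `PARSE_MODE_ELEMENT` pattern against a well-formed start tag whose attribute value contains a literal `>` (XML permits this inside attribute values). The variable names are illustrative and not part of the original class; the leading `<` is left off because the `PARSE_MODE_NEXT` pattern would already have consumed it.

```php
<?php
// PARSE_MODE_ELEMENT pattern copied verbatim from the question.
$elementPattern = '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six';

// Remainder of the well-formed input <node label="a>b">Bar</node> after the opening "<".
$afterOpenBracket = 'node label="a>b">Bar</node>';

preg_match($elementPattern, $afterOpenBracket, $match);

// The lazy .*? stops at the first ">", which sits inside the attribute value,
// so the tag is cut short and the rest of the attribute leaks into the text.
echo 'element: ', $match['element'], "\n";
echo 'text:    ', $match['text'], "\n";
```

The element capture ends at `node label="a` and the remainder of the attribute is reported as text, so this particular well-formed input is mis-tokenized rather than rejected.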

Perhaps the biggest objection from those who share the culture of the XML community is that it will not only parse most well-formed XML documents, it will also parse most non-XML documents, in the sense that it doesn't tell you they are ill-formed. Now perhaps you think that doesn't matter too much in your environment - but in the end, if you accept ill-formed documents, then people will start sending you ill-formed documents, and before long you are in the same mess as HTML, where you have to accept any old rubbish for legacy reasons.
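For comparison, a conforming parser reports ill-formed input instead of building a tree from it. The sketch below uses PHP's bundled `DOMDocument` and libxml error handling; `isWellFormed` is a hypothetical helper name introduced here for illustration. The loop in the question moves to the parent on any end tag without comparing names, so it would presumably build a tree from the second input as well.

```php
<?php
// Hypothetical helper: returns true only if libxml accepts $xml as well-formed.
function isWellFormed(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous); // restore the caller's setting
    return $ok;
}

$good = '<root><node>Bar</node></root>';
$bad  = '<root><node>Bar</root></node>'; // end tags in the wrong order

echo $good, ' => ', isWellFormed($good) ? 'well-formed' : 'rejected', "\n";
echo $bad,  ' => ', isWellFormed($bad)  ? 'well-formed' : 'rejected', "\n";
```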

I don't know enough PHP to judge quickly how well your code will work against well-formed XML. But I question the motivation - why on earth would you want to write a cheap-and-dirty-and-slow XML parser by hand when there are perfectly good-and-correct-and-fast-and-free ones available off the shelf?

While you raise some inarguable points, the main reasons for having written this were: `SimpleXML` was too simple, providing limited support for what I needed; `DOM` was too strict, causing more problems than it was solving; I needed something that was migration friendly, that could easily be included in any project regardless of environment restrictions; and for its purpose, I'm not concerned with validation, only the ability to parse and restructure structured documents. *However*, speed is an issue, and at over 30 times slower (100,000 test iterations) than `DOM`, it fails, and is unusable. :( - Dan Lugg, Jan 26 '11 at 20:13
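A timing comparison like the one mentioned in that comment could be set up roughly as follows. This is only a sketch: `timeParser` is a hypothetical helper, the sample document is made up, and only the `DOM` side is timed here because the full regex-based class is not shown in the question.

```php
<?php
// Hypothetical helper: calls $parse($xml) $iterations times and returns elapsed seconds.
function timeParser(callable $parse, string $xml, int $iterations): float
{
    $start = hrtime(true); // monotonic clock, in nanoseconds
    for ($i = 0; $i < $iterations; $i++) {
        $parse($xml);
    }
    return (hrtime(true) - $start) / 1e9;
}

// Made-up sample input, not the document used in the original measurement.
$xml = '<root><node>Bar</node><node /></root>';

$domSeconds = timeParser(function (string $xml): void {
    $doc = new DOMDocument();
    $doc->loadXML($xml);
}, $xml, 100000);

printf("DOM: %.3f s for 100,000 iterations\n", $domSeconds);
```

Passing a second closure that wraps the regex-based `load()` method to the same helper would give the other half of the comparison.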
