I've read the famous post. I've seen the attempts, some with limited success and some outright failures. Oh, the flame wars, both here and elsewhere.
But it can be done.
While I'm aware that the actual argument (read: fact) is that regular expressions alone are unfit to parse structured data trees, because they cannot track and change state such as nesting depth, I feel that some blindly dismiss the possibility. Application logic is necessary to keep that state, but as this working example shows, it can be done.
Relevant snippet follows:
const PARSE_MODE_NEXT = 0;
const PARSE_MODE_ELEMENT = 1;
const PARSE_MODE_ENTITY = 3;
const PARSE_MODE_COMMENT = 4;
const PARSE_MODE_CDATA = 5;
const PARSE_MODE_PROC = 6;
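protected $_parseModes = array(
    // After "<": "!" marks a declaration, optionally refined to a comment ("--") or a CDATA section ("[CDATA[");
    // "?" marks a processing instruction. The refinement is optional so that a bare "<!" reaches entity mode.
    self::PARSE_MODE_NEXT => '% < (?: (?: (?<entity>!) (?: (?<comment>--) | (?<cdata>\[CDATA\[) )? ) | (?<proc>\?) )? %six',
    // Each remaining pattern consumes the rest of one construct, then any text up to the next "<".
    self::PARSE_MODE_ELEMENT => '% (?<close>/)? (?<element> .*? ) (?<empty> / )? > (?<text> [^<]* ) %six',
    self::PARSE_MODE_ENTITY => '% (?<entity> .*? ) > (?<text> [^<]* ) %six',
    self::PARSE_MODE_COMMENT => '% (?<comment> .*? ) --> (?<text> [^<]* ) %six',
    self::PARSE_MODE_CDATA => '% (?<cdata> .*? ) \]\]> (?<text> [^<]* ) %six',
    self::PARSE_MODE_PROC => '% (?<proc> .*? ) \?> (?<text> [^<]* ) %six',
);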
public function load($string){
    $parseMode = self::PARSE_MODE_NEXT;
    $parseOffset = 0;
    // $context is the node currently receiving children; it starts at the document itself.
    $context = $this;
    while(preg_match($this->_parseModes[$parseMode], $string, $match, PREG_OFFSET_CAPTURE, $parseOffset)){
        if($parseMode == self::PARSE_MODE_NEXT){
            // A "<" was found; choose the mode for whatever follows it.
            // !empty() is used because trailing groups that did not participate are absent from $match.
            switch(true){
                case (!empty($match['proc'][0])):
                    $parseMode = self::PARSE_MODE_PROC;
                    break;
                case (!empty($match['cdata'][0])):
                    $parseMode = self::PARSE_MODE_CDATA;
                    break;
                case (!empty($match['comment'][0])):
                    $parseMode = self::PARSE_MODE_COMMENT;
                    break;
                case (!empty($match['entity'][0])):
                    $parseMode = self::PARSE_MODE_ENTITY;
                    break;
                default:
                    $parseMode = self::PARSE_MODE_ELEMENT;
                    break;
            }
        }else{
            switch($parseMode){
                case (self::PARSE_MODE_ELEMENT):
                    switch(true){
                        case (!empty($match['empty'][0])):
                            // Self-closing tag: add it, but stay in the current context.
                            $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                        case (!empty($match['close'][0])):
                            // Closing tag: step back up to the parent.
                            $context = $context->_parent;
                            break;
                        default:
                            // Opening tag: add it and descend into it.
                            $context = $context->addChild(new ZuqMLElement($match['element'][0]));
                            break;
                    }
                    break;
                case (self::PARSE_MODE_ENTITY):
                    $context->addChild(new ZuqMLEntity($match['entity'][0]));
                    break;
                case (self::PARSE_MODE_COMMENT):
                    $context->addChild(new ZuqMLComment($match['comment'][0]));
                    break;
                case (self::PARSE_MODE_CDATA):
                    $context->addChild(new ZuqMLCharacterData($match['cdata'][0]));
                    break;
                case (self::PARSE_MODE_PROC):
                    $context->addChild(new ZuqMLProcessingInstruction($match['proc'][0]));
                    break;
            }
            $parseMode = self::PARSE_MODE_NEXT;
        }
        // The NEXT pattern has no "text" group, so default to an empty string.
        // Comparing against '' keeps a text node such as "0" from being dropped.
        $text = isset($match['text']) ? $match['text'][0] : '';
        if(trim($text) !== ''){
            $context->addChild(new ZuqMLText($text));
        }
        // Resume matching immediately after the end of this match.
        $parseOffset = $match[0][1] + strlen($match[0][0]);
    }
}
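For anyone who wants to poke at the idea without the ZuqML classes, here is a minimal, self-contained sketch of the same mode-switching loop. It is not the parser above: the function name `tokenize()`, the flat token array it returns, and the reduction to two constructs (elements and comments) are all assumptions made for illustration.

```php
<?php
// Hypothetical helper for illustration only; not part of the original class.
// State (mode and offset) lives in plain PHP variables, outside the regexes.
function tokenize($xml)
{
    $patterns = array(
        // Find the next "<" and detect whether it opens a comment.
        'next'    => '% < (?<comment>!--)? %sx',
        // Consume the rest of a tag, then any text up to the next "<".
        'element' => '% (?<close>/)? (?<name> [^>]*? ) (?<empty>/)? > (?<text> [^<]* ) %sx',
        // Consume the rest of a comment, then any text up to the next "<".
        'comment' => '% (?<body> .*? ) --> (?<text> [^<]* ) %sx',
    );

    $tokens = array();
    $mode   = 'next';
    $offset = 0;

    while (preg_match($patterns[$mode], $xml, $m, PREG_OFFSET_CAPTURE, $offset)) {
        if ($mode === 'next') {
            // Choose which pattern reads what follows the "<".
            $mode = !empty($m['comment'][0]) ? 'comment' : 'element';
        } else {
            if ($mode === 'comment') {
                $tokens[] = array('comment', trim($m['body'][0]));
            } elseif (!empty($m['close'][0])) {
                $tokens[] = array('close', trim($m['name'][0]));
            } elseif (!empty($m['empty'][0])) {
                $tokens[] = array('empty', trim($m['name'][0]));
            } else {
                $tokens[] = array('open', trim($m['name'][0]));
            }
            // Emit trailing text only if it is not pure whitespace.
            $text = trim($m['text'][0]);
            if ($text !== '') {
                $tokens[] = array('text', $text);
            }
            $mode = 'next';
        }
        // Continue right after the end of the current match.
        $offset = $m[0][1] + strlen($m[0][0]);
    }
    return $tokens;
}
```

Calling `tokenize('<a>Hi<!-- note --><b/></a>')` yields five tokens in order: `open a`, `text Hi`, `comment note`, `empty b`, `close a`. Building a tree from that stream is then a matter of pushing on `open` and popping on `close`, which is what `$context` does in `load()`.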
Is it complete? Nope.
Is it unbreakable? Certainly not.
Is it fast? Haven't benchmarked, but I cannot imagine it's as fast as DOM.
Does it support XPath/XQuery? Obviously not.
Does it validate or perform any other auxiliary tasks? Sure doesn't.
Will it supersede DOM? Hell no.
However, will it parse this?
<?xml version="1.0" encoding="utf-8"?>
<!ENTITY name="value">
<root>
    <node>
        <node />
        Foo
        <node name="value">
            <node>Bar</node>
        </node>
        <!-- Comment -->
    </node>
    <node>
        <![CDATA[ Character Data ]]>
    </node>
</root>
Yes. Yes it will.
While I would welcome this thread becoming a Community Wiki, provided it meets the requirements, I'll turn this statement into a question.
Focusing on the regex, can anyone foresee a situation in which this would fail horribly when used against well-formed markup? I think I've covered all my bases.
I have no intention of "stirring the pot"; however, I'd like some insight from both sides of the debate.
Note also that I wrote this because SimpleXML was too simple, and DOM was too strict, for one of my applications.