Parsing non-node, intermittent XML values using regex

Question

This is a question for the regex gurus.

If I have a series of xml nodes, I would like to parse out (using regex) the contained node values that exist on the same level as my current node. For instance, if I have:

<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>

I would like to retrieve an array that is:

array(
    1 => 'Hi',
    2 => 'Hey',
    3 => 'Bar'
)

I know I can start with

preg_match('~<(\S+).*?>(?P<inside>(.|\s)*)</\1>~', $original_text, $matches);
$inside = $matches['inside'];

and that will retrieve the text sans the top-node. However, the next step is a bit beyond my regex abilities.

EDIT: Actually, that preg_match appears to work only if $original_text is all on one line. Additionally, I think I can use preg_split with a very similar regex to retrieve what I am looking for; it just isn't working across multiple lines.

NOTE: I appreciate and will oblige any requests for clarification; however, my question is pretty specific and I mean what I am asking, so don't give an answer like "go use SimpleXML" or something. Thank you for any and all assistance.

some (relevant) comic relief: http://stackoverflow.com/a/1732454/588079, http://stackoverflow.com/q/8577060/588079, http://www.codinghorror.com/blog/2009/11/parsing-html-the-cthulhu-way.html — GitaarLAB, Jul 12 '13 at 23:54
Still continuing your quest or are you getting ready for a parser? — GitaarLAB, Jul 13 '13 at 00:15
Nah, I would prefer not to go through utilizing some additional library to accomplish what *should* be a relatively simple task. — MirroredFate, Jul 13 '13 at 00:20
You say "so don't give an answer like 'go use SimpleXML' or something", but that *is* the answer. — Andy Lester, Jul 13 '13 at 04:51

Ro Yo Mi · Accepted Answer · 2013-07-13T03:38:56.463

Description

This regex captures the first level of text, i.e. the text that sits directly between the child nodes:

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*

Expanded

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?   # optionally match a child element, from its opening tag through the first closing tag of the same name
[\s\r\n]*    # match any leading spaces or newline characters
\K           # reset the start of the reported match, so only the substring that follows is returned
(?!\Z)       # assert we are not at the end of the string; this prevents a phantom empty value at the end of the array
(?:(?![\s\r\n]*(?:<|\Z)).)*    # consume the text one character at a time; this is self-limiting and stops when what lies ahead is optional whitespace followed by a new tag or the end of the string

Example

Sample Text

This assumes you have already removed the outer top-level tags.

Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar

Capture Groups

0: the full match, which is the text you are after
1: the name of the child tag that was skipped, which is back-referenced inside the regex

[0] => Array
    (
        [0] => Hi
        [1] => Hey
        [2] => Bar
    )

[1] => Array
    (
        [0] => 
        [1] => second-node
        [2] => third-node
    )
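
For reference, here is a minimal sketch of how the pattern could be run from PHP. The `~` delimiters, the `s` modifier (so that `.` also matches newlines) and the variable names are my assumptions; none of them are spelled out above. The pattern is kept in a nowdoc so that the quotes inside it need no escaping.

```php
<?php
// Sample text from above, with the outer <top-node> tags already removed.
$text = <<<'XML'
Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar
XML;

// The pattern from the Description, wrapped in ~ delimiters.
// Assumption: the s modifier is needed so .*? can span several lines.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

preg_match_all($pattern, $text, $matches);

print_r($matches[0]); // the top-level text values
print_r($matches[1]); // the child tag skipped before each value (empty for the first)
```

With the sample text this should give the two arrays listed above; treat it as a starting point rather than a tested solution for arbitrary XML.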

Disclaimer

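This solution will get hung up on nested structures where a node contains a child with the same name. The lazy .*? stops at the first matching closing tag, so text that follows the inner closing tag is wrongly captured: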

Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey

Thanks for the good answer! I won't be able to try this for a few days, but I will let you know how it turns out. — MirroredFate, Jul 13 '13 at 01:38

score 1 · Answer 2 · answered Jul 13 '13 at 01:59

Based on your own idea of splitting the text with a similar regex, I came up with:

$raw="<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

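// Element = opening tag, lazy content, closing tag of the same name; s lets . match newlines.
$reg='~<(\S+).*?>(.*?)</\1>~s';
preg_match_all($reg, $raw, $res); // $res[2][0] is the content of top-node
// Replace each child element with a separator character, then split on it.
$res = explode(chr(31), preg_replace($reg, chr(31), $res[2][0]));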

Note: chr(31) is the ASCII 'unit separator' control character.

Testing resulting array with:

echo ("<xmp>start\n" . print_r($res, true) . "\nfin</xmp>");

That seems to work for one top-level node, giving you the array you asked for, but it will probably have all sorts of problems. You might want to trim the returned values too.
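
A rough sketch of the trimming step, assuming the same regex and separator as above. The shortened `$raw` and the names `$inner`, `$pieces` and `$trimmed` are only for illustration.

```php
<?php
// Shortened version of the sample input, for illustration only.
$raw = "<top-node>
    Hi
    <second-node>
        Hello
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

$reg = '~<(\S+).*?>(.*?)</\1>~s';

// Grab the content of the outermost element.
preg_match_all($reg, $raw, $res);
$inner = $res[2][0];

// Swap every child element for a separator character, then split on it.
$pieces = explode(chr(31), preg_replace($reg, chr(31), $inner));

// Strip the surrounding whitespace and newlines from each piece.
$trimmed = array_map('trim', $pieces);

print_r($trimmed);
```

If the text can start or end with a child node you would also get empty strings, so filtering those out afterwards may be worth considering.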

EDIT:
Denomales' answer is probably better.

That's pretty much what I came up with after asking this question. Unfortunately, I then ran into a problem where if the string I am matching is over a certain length, it doesn't work. — MirroredFate, Jul 15 '13 at 15:56
