This is a question for the regex gurus.

If I have a series of xml nodes, I would like to parse out (using regex) the contained node values that exist on the same level as my current node. For instance, if I have:

<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>

I would like to retrieve an array that is:

array(
    1 => 'Hi',
    2 => 'Hey',
    3 => 'Bar'
)

I know I can start with

$inside = preg_match('~<(\S+).*?>(?P<inside>(.|\s)*)</\1>~', $original_text);

and that will retrieve the text sans the top-node. However, the next step is a bit beyond my regex abilities.

EDIT: Actually, that preg_match appears to work only if $original_text is all on one line. Additionally, I think I can use preg_split with a very similar regex to retrieve what I am looking for; it just isn't working across multiple lines.

NOTE: I appreciate and will oblige any requests for clarification; however, my question is pretty specific and I mean what I am asking, so don't give an answer like "go use SimpleXML" or something. Thank you for any and all assistance.

MirroredFate

2 Answers

Description

This regex will capture the first level of text

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*

Expanded

(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?   # match any open tags until the close tags if they exist
[\s\r\n]*    # match any leading spaces or new line characters 
\K           # reset the capture and only capture the desired substring which follows
(?!\Z)       # validate substring is not the end of the string, this prevents the phantom empty array value at the end
(?:(?![\s\r\n]*(?:<|\Z)).)*    # capture the text inside the current substring, this expression is self limiting and will stop when it sees whitespace ahead followed by end of string or a new tag

Example

Sample Text

This assumes you have already removed the outer top-level tags.

Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar

Capture Groups

0: is the actual captured group
1: is the name of the subtag which is then back referenced inside the regex

[0] => Array
    (
        [0] => Hi
        [1] => Hey
        [2] => Bar
    )

[1] => Array
    (
        [0] => 
        [1] => second-node
        [2] => third-node
    )
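
The output above can be reproduced with a short script. This is a minimal sketch, not necessarily how the answer's author ran it: it assumes the `s` modifier (so that `.` also matches newlines, which the `.*?` between the opening and closing tag needs for multi-line nodes) and `~` as the delimiter. The variable names `$text`, `$pattern`, and `$matches` are illustrative.

```php
<?php
// Sample text with the outer <top-node> tags already removed, as the answer assumes.
$text = <<<'XML'
Hi
<second-node>
    Hello
    <inner-node>
    </inner-node>
</second-node>
Hey
<third-node>
   Foo
</third-node>
Bar
XML;

// The answer's pattern wrapped in ~ delimiters. A nowdoc is used so the
// single and double quotes inside the pattern need no escaping.
// Assumption: the "s" modifier, so "." can cross line breaks.
$pattern = <<<'REGEX'
~(?:[\s\r\n]*<([^>\s]+)\s?(?:[^>=]|='[^']*'|="[^"]*"|=[^'"][^\s>]*)*>.*?<\/\1>)?[\s\r\n]*\K(?!\Z)(?:(?![\s\r\n]*(?:<|\Z)).)*~s
REGEX;

// $matches[0] holds the first-level text, $matches[1] the skipped tag names.
preg_match_all($pattern, $text, $matches);
print_r($matches[0]);
```

Because of `\K`, everything matched before it (the skipped child node and leading whitespace) is dropped from the reported match, which is why `$matches[0]` contains only the text.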

Disclaimer

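This solution will get hung up on nested structures where a child node has the same name as its parent, because the lazy match stops at the first matching closing tag: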

Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey
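
One possible way around the nesting limitation is a recursive pattern. The sketch below is a different technique from the regex above, not part of the original answer: it uses PCRE's `(?R)` to match balanced elements and `preg_split` to cut them out. It assumes well-formed input with no self-closing tags, comments, or CDATA, and the variable names are hypothetical.

```php
<?php
// The nested same-name case from the disclaimer.
$text = <<<'XML'
Hi
<second-node>
    Hello
    <second-node>
    </second-node>
    This string will be found
</second-node>
Hey
XML;

// Assumption: well-formed input, no self-closing tags, comments, or CDATA.
// An element is an opening tag, then any mix of plain text and nested
// elements, then a closing tag. (?R) recurses into the whole pattern, so
// tags are paired by balance (tag names are not compared).
$element = '~<[^/>][^>]*>(?:[^<]++|(?R))*</[^>]+>~';

// Split on whole elements, then trim whitespace and drop empty pieces.
$parts = preg_split($element, $text);
$texts = array_values(array_filter(array_map('trim', $parts), 'strlen'));
print_r($texts);
```

Here the inner "This string will be found" stays inside the element that gets split away, so only the first-level text remains.
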
Ro Yo Mi

Based on your own idea, using a preg_split I came up with:

$raw="<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

$reg='~<(\S+).*?>(.*?)</\1>~s';
preg_match_all($reg, $raw, $res);
$res = explode(chr(31), preg_replace($reg, chr(31), $res[2][0]));

Note: chr(31) is the 'unit separator' control character.

Testing resulting array with:

echo ("<xmp>start\n" . print_r($res, true) . "\nfin</xmp>");

That seems to work for one node, giving you the array you asked for, but it will probably run into all sorts of problems. You might want to trim the returned values too.
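
As a rough sketch of the trimming step, the variant below reuses the same regex but swaps the chr(31) replace-and-explode for `preg_split` (the function the question mentions) and applies `trim` to each piece. The names `$m` and `$parts` are illustrative, and the same caveats about nested or long input still apply.

```php
<?php
$raw = "<top-node>
    Hi
    <second-node>
        Hello
        <inner-node>
        </inner-node>
    </second-node>
    Hey
    <third-node>
       Foo
    </third-node>
    Bar
</top-node>";

// Same pattern as above; the "s" modifier lets "." match newlines.
$reg = '~<(\S+).*?>(.*?)</\1>~s';

$res = [];
if (preg_match($reg, $raw, $m)) {
    // $m[2] is the content of the outer node; split it on each child node.
    $parts = preg_split($reg, $m[2]);
    // Strip the surrounding whitespace and line breaks from every piece.
    $res = array_map('trim', $parts);
}
print_r($res);
```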

EDIT:
Denomales' answer is probably better.

GitaarLAB
  • That's pretty much what I came up with after asking this question. Unfortunately, I then ran into a problem where if the string I am matching is over a certain length, it doesn't work. – MirroredFate Jul 15 '13 at 15:56