
I am trying to parse HTML source code with a recursive (nested) regular expression:

`(\˂[0-9a-z\s#-_="]*\˃((?>[^\˂\˃]+)|(?R))*(\˂/[a-z]*\˃)?\s*)*` inspired from here:

My problem is that the match only gives me 2 levels (the div and table tags). Is there something wrong with my regex?

<pre>
<?php
$pattern = '/(\˂[0-9a-z\s#-_="]*\˃((?>[^\˂\˃]+)|(?R))*(\˂\/[a-z]*\˃)?\s*)*/mx';

$subject = <<<EOT
˂div class="post"˃
    ˂table˃
        ˂tbody˃
            ˂tr height="12"˃
                ˂td˃˂/td˃
                ˂td width="20" class="strip" rowspan="5"˃   
                    ˂div class="follow unpublish"˃☆˂/div˃                                   
                    ˂div class="follow report"˃⚐˂/div˃
                ˂/td˃
            ˂/tr˃
        ˂/tbody˃
    ˂/table˃
˂/div˃
EOT;

preg_match($pattern, $subject, $matches);
print_r($matches);
?>
</pre>

Run on phpFiddle

profimedica
  • There's this typical [**TONY THE PONY**](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) smell... In short, use a parser, i.e. `DOMDocument`, instead. – Jan Nov 19 '16 at 08:55
  • Using a regex should be possible for clean HTML (no JavaScript or style, just tags with string attributes). I am trying to avoid a parser. Once defined, a regex is easier to use than a parser, and it can be reused in other languages that support recursive regexes. – profimedica Nov 19 '16 at 09:24
  • Roses are red, Skies are blue, DOMs are severed by parsers solely. – Mohammad Yusuf Nov 19 '16 at 09:25
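
For anyone who wants to try the parser route suggested in the comments, here is a minimal sketch using PHP's built-in `DOMDocument`. It is not the asker's approach, only an illustration under some assumptions: the markup uses ordinary ASCII angle brackets rather than the ˂ ˃ look-alike characters in the sample, and `collectElements` is a hypothetical helper name made up for this example. It walks the tree recursively and records every element together with its nesting depth.

```php
<?php
// Sketch only: list every element of an HTML fragment with its nesting depth.
// collectElements() is a hypothetical helper, not part of any library.

function collectElements(DOMNode $node, int $depth = 0): array
{
    $found = [];
    foreach ($node->childNodes as $child) {
        // Skip text nodes (whitespace, cell contents); keep only elements.
        if ($child instanceof DOMElement) {
            $found[] = [$depth, $child->nodeName];
            // Recurse into the element to pick up deeper levels.
            $found = array_merge($found, collectElements($child, $depth + 1));
        }
    }
    return $found;
}

// Same structure as the sample in the question, with ASCII angle brackets.
$html = <<<EOT
<div class="post">
    <table>
        <tbody>
            <tr height="12">
                <td></td>
                <td width="20" class="strip" rowspan="5">
                    <div class="follow unpublish">☆</div>
                    <div class="follow report">⚐</div>
                </td>
            </tr>
        </tbody>
    </table>
</div>
EOT;

// Collect parser warnings internally instead of printing them.
libxml_use_internal_errors(true);

$doc = new DOMDocument();
// NOIMPLIED / NODEFDTD keep libxml from wrapping the fragment in html/body
// elements or adding a doctype.
$doc->loadHTML($html, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

$elements = collectElements($doc);
foreach ($elements as [$depth, $name]) {
    echo str_repeat('  ', $depth), $name, PHP_EOL;
}
```

With the fragment above, this should print one line per element, indented by depth, from the outer div down to the two inner div elements. Because the recursion happens in ordinary PHP code, every level is kept, whereas `preg_match` does not retain the groups captured inside a `(?R)` recursion once that recursion returns.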

0 Answers