I'm trying to parse HTML ordered/unordered lists recursively into an OOP structure and stumbled on an issue. Let's say I have this section of code:

$text = '
<ol>
    <li>
        <ul>
            <li>aaa</li>
            <li>bbb</li>
        </ul>
    </li>
    <li>fff</li>
    <li>
        <ol>
            <li>ccc</li>
            <li>ddd</li>
        </ol>
    </li>
</ol>
';
preg_match_all("/<ol>(.+?)<\/ol>/mis", $text, $matches);

The problem is that both greedy and lazy matching seem to go as shallow as possible. What I want is the opposite: to go from deepest to shallowest, so the above expression should match:

<ol>
    <li>ccc</li>
    <li>ddd</li>
</ol>

Any ideas?

1 Answer

Regular expressions should only be used to extract specific pieces of data from an HTML document (treating it as plain text), not to parse its nested structure.

Parsing HTML into an OOP structure is what DOMDocument::loadHTML() does. The resulting OOP structure is the standardized DOM. Using DOM methods and XPath expressions, you can traverse, read, and manipulate the data.

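$document = new DOMDocument();
$document->loadHTML($text);
$xpath = new DOMXPath($document);

// Select every <li> that contains no nested <li>, i.e. the leaf items.
foreach ($xpath->evaluate('//li[not(.//li)]') as $liLeaf) {
    echo "LABEL: ", $liLeaf->textContent, "\n";
    // Zero-based position among the sibling <li> elements.
    echo "INDEX: ", $xpath->evaluate('count(preceding-sibling::li)', $liLeaf), "\n";
    // Nesting depth: number of enclosing <ol>/<ul> elements.
    echo "LEVEL: ", $xpath->evaluate('count(ancestor::*[self::ol or self::ul])', $liLeaf), "\n";
    // Element name of the list that directly contains the item.
    echo "IN: ", $xpath->evaluate('local-name(parent::*)', $liLeaf), "\n";
    echo "\n";
}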

Output:

LABEL: aaa
INDEX: 0
LEVEL: 2
IN: ul

LABEL: bbb
INDEX: 1
LEVEL: 2
IN: ul

LABEL: fff
INDEX: 1
LEVEL: 1
IN: ol

LABEL: ccc
INDEX: 0
LEVEL: 2
IN: ol

LABEL: ddd
INDEX: 1
LEVEL: 2
IN: ol
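
If you want the innermost <ol> itself, or a nested structure rather than a flat list of leaves, the same approach extends naturally. The following is only a minimal sketch: the helper name listToArray() and the array layout ('type', 'items', 'label', 'lists') are my own assumptions for illustration, not part of the DOM API, and the label is taken from the first text node of each <li> only.

```php
<?php
$text = '
<ol>
    <li>
        <ul>
            <li>aaa</li>
            <li>bbb</li>
        </ul>
    </li>
    <li>fff</li>
    <li>
        <ol>
            <li>ccc</li>
            <li>ddd</li>
        </ol>
    </li>
</ol>
';

$document = new DOMDocument();
$document->loadHTML($text);
$xpath = new DOMXPath($document);

// Innermost <ol> elements: those that contain no further <ol>.
$innermost = $xpath->evaluate('//ol[not(.//ol)]');
foreach ($innermost as $ol) {
    echo $document->saveHTML($ol), "\n";
}

// Hypothetical helper: converts an <ol>/<ul> element into a nested array.
function listToArray(DOMElement $list, DOMXPath $xpath): array
{
    $items = [];
    // 'li' only matches direct children, so each call handles one level.
    foreach ($xpath->evaluate('li', $list) as $li) {
        $nested = [];
        // Recurse into lists that sit directly inside this item.
        foreach ($xpath->evaluate('ol | ul', $li) as $child) {
            $nested[] = listToArray($child, $xpath);
        }
        $items[] = [
            // First text node of the item, with whitespace collapsed.
            'label' => $xpath->evaluate('normalize-space(text())', $li),
            'lists' => $nested,
        ];
    }
    return ['type' => $list->nodeName, 'items' => $items];
}

// Top-level lists are the ones not nested inside any <li>.
$topLevel = $xpath->evaluate('(//ol | //ul)[not(ancestor::li)]');
$tree = listToArray($topLevel->item(0), $xpath);
print_r($tree);
```

The recursion follows the document structure instead of trying to match balanced tags, which is the part a regular expression cannot do reliably. You could swap the arrays for your own classes once the shape fits your needs.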
ThW