Whenever I walk the DOM of an HTML file (I'm only interested in the text), reading a node's textContent property echoes all the text in that node's entire subtree, not just the text that belongs directly to the node. For example:

<html lang="en">
<body>
    <p> 1st text I need</p>
    <a href="#">2nd text I need</a>
    <table>
        <tr>
            <td>3rd text I need</td>
        </tr>
    </table>
</body>
</html> 

That results in the following:

#document
html
html 1st text I need 2nd text I need 3rd text I need 
body 1st text I need 2nd text I need 3rd text I need 
p 1st text I need
a 2nd text I need
table 3rd text I need 
tr 3rd text I need 
td 3rd text I need

I'd like to extract the text only from elements that have direct text content. In the example above, they would be p, a and td.

How can I do that?

Here's the code (extracted from here):

<?php

$doc = new DOMDocument();
@$doc->loadHTMLFile('test.html'); // @ suppresses warnings about malformed HTML
walkDom($doc);

function walkDom($node, $level = 0)
{
    $indent = str_repeat('  ', $level); // prettify the output

    if ($node->nodeType != XML_TEXT_NODE) {
        echo $indent . '<b>' . $node->nodeName . '</b>';
        if ($node->nodeType == XML_ELEMENT_NODE) {
            // textContent concatenates the text of ALL descendants
            echo $node->textContent;
        }
        echo '<br>';
    }

    // hasChildNodes() avoids calling count() on a node list
    if ($node->hasChildNodes()) {
        foreach ($node->childNodes as $cNode) {
            walkDom($cNode, $level + 1); // go one level deeper
        }
    }
}
Paulo Hgo

1 Answer

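You can run an XPath query against the DOM document. The expression //text() selects every text node in the document, wherever it sits in the tree, so each piece of text is returned once instead of being repeated for every ancestor.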

$doc = new DOMDocument;
$doc->loadHTML('<html lang="en">
<body>
    <p> 1st text I need</p>
    <a href="#">2nd text I need</a>
    <table>
        <tr>
            <td>3rd text I need</td>
        </tr>
    </table>
</body>
</html>');
$doc->normalizeDocument();
$xpath = new DOMXPath($doc);
$results = $xpath->query("//text()");
foreach ($results as $node) {
    $text = trim($node->wholeText);
    if ($text !== '') { // skip whitespace-only nodes between tags
        echo $text, PHP_EOL;
    }
}
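If you also need to know which element each piece of text belongs to (p, a and td in the question), one option is to filter out whitespace-only text nodes in the XPath expression itself and read each node's parentNode. This is only a minimal sketch: the helper name directTextByElement is made up for illustration, and it assumes markup like the sample above.

```php
<?php
// Hypothetical helper (not part of any library): returns a list of
// [element name, trimmed text] pairs, one per non-blank text node.
function directTextByElement(string $html): array
{
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of silencing them with @.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    $pairs = [];
    // normalize-space() yields an empty string for whitespace-only nodes,
    // and an empty string is false inside an XPath predicate.
    foreach ($xpath->query('//text()[normalize-space()]') as $textNode) {
        $pairs[] = [$textNode->parentNode->nodeName, trim($textNode->nodeValue)];
    }
    return $pairs;
}

$html = '<html lang="en"><body>
    <p> 1st text I need</p>
    <a href="#">2nd text I need</a>
    <table><tr><td>3rd text I need</td></tr></table>
</body></html>';

foreach (directTextByElement($html) as [$name, $text]) {
    echo $name, ': ', $text, PHP_EOL;
}
```

Filtering in the query keeps the loop free of empty-string checks, and parentNode gives the element that directly contains the text.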
miken32
  • Thanks, that works. Is it possible to modify those text elements though (that was my ultimate goal and I forgot to mention)? – Paulo Hgo Feb 04 '17 at 00:29
  • That would be a whole separate question. – miken32 Feb 04 '17 at 00:31
  • Fair point. I'll submit another question. Thanks for your answer. – Paulo Hgo Feb 04 '17 at 00:32
  • In case I don't see your question, short answer is `$node->nodeValue = str_replace("this", "that", $node->wholeText);` but it can be more complicated in some situations. – miken32 Feb 04 '17 at 00:48
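
The one-liner in the last comment could be expanded along these lines. This is a sketch only, assuming the goal is a plain string replacement in every non-blank text node; the function name replaceInTextNodes and the sample strings are illustrative, not taken from the thread.

```php
<?php
// Hypothetical helper: replaces $search with $replace inside each non-blank
// text node and returns the serialized <body> element.
function replaceInTextNodes(string $html, string $search, string $replace): string
{
    $doc = new DOMDocument();
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $xpath = new DOMXPath($doc);
    foreach ($xpath->query('//text()[normalize-space()]') as $textNode) {
        // Assigning nodeValue changes only this text node; tags and
        // attributes around it are left untouched.
        $textNode->nodeValue = str_replace($search, $replace, $textNode->nodeValue);
    }

    // Assumes the parsed document has a <body>, as the sample input does.
    $body = $doc->getElementsByTagName('body')->item(0);
    return $doc->saveHTML($body);
}

echo replaceInTextNodes(
    '<html><body><p>this one</p><a href="#">and this</a></body></html>',
    'this',
    'that'
);
```

Because the replacement runs per text node, a match that spans two nodes (for example text split by an inline tag) would not be found, which is one of the more complicated situations the comment alludes to.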