Looping through all children of an element with DOMDocument and extracting the text content

Question

This is the structure of an XML file (the content.xml of an ODT file) that I am trying to parse:

<office:body>
    <office:text>
        <text:h text:style-name="P1" text:outline-level="2">Chapter 1</text:h>
            <text:p text:style-name="Standard">Lorem ipsum. </text:p>

            <text:h text:style-name="Heading3" text:outline-level="3">Subtitle 2</text:h>
                <text:p text:style-name="Standard"><text:span text:style-name="T5">10</text:span><text:span text:style-name="T6">:</text:span><text:s/>Text (100%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard">9.7:<text:s/>Text (97%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                <text:p text:style-name="Standard"><text:span text:style-name="T9">9.1:</text:span><text:s/>Text (91%)</text:p>
                    <text:p text:style-name="Explanation">Further informations.</text:p>
                    <text:p text:style-name="Explanation">More furter informations.</text:p>
    </office:text>
</office:body>

With XMLReader I did it this way:

$html = '';
while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'text:h') {
        // Only second-level headings are converted
        if ($reader->getAttribute('text:outline-level') == "2") {
            $html .= '<h2>' . $reader->expand()->textContent . '</h2>';
        }
    } elseif ($reader->nodeType == XMLReader::ELEMENT && $reader->name === 'text:p') {
        if ($reader->getAttribute('text:style-name') == "Standard") {
            $html .= '<p>' . $reader->readInnerXml() . '</p>';
        } else {
            // Doing something different
        }
    }
}
echo $html;

Now I would like to do the same thing with DOMDocument, but I need some help with the syntax. How can I loop through all children of office:text? While looping through all nodes, I would check via if/else what to do (text:h vs. text:p).

I also need to replace every text:s element (if there are any in a text:p) with a space.

$reader = new DOMDocument();
$reader->preserveWhiteSpace  = false;
$reader->load('zip://content.odt#content.xml');

$body = $reader->getElementsByTagName( 'office:text' )->item( 0 );
foreach( $body->childNodes as $node ) echo $node->nodeName . PHP_EOL;

Or would it be smarter to loop through all text elements? If so, the question remains how to do that.

$elements = $reader->getElementsByTagName('text');
foreach($elements as $node){
    foreach($node->childNodes as $child) {
        echo $child->nodeName.': ';
        echo $child->nodeValue.'<br>';
        // check for type...
    }
}

score 0 · Accepted Answer · edited May 23 '17 at 10:33

One of the easiest ways to do that with DOMDocument is with the help of DOMXPath.

Taking your question literally:

How can I loop through all children of office:text?

This can be represented as an XPath expression:

//office:text/child::node()

However, your wording is slightly off. You want not only the children, but also the children of the children and so on, that is, all descendants:

//office:text/descendant::node()

Or with the abbreviated syntax:

//office:text//node()

Compare with: XPath to Get All ChildNodes and not the Parent Node
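
To make the difference between the two axes concrete, here is a minimal sketch. The sample XML is a shortened, hypothetical stand-in for your content.xml, and the variable names are mine; the namespace URIs are the standard OpenDocument ones.

```php
<?php
// Shortened sample; office:text has exactly two element children.
$xml = <<<'XML'
<office:body xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
<office:text><text:h text:outline-level="2">Chapter 1</text:h><text:p>9.7:<text:s/>Text</text:p></office:text>
</office:body>
XML;

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');

// Direct children only: text:h and text:p
$children = $xpath->query('//office:text/child::node()');

// All descendants: the two elements, their text nodes and the text:s element
$descendants = $xpath->query('//office:text/descendant::node()');

echo $children->length, PHP_EOL;    // 2
echo $descendants->length, PHP_EOL; // 6
```

Attributes are not counted in either case, because they are not on the child or descendant axis.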

To loop over these nodes in PHP, you need to register the namespace for the office prefix and then loop over the XPath result with a foreach:

$xpath = new DOMXPath($reader);
$xpath->registerNamespace('office', $xml_namespace_uri_of_office_namespace);

$descendants = $xpath->query('//office:text//node()');
foreach ($descendants as $node) {
    // $node is a DOMNode, for example a DOMElement or a DOMText
}
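
Applied to the conversion in your question, a loop like this could look roughly as follows. This is only a sketch under some assumptions: it iterates the element children of office:text (`//office:text/*`) instead of all descendants, it first swaps every text:s for a single space, and the sample XML, the variable names and the `class="explanation"` attribute are made up for illustration.

```php
<?php
// Sample input modelled on the question; namespace URIs are the standard ODF ones.
$xml = <<<'XML'
<office:body xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"
             xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">
  <office:text>
    <text:h text:style-name="P1" text:outline-level="2">Chapter 1</text:h>
    <text:p text:style-name="Standard">9.7:<text:s/>Text (97%)</text:p>
    <text:p text:style-name="Explanation">Further information.</text:p>
  </office:text>
</office:body>
XML;

$nsOffice = 'urn:oasis:names:tc:opendocument:xmlns:office:1.0';
$nsText   = 'urn:oasis:names:tc:opendocument:xmlns:text:1.0';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('office', $nsOffice);
$xpath->registerNamespace('text', $nsText);

// Replace every text:s inside a text:p with a single space.
// Copying the result to an array first is a defensive habit when the
// tree is modified during the loop.
foreach (iterator_to_array($xpath->query('//text:p//text:s')) as $space) {
    $space->parentNode->replaceChild($doc->createTextNode(' '), $space);
}

$html = '';
// Element children of office:text, in document order
foreach ($xpath->query('//office:text/*') as $node) {
    $text = htmlspecialchars($node->textContent);
    if ($node->localName === 'h') {
        // As in the question, only second-level headings are converted
        if ($node->getAttributeNS($nsText, 'outline-level') === '2') {
            $html .= "<h2>$text</h2>\n";
        }
    } elseif ($node->localName === 'p') {
        $style = $node->getAttributeNS($nsText, 'style-name');
        // Hypothetical handling of non-Standard paragraphs
        $class = $style === 'Standard' ? '' : ' class="explanation"';
        $html .= "<p$class>$text</p>\n";
    }
}
echo $html;
```

Using textContent drops the text:span wrappers and keeps only their text, which seems to be what you want here. If you need the inner markup instead, you would have to walk the child nodes of each text:p yourself.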

XPath does not guarantee this in general, but PHP's libxml-based libraries return the nodes in document order. That is the order you're looking for.

Compare with: XPath query result order
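
A quick way to see the order for yourself is to collect the node names of the result. Again just a sketch with a tiny, made-up sample document:

```php
<?php
$xml = '<office:text xmlns:office="urn:oasis:names:tc:opendocument:xmlns:office:1.0"'
     . ' xmlns:text="urn:oasis:names:tc:opendocument:xmlns:text:1.0">'
     . '<text:h>Title</text:h><text:p>9.7:<text:s/>Text</text:p></office:text>';

$doc = new DOMDocument();
$doc->loadXML($xml);

$xpath = new DOMXPath($doc);
$xpath->registerNamespace('office', 'urn:oasis:names:tc:opendocument:xmlns:office:1.0');

// Collect the name of every descendant in the order the query returns them
$names = [];
foreach ($xpath->query('//office:text//node()') as $node) {
    $names[] = $node->nodeName;
}
echo implode(' ', $names), PHP_EOL;
// text:h #text text:p #text text:s #text
```

Each element is followed directly by its own content, so the heading text comes before the next paragraph, just as it does in the file.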
