To get only the p elements that are not inside a table and that come before the first h1, you can do:
$xp = new DOMXPath($dom);
$expression = '//p[not(preceding::h1[1]) and not(ancestor::table)]';
foreach ($xp->query($expression) as $node) {
    echo $dom->saveXml($node);
}
In general, if you know the position of the first h1 in the document, it is faster to use a direct path to that element instead of a // query, which searches the entire document. As an alternative, you could also use the XPath given by Alejandro in the comments below:
/descendant::h1[1]/preceding::p[not(ancestor::table)]
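To check that the two expressions pick the same nodes, a small comparison can help. The following is only a sketch: the sample markup in $xml is made up for illustration, and the results are sorted before comparing so that the check does not depend on the order in which nodes are returned.

```php
<?php
// Hypothetical sample document, used only for illustration
$xml = <<<XML
<body>
  <p>intro</p>
  <table><tr><td><p>in table</p></td></tr></table>
  <p>second</p>
  <h1>Title</h1>
  <p>after heading</p>
</body>
XML;

$dom = new DOMDocument;
$dom->loadXML($xml);
$xp = new DOMXPath($dom);

$expressions = [
    '//p[not(preceding::h1[1]) and not(ancestor::table)]',
    '/descendant::h1[1]/preceding::p[not(ancestor::table)]',
];

// collect the text of every matched p, once per expression
$results = [];
foreach ($expressions as $expression) {
    $texts = [];
    foreach ($xp->query($expression) as $node) {
        $texts[] = $node->textContent;
    }
    sort($texts); // compare as sets, independent of result order
    $results[] = $texts;
}

var_dump($results[0] === $results[1]);
```

The two expressions are not interchangeable in every case. If the document has no h1 at all, the first one returns every p outside a table, while the second returns nothing because there is no h1 to start from.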
If you want to create a new DOMDocument from nodes in the source document, you have to import those nodes into the new document first:
// src document
$dom = new DOMDocument;
$dom->loadXML($xml);
// dest document
$new = new DOMDocument;
$new->formatOutput = TRUE;
// xpath setup
$xp = new DOMXPath($dom);
$expr = '//p[not(preceding::h1[1]) and not(ancestor::table)]';
// importing nodes into dest document
foreach ($xp->query($expr) as $node) {
    $new->appendChild($new->importNode($node, TRUE));
}
// output dest document
echo $new->saveXML();
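The loop above appends each imported p directly to the new document, so the output can end up with several top-level elements. If you need a single root element, one option is to append the imported nodes to a container instead. A minimal sketch, assuming a made-up root element name (extract) and made-up sample markup:

```php
<?php
// Hypothetical sample markup, for illustration only
$xml = '<body><p>one</p><table><tr><td><p>cell</p></td></tr></table>'
     . '<p>two</p><h1>Title</h1><p>three</p></body>';

// src document
$dom = new DOMDocument;
$dom->loadXML($xml);

// dest document with a single container element;
// "extract" is an invented name, pick whatever suits your data
$new = new DOMDocument;
$root = $new->appendChild($new->createElement('extract'));

$xp = new DOMXPath($dom);
$expr = '//p[not(preceding::h1[1]) and not(ancestor::table)]';
foreach ($xp->query($expr) as $node) {
    // TRUE requests a deep copy, so the children of each p come along
    $root->appendChild($new->importNode($node, TRUE));
}

$result = $new->saveXML($root);
echo $result;
```

importNode copies the node, so the source document is left unchanged.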
Some more additions
In your example, you used the error suppression operator. This is bad practice. If you want to disregard any parsing errors from DOM, use:
libxml_use_internal_errors(TRUE); // collect libxml errors internally instead of emitting warnings
$dom = new DOMDocument; // remove the @ as it is bad practice
$dom->loadXML($xhtml); // use loadHTML if it's not valid XHTML
libxml_clear_errors(); // discard the collected errors
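If you would rather inspect the parsing errors than discard them, you can read them with libxml_get_errors() before clearing the buffer. A minimal sketch, using deliberately broken markup that is made up for this example:

```php
<?php
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
// deliberately malformed input (the p element is never closed)
$loaded = $dom->loadXML('<root><p>unclosed</root>');

// each entry is a LibXMLError object with level, code, line and message
$errors = libxml_get_errors();
foreach ($errors as $error) {
    printf("line %d: %s\n", $error->line, trim($error->message));
}

// empty the error buffer once the errors have been handled
libxml_clear_errors();
```

loadXML returns FALSE when the input cannot be parsed, so $loaded can be checked in addition to the error list.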
Removing nodes with DOM always follows the same approach. Find the node you want to remove, get to its parentNode and call removeChild on it with the node to be removed as the argument.
// getElementsByTagName returns a live list, so copy it before removing nodes
foreach (iterator_to_array($dom->getElementsByTagName('foo')) as $node) {
    $node->parentNode->removeChild($node);
}
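One caveat when removing in a loop: the list returned by getElementsByTagName is live, meaning it shrinks as matching nodes are removed, so iterating over it directly while removing can skip elements. A minimal sketch of one way around that, assuming made-up sample markup, is to copy the list into a plain array first:

```php
<?php
// Hypothetical sample markup, for illustration only
$dom = new DOMDocument;
$dom->loadXML('<root><foo/><foo/><foo/><foo/><bar/></root>');

// snapshot the live DOMNodeList into an ordinary array
$nodes = iterator_to_array($dom->getElementsByTagName('foo'));

foreach ($nodes as $node) {
    // the array is unaffected by the removals, so no node is skipped
    $node->parentNode->removeChild($node);
}

$remaining = $dom->getElementsByTagName('foo')->length;
echo $dom->saveXML($dom->documentElement);
```

Iterating the list backwards by index is another common way to achieve the same thing.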
You can also navigate to sibling nodes (and child nodes) without XPath. Here is how to remove all siblings that follow the first h1 element:
$firstH1 = $dom->getElementsByTagName('h1')->item(0);
while ($firstH1->nextSibling !== NULL) {
    $firstH1->parentNode->removeChild($firstH1->nextSibling);
}
echo $dom->saveXml();
Removing nodes from the DOMDocument affects the DOMDocument immediately. In the code above, we are always querying for the first following sibling of the first h1. If there is one, it is removed from the DOMDocument. nextSibling will then point to the sibling after the one just removed (if any).
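The sibling-removal loop can be wrapped in a small function with a guard for documents that contain no h1, since item(0) returns NULL in that case. This is only a sketch: removeFollowingSiblings is a hypothetical helper name and the markup is invented for the example.

```php
<?php
// Hypothetical helper: removes every sibling that follows $node
// and returns how many nodes were removed.
function removeFollowingSiblings(DOMNode $node): int
{
    $removed = 0;
    // nextSibling is re-read on every pass, so it always refers to
    // the node that now directly follows $node
    while ($node->nextSibling !== null) {
        $node->parentNode->removeChild($node->nextSibling);
        $removed++;
    }
    return $removed;
}

// Hypothetical sample markup, for illustration only
$dom = new DOMDocument;
$dom->loadXML('<body><p>a</p><h1>T</h1><p>b</p><p>c</p></body>');

$firstH1 = $dom->getElementsByTagName('h1')->item(0);
// item(0) is NULL when the document has no h1, so check before using it
$count = $firstH1 !== null ? removeFollowingSiblings($firstH1) : 0;

echo $dom->saveXML($dom->documentElement);
```

Keep in mind that siblings include text nodes, so whitespace between elements is removed as well, and that only siblings of the h1 are affected, not nodes that follow it elsewhere in the tree.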
Fetching and printing all paragraphs is equally easy. To get the outerXML, just pass the node for which you want the outerXML to the saveXML method.
foreach ($dom->getElementsByTagName('p') as $paragraph)
{
    echo $dom->saveXml($paragraph);
}
Anyway, that should get you going. I suggest you familiarize yourself with the DOM API. It's not difficult. You will find that most of the things you will do revolve around the properties and methods of DOMDocument, DOMNode and DOMElement (which is a subclass of DOMNode).