Parsing HTML with XPath and PHP

Question

Is there a way (using XPath and PHP) to do the following (WITHOUT external XSLT files)?

Remove all tables and their contents
Remove everything after the first h1 tag
Keep only paragraphs (INCLUDING their inner HTML (links, lists, etc))

I received an XSLT answer here, but I'm looking for XPATH queries that don't require external files.

Currently, I've got the HTML in question loaded into a SimpleXmlElement via:

$doc = @DOMDocument::loadHTML($xml);
$data = simplexml_import_dom($doc);

Now I need help with:

$data = $data->xpath('??????');

Been working with this one for several days to no avail. I really appreciate the help.

Edit: I don't particularly care what's inside the paragraphs, as I can use strip_tags to eliminate what I don't want. All I need to do is to isolate the paragraphs from the rest of the source. I suppose a more specific, accurate requirement would be this:

Return only paragraphs (and their html contents) that aren't contained in tables, and only before the first h1 tag

Edit 2:

I think I've gotten most of it with this:
$query = $xpath->query('//p[not(ancestor::table) and not(preceding::h2)]');

The only problem is the loss of the inner HTML.

Based on your assumption, is it alright to say `you just need <p>, including table inside a p`? — ajreal, Jan 04 '11 at 09:37
*(related)* [Best Methods to parse HTML](http://stackoverflow.com/questions/3577641/best-methods-to-parse-html/3577662#3577662) — Gordon, Jan 04 '11 at 09:43

Gordon · Accepted Answer · 2011-01-04T18:32:18.280

To just get all the P elements not within a table and only before the first h1, you can do

$xp = new DOMXPath($dom);
$expression = '//p[not(preceding::h1[1]) and not(ancestor::table)]';
foreach ($xp->query($expression) as $node) {
    echo $dom->saveXml($node);
}

In general, if you know where the first h1 sits in the document, a direct path to that element is faster than a // query, which searches the whole document. As an alternative, you could use the XPath given by Alejandro in the comments below:

/descendant::h1[1]/preceding::p[not(ancestor::table)]
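
To see how the two expressions relate, here is a minimal sketch that runs both against the same document and compares what they select. The sample markup and the `collectText` helper are made up for illustration and are not part of the original question.

```php
<?php
// Hypothetical sample markup, not taken from the question.
$html = '<html><body>'
      . '<p>Intro with a <a href="#">link</a>.</p>'
      . '<table><tr><td><p>Inside a table</p></td></tr></table>'
      . '<p>Second paragraph</p>'
      . '<h1>First heading</h1>'
      . '<p>After the heading</p>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$xp = new DOMXPath($dom);

// Tests every p in the document for a preceding h1.
$perParagraph = '//p[not(preceding::h1) and not(ancestor::table)]';
// Starts at the first h1 and walks backwards from there.
$fromHeading  = '/descendant::h1[1]/preceding::p[not(ancestor::table)]';

// Hypothetical helper: text content of every node an expression selects.
function collectText(DOMXPath $xp, string $expression): array
{
    $texts = [];
    foreach ($xp->query($expression) as $node) {
        $texts[] = $node->textContent;
    }
    return $texts;
}

$a = collectText($xp, $perParagraph);
$b = collectText($xp, $fromHeading);

// Sort before comparing so the check does not depend on result order.
sort($a);
sort($b);
var_dump($a === $b);
```

One difference worth knowing: when the document contains no h1 at all, the second expression selects nothing, whereas the first selects every paragraph outside a table.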

If you want to create a new DOM Document from the nodes in the source document, you have to import the nodes into a new document.

// src document
$dom = new DOMDocument;
$dom->loadXML($xml);

// dest document
$new = new DOMDocument;
$new->formatOutput = TRUE;

// xpath setup
$xp = new DOMXPath($dom);
$expr = '//p[not(preceding::h1[1]) and not(ancestor::table)]';

// importing nodes into dest document
foreach ($xp->query($expr) as $node) {
    $new->appendChild($new->importNode($node, TRUE));
}

// output dest document
echo $new->saveXML();

Some more additions

In your example, you used the error suppression operator. This is bad practice. If you want to disregard any parsing errors from DOM, use

libxml_use_internal_errors(TRUE); // catch any DOM errors with libxml
$dom = new DOMDocument;           // remove the @ as it is bad practice
$dom->loadXML($xhtml);            // use loadHTML if it's not valid XHTML
libxml_clear_errors();            // disregards any DOM related errors

Removing nodes with DOM always follows the same approach: find the node you want to remove, get its parentNode, and call removeChild on it with the node to be removed as the argument.

// getElementsByTagName returns a live list, so copy it before removing nodes
foreach (iterator_to_array($dom->getElementsByTagName('foo')) as $node) {
    $node->parentNode->removeChild($node);
}
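
Applied to the first requirement in the question (removing all tables and their contents), the same approach might look like the sketch below. The sample markup is made up for illustration. The list returned by getElementsByTagName is live, so removing nodes while iterating over it directly can skip elements; this sketch snapshots the list with iterator_to_array first.

```php
<?php
// Hypothetical sample markup for illustration only.
$html = '<html><body>'
      . '<p>Keep me</p>'
      . '<table><tr><td>one</td></tr></table>'
      . '<table><tr><td><p>two</p></td></tr></table>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Snapshot the live node list so removals cannot disturb the iteration.
$tables = iterator_to_array($dom->getElementsByTagName('table'));

foreach ($tables as $table) {
    $table->parentNode->removeChild($table);
}

$tablesLeft     = $dom->getElementsByTagName('table')->length;
$paragraphsLeft = $dom->getElementsByTagName('p')->length;
echo "tables left: $tablesLeft, paragraphs left: $paragraphsLeft\n";
```

Removing a table also removes everything inside it, including any paragraphs nested in its cells.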

You can also navigate to sibling nodes (and child nodes) without XPath. Here is how to remove all following siblings of the first h1 element:

$firstH1 = $dom->getElementsByTagName('h1')->item(0);
while ($firstH1->nextSibling !== NULL) {
    $firstH1->parentNode->removeChild($firstH1->nextSibling);
}
echo $dom->saveXml();

Removing nodes from the DOMDocument affects the DOMDocument immediately. In the code above, we always query for the first following sibling of the first h1. If there is one, it is removed from the DOMDocument. nextSibling will then point to the sibling after the one just removed (if any).
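
A minimal, self-contained sketch of that loop on a made-up document follows. The sample markup, the `$removed` counter, and the null check are additions for illustration; the null check is there because item(0) returns null when the document has no h1.

```php
<?php
// Hypothetical sample markup for illustration only.
$html = '<html><body>'
      . '<p>Before</p>'
      . '<h1>Title</h1>'
      . '<p>After one</p>'
      . '<div><p>After two</p></div>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

$firstH1 = $dom->getElementsByTagName('h1')->item(0);

$removed = 0;
if ($firstH1 !== null) {
    while ($firstH1->nextSibling !== null) {
        // nextSibling is re-read on every pass, so it always names the
        // node that currently follows the h1 in the live document.
        $firstH1->parentNode->removeChild($firstH1->nextSibling);
        $removed++;
    }
}

$remaining = $dom->getElementsByTagName('p')->length;
echo "removed $removed sibling(s), $remaining paragraph(s) left\n";
```

Note that this only removes siblings of the h1. If the h1 is nested inside another element, content that follows that parent element stays in place.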

Fetching and printing all paragraphs is equally easy. To get the outerXML, just pass the node for which you want the outerXML to the saveXML method.

foreach ($dom->getElementsByTagName('p') as $paragraph)
{
    echo $dom->saveXml($paragraph);
}
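
If you only want what is inside a paragraph, without the wrapping p tags, one possible approach is to serialize each child node and join the results. The `innerXml` helper below is a hypothetical name and not part of the DOM API, and the sample markup is made up; treat it as a sketch rather than the only way to do this.

```php
<?php
// Hypothetical sample markup for illustration only.
$html = '<html><body>'
      . '<p>See <a href="/docs">the docs</a> and <em>notes</em>.</p>'
      . '</body></html>';

libxml_use_internal_errors(true);   // collect parser warnings instead of printing them
$dom = new DOMDocument;
$dom->loadHTML($html);
libxml_clear_errors();

// Hypothetical helper: serialize every child of $node and concatenate.
function innerXml(DOMNode $node): string
{
    $inner = '';
    foreach ($node->childNodes as $child) {
        $inner .= $node->ownerDocument->saveXML($child);
    }
    return $inner;
}

$paragraph = $dom->getElementsByTagName('p')->item(0);

$outer = $dom->saveXML($paragraph); // includes the surrounding <p> element
$inner = innerXml($paragraph);      // only the paragraph's contents

echo $outer, "\n", $inner, "\n";
```

Because the links and other inline elements are serialized as markup, they survive intact, and strip_tags can still be applied afterwards to drop whatever is not wanted.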

Anyway, that should get you going. I suggest you familiarize yourself with the DOM API. It's not difficult. You will find that most of what you do revolves around the properties and methods of DOMDocument, DOMNode, and DOMElement (which is a subclass of DOMNode).

+1 Good answer. Maybe `/descendant::h1[1]/preceding::p[not(ancestor::table)]` would be faster (not testing all the precedings for each `p`) — , Jan 04 '11 at 18:19
@Alejandro thanks. Yeah, that might be faster. I've added it as an alternative to the answer above — Gordon, Jan 04 '11 at 18:32

score 0 · Answer 2 · answered Jan 04 '11 at 10:47

Thank you, Gordon.

The solution:

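    libxml_use_internal_errors(true); // collect parse warnings instead of using @
    $dom = new DOMDocument;
    $dom->loadHTML($xml);             // loadHTML is an instance method, not a static one
    libxml_clear_errors();

    $xpath = new DOMXPath($dom);
    $query = $xpath->query('//p[
        not(ancestor::table) and
        not(preceding::h1[1])
        ]');

    $result = '';                     // initialize before appending
    foreach ($query as $node) {
        $result .= $dom->saveXml($node);
    }

    echo $result;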
