I need to load some arbitrary HTML into an existing DOMDocument tree. Previous answers suggest using DOMDocumentFragment and its appendXML method to handle this.

As @Owlvark indicates in the comments, XML is not HTML, so this is not a good solution.

The main issue I had was that entities like `&ndash;` caused errors, because the appendXML method expects well-formed XML.

We could define the entities, but that doesn't address the underlying problem: not all HTML is valid XML.
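
For reference, a minimal sketch that reproduces the failure. The `<root/>` document and the sample strings are made up for illustration, and `libxml_use_internal_errors()` is used here only to keep the parser warnings out of the output, not as a fix.

```php
<?php
// Hypothetical sample document; only the bundled DOM extension is assumed.
libxml_use_internal_errors(true); // collect parser errors instead of emitting warnings

$doc = new DOMDocument();
$doc->loadXML('<root/>');

$fragment = $doc->createDocumentFragment();

// &ndash; is an HTML entity that XML does not predefine, so the chunk is rejected.
$okEntity = $fragment->appendXML('<p>10 &ndash; 20</p>');

// An unclosed <br> is fine in HTML but not well-formed XML.
$okVoid = $fragment->appendXML('<p>line one<br>line two</p>');

// Both calls are expected to report failure.
var_dump($okEntity, $okVoid);
libxml_clear_errors();
```

Neither input is exotic HTML, which is why defining a handful of entities would only cover part of the problem.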

What is a good solution for importing HTML into a DOMDocument tree?

wmarbut
  • You might just have to turn on `libxml_use_internal_errors()` and ignore it... Also, you're loading the document using `DomDocument::loadHtml()` right? – Frank Farmer Sep 11 '12 at 19:38
  • @FrankFarmer, the internal errors just suppresses the errors visually or from your error handler, it does nothing to actually resolve the issue. As for `loadHtml`, I am not. I am using the [`DOMDocumentFragment::appendXML`](http://www.php.net/manual/en/domdocumentfragment.appendxml.php) – wmarbut Sep 11 '12 at 19:41
  • See [this answer](http://stackoverflow.com/questions/4645738/domdocument-appendxml-with-special-characters) - HTML is not XML – Owlvark Sep 11 '12 at 19:44
  • @Owlvark joy, that explains the error... but it also doesn't provide a viable solution. – wmarbut Sep 11 '12 at 19:48
  • You have been given two "solutions" (suppressing errors, defining entities), what makes them not "viable"?.. – salathe Sep 11 '12 at 20:09
  • @salathe I don't view suppressing errors as a solution so much as a hack, but I guess it depends on your point of view. The defining entities seems to be out of the way, but yes it is viable. Thanks FrankFarmer and Owlvark for your contributions! – wmarbut Sep 11 '12 at 20:44

1 Answer

The solution I came up with is to use `DOMDocument::loadHTML`, as @FrankFarmer suggests, and then import the parsed nodes into my current document. My implementation looks like this:

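```php
/**
 * Parses HTML into DOM nodes owned by the given document.
 * @param string $html the raw html to transform
 * @param \DOMDocument $doc the document to import the nodes into
 * @return array an array of DOMNodes on success or an empty array on failure
 */
protected function htmlToDOM($html, $doc) {
    // Wrap the input so its top-level nodes can be located after parsing.
    $html = '<div id="html-to-dom-input-wrapper">' . $html . '</div>';
    // loadHTML() is an instance method; calling it statically is an error in PHP 8.
    $hdoc = new DOMDocument();
    $loaded = $hdoc->loadHTML($html);
    $child_array = array();
    $wrapper = $loaded ? $hdoc->getElementById('html-to-dom-input-wrapper') : null;
    if ($wrapper === null) {
        // A missing wrapper does not throw, so check for it explicitly.
        error_log('htmlToDOM: could not parse the HTML input', 0);
        return $child_array;
    }
    foreach ($wrapper->childNodes as $child) {
        // Deep-copy each parsed node (elements and text) into the target document.
        $child_array[] = $doc->importNode($child, true);
    }
    return $child_array;
}
```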
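
A usage sketch may help here, since `importNode` only creates copies that belong to the target document and they still have to be attached somewhere. The standalone function `importHtml()`, the `<target/>` element, and the sample markup below are hypothetical and not part of the original implementation.

```php
<?php
// Standalone sketch of the same approach; importHtml() is a hypothetical name.
function importHtml(string $html, DOMDocument $doc): array
{
    $hdoc = new DOMDocument();
    $hdoc->loadHTML('<div id="wrapper">' . $html . '</div>');

    $wrapper = $hdoc->getElementById('wrapper');
    $nodes = [];
    if ($wrapper !== null) {
        foreach ($wrapper->childNodes as $child) {
            $nodes[] = $doc->importNode($child, true); // deep copy owned by $doc
        }
    }
    return $nodes;
}

// Hypothetical target document with a placeholder element.
$doc = new DOMDocument();
$doc->loadXML('<root><target/></root>');
$target = $doc->getElementsByTagName('target')->item(0);

// The input mixes an HTML entity, a void element, and bare text.
foreach (importHtml('<p>10 &ndash; 20</p><br>tail', $doc) as $node) {
    $target->appendChild($node); // imported nodes are detached until appended
}

echo $doc->saveXML();
```

Because the HTML parser resolves `&ndash;` and the unclosed `<br>` before anything is imported, the target document only ever receives nodes, never markup that it has to re-parse as XML.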
wmarbut