I'm trying to learn how to use PHP's DOM functions. As an exercise, I want to repair an invalid HTML fragment. So far, I've been able to produce a full document:

<?php

$fragment = '<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class=foo>luptate</strong></em>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.';

$doc = new DOMDocument;
libxml_use_internal_errors(TRUE);
$doc->loadHTML($fragment);
libxml_use_internal_errors(FALSE);
$doc->formatOutput = TRUE;
echo $doc->saveHTML();

?>

... which prints:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class="foo">luptate</em></strong>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.</div>
</div></body></html>

My questions:

  1. Is there a way to print only the HTML that corresponds to the original fragment?
  2. Is there a more appropriate built-in library for such a task?
Álvaro González

4 Answers

This should work, but it's a bit ugly:

$doc->loadHTML($fragment);
echo simplexml_import_dom( $doc->getElementsByTagName('div')->item(0) )->asXML();

output:

<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
  <strong><em class="foo">luptate</em></strong>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.</div></div>

A slightly more elegant approach:

$xpath   = new DOMXPath($doc);
// loadHTML() always wraps the fragment in <html><body>...
$query   = '/html/body/*';
$entries = $xpath->query($query);
foreach ($entries as $entry)
{
  // Serialize each top-level element found inside <body>
  echo simplexml_import_dom($entry)->asXML();
}
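
For reference, here is a self-contained sketch of the XPath variant above. The function name `fragment_via_simplexml` is made up for this example, and the short test fragment is an assumption, not the asker's input. Note that `/html/body/*` only selects element nodes, so any bare text sitting directly under `<body>` would be skipped.

```php
<?php
// Hypothetical helper (not a built-in): parse a fragment and return
// the markup of every element that ends up directly inside <body>.
function fragment_via_simplexml($fragment)
{
    $doc = new DOMDocument;

    // Collect parser warnings internally instead of printing them.
    libxml_use_internal_errors(true);
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors(false);

    $xpath = new DOMXPath($doc);
    $out = '';
    // Element children of <body> only; top-level text nodes are not matched.
    foreach ($xpath->query('/html/body/*') as $entry) {
        $out .= simplexml_import_dom($entry)->asXML();
    }
    return $out;
}

echo fragment_via_simplexml('<div class=a>one <em>two</div>');
```

Because SimpleXML serializes as XML rather than HTML, details such as empty elements may be written differently from what `saveHTML()` would produce.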
ajreal
  • I presume there isn't a direct mechanism (e.g., a `DOMNode::outerHTML()` method) and you have to write your own. First method assumes a specific structure but the second one works well (though I'm getting some new line chars converted to HTML entities, which is not strictly wrong but it's ugly). – Álvaro González Dec 30 '10 at 12:09
  • Agreed. This may or may not be helpful: [preserveWhiteSpace](http://www.php.net/manual/en/class.domdocument.php#domdocument.props.preservewhitespace) – ajreal Dec 30 '10 at 12:16

It seems that PHP 5.3.6 and later finally implement this, since `DOMDocument::saveHTML()` now accepts an optional node argument:

How to return outer html of DOMDocument?

That way we can do this:

if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->documentElement->firstChild;
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}

... or this:

if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->getElementsByTagName('body')->item(0);
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}

Too bad we still need an ugly workaround for older versions.
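
As a rough sketch, the version check and a fallback could be combined in one helper. The name `repair_fragment` is hypothetical, and falling back to `saveXML($node)` on older versions is my assumption rather than something tested on PHP < 5.3.6; it emits XML-style serialization, so the result is not guaranteed to match `saveHTML()` byte for byte.

```php
<?php
// Hypothetical helper (not a built-in): return the repaired markup of an
// HTML fragment without the doctype/<html>/<body> wrapper.
function repair_fragment($fragment)
{
    $dom = new DOMDocument;

    // Silence parser warnings while loading, then restore the old setting.
    $previous = libxml_use_internal_errors(true);
    $dom->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $dom->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    // saveHTML() accepts a node argument from PHP 5.3.6 onwards;
    // saveXML($node) is used here as an assumed fallback for older versions.
    $useSaveHtml = version_compare(PHP_VERSION, '5.3.6', '>=');

    $html = '';
    foreach ($body->childNodes as $node) {
        $html .= $useSaveHtml ? $dom->saveHTML($node) : $dom->saveXML($node);
    }
    return $html;
}

echo repair_fragment('<p>Hello <b>world</p>');
```

Iterating over `childNodes` rather than an XPath `*` query also keeps any text nodes that sit directly under `<body>`.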

Álvaro González

You could use string replacement to strip the parts you don't want, since they always appear in the output:

$result = $doc->saveHTML();
// saveHTML() puts a newline after the doctype, so strip each piece separately.
$result = str_replace('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">', '', $result);
$result = str_replace(array('<html><body>', '</body></html>'), '', $result);
$result = trim($result);

You could always try this class:

http://www.barattalo.it/html-fixer/

Which apparently will be as easy as this:

$dirty_html = ".....bad html here......";

$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);

It all depends on what you will be doing with the information.

Flipper
  • Processing the final output with string functions kind of defeats the purpose of using DOM ;-) Thanks for the link but, as I said, it's only an exercise so I can learn. – Álvaro González Dec 30 '10 at 10:05

Well, PHP >= 5.1 apparently also has a DOMDocumentFragment class, which has an appendXML() method: http://php.net/manual/en/domdocumentfragment.appendxml.php. Maybe you can use that? I'm not sure whether it has a string representation of itself, but who knows.

EDIT:

Well, that doesn't work :)

What you could do, though, is use SimpleXML, either directly or by creating a DOMElement and then using simplexml_import_dom($domelement)->asXML(): http://php.net/manual/en/function.simplexml-import-dom.php. Good luck! :)
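
On the DOMDocumentFragment idea: one likely reason it fails here is that `appendXML()` parses its argument as XML, so it expects well-formed markup and cannot repair mis-nested tags the way `loadHTML()` does. A minimal sketch to check this, using part of the question's fragment (the variable names are only for illustration):

```php
<?php
$doc = new DOMDocument;

// Keep libxml parse errors out of the output while experimenting.
libxml_use_internal_errors(true);

// Mis-nested tags, as in the original fragment.
$bad = $doc->createDocumentFragment();
$badOk = $bad->appendXML('<strong><em>luptate</strong></em>');
libxml_clear_errors();

// The same markup with the closing tags in the right order.
$good = $doc->createDocumentFragment();
$goodOk = $good->appendXML('<strong><em>luptate</em></strong>');

libxml_use_internal_errors(false);

// appendXML() returns a boolean telling whether the markup was accepted.
var_dump($badOk, $goodOk);
```

So the fragment class looks useful for inserting markup that is already valid, but not for repairing broken input.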

Spiny Norman
  • Yep, DocumentFragment looks promising but I couldn't make use of it either. And SimpleXML generates a full document as well as far as I could determine. – Álvaro González Dec 30 '10 at 10:09