Fix HTML fragment

Question

I'm trying to learn how to use PHP's DOM functions. As an exercise, I want to repair an invalid HTML fragment. So far, I've been able to produce a full document:

<?php

$fragment = '<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class=foo>luptate</strong></em>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.';

$doc = new DOMDocument;
libxml_use_internal_errors(TRUE);
$doc->loadHTML($fragment);
libxml_use_internal_errors(FALSE);
$doc->formatOutput = TRUE;
echo $doc->saveHTML();

?>

... which prints:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
<html><body><div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class="foo">luptate</em></strong>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.</div>
</div></body></html>

My questions:

Is there a way to print only the HTML that corresponds to the original fragment?
Is there a more appropriate built-in library for such a task?

ajreal · Accepted Answer · 2010-12-30T10:38:38.793

1

This should work, but it's a bit ugly:

$doc->loadHTML($fragment);
echo simplexml_import_dom( $doc->getElementsByTagName('div')->item(0) )->asXML();

output:

<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
  <strong><em class="foo">luptate</em></strong>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.</div></div>

Slightly more elegant:

$xpath   = new DOMXPath($doc);
$query   = '/html/body/*';        // loadHTML() always wraps the input in <html><body>...
$entries = $xpath->query($query);
foreach ($entries as $entry)
{
  echo simplexml_import_dom($entry)->asXML();
}
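
As a variation on the loop above, the SimpleXML round trip may not be needed: `DOMDocument::saveXML()` accepts a node argument and serializes just that node. The sketch below reuses the fragment from the question; the variable name `$result` is only illustrative. Note that the output is XML serialization, so empty elements would come out as `<br/>` rather than `<br>`.

```php
<?php
// Sketch: dump each element child of <body> with saveXML($node) instead of SimpleXML.
$fragment = '<div style="font-weight: bold">Lorem ipsum <div>dolor sit amet,
    <strong><em class=foo>luptate</strong></em>. Excepteur proident,
    <div class="bar">sunt in culpa</div> officia est laborum.';

$doc = new DOMDocument;
libxml_use_internal_errors(true);    // collect parser warnings instead of printing them
$doc->loadHTML($fragment);
libxml_clear_errors();
libxml_use_internal_errors(false);

$xpath  = new DOMXPath($doc);
$result = '';
foreach ($xpath->query('/html/body/*') as $node) {
    // saveXML() with a node argument serializes that node only, not the whole document
    $result .= $doc->saveXML($node);
}

echo $result, "\n";
```

The query `/html/body/*` matches element nodes only, so any bare text sitting directly under `<body>` would be skipped; iterating over `childNodes` instead would include it.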

edited Dec 30 '10 at 10:38

answered Dec 30 '10 at 10:26

I presume there isn't a direct mechanism (e.g., a `DOMNode::outerHTML()` method) and you have to write your own. First method assumes a specific structure but the second one works well (though I'm getting some new line chars converted to HTML entities, which is not strictly wrong but it's ugly). – Álvaro González Dec 30 '10 at 12:09
Agreed. This may or may not be helpful: [preserveWhiteSpace](http://www.php.net/manual/en/class.domdocument.php#domdocument.props.preservewhitespace) – ajreal Dec 30 '10 at 12:16

score 1 · Answer 2 · edited May 23 '17 at 12:04

It seems that the latest PHP versions finally implement this:

How to return outer html of DOMDocument?

That way we can do this:

if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->documentElement->firstChild;
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}

... or this:

if( version_compare(PHP_VERSION, '5.3.6', '>=') ){
    $body = $dom->getElementsByTagName('body')->item(0);
    if( $body->hasChildNodes() ){
        foreach($body->childNodes as $node){
            echo $dom->saveHTML($node);
        }
    }
}

Too bad we still need an ugly workaround for older versions.
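
For what it's worth, the two branches can be folded into one helper. This is only a sketch: `repairHtmlFragment()` is a made-up name, not a PHP built-in, and the fallback for older versions uses `saveXML($node)`, which produces XML-style output rather than HTML.

```php
<?php
// Hypothetical helper (not part of PHP): returns the repaired markup for a fragment.
function repairHtmlFragment($fragment)
{
    $doc = new DOMDocument;
    $previous = libxml_use_internal_errors(true);   // silence parser warnings
    $doc->loadHTML($fragment);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);          // restore the caller's setting

    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';   // nothing was parsed into <body>
    }

    // saveHTML() only accepts a node argument from PHP 5.3.6 onwards
    $nodeArgSupported = version_compare(PHP_VERSION, '5.3.6', '>=');

    $html = '';
    foreach ($body->childNodes as $node) {
        $html .= $nodeArgSupported ? $doc->saveHTML($node) : $doc->saveXML($node);
    }
    return $html;
}

echo repairHtmlFragment('<div>Lorem <strong><em class=foo>ipsum</strong></em>'), "\n";
```

Looking up `<body>` by tag name rather than via `documentElement->firstChild` avoids assuming that the parser did not create a `<head>` first.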

score 0 · Answer 3 · answered Dec 30 '10 at 09:30

You could use string replacement to strip the wrapper that is always added, such as:

$result = $doc->saveHTML();
// The doctype and <html><body> are separated by a newline in the output, so remove each piece separately.
$result = str_replace('<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">', '', $result);
$result = str_replace(array('<html><body>', '</body></html>'), '', $result);
$result = trim($result);

You could always try this class:

http://www.barattalo.it/html-fixer/

Using it is apparently as easy as this:

$dirty_html = ".....bad html here......";

$a = new HtmlFixer();
$clean_html = $a->getFixedHtml($dirty_html);

It all depends on what you will be doing with the information.

Processing the final output with string functions kind of defeats the purpose of using DOM ;-) Thanks for the link but, as I said, it's only an exercise so I can learn. – Álvaro González Dec 30 '10 at 10:05

Spiny Norman · Answer 4 · 2010-12-30T09:40:11.200

0

Well, PHP >= 5.1 apparently also has a DocumentFragment, which has an appendXML function: http://php.net/manual/en/domdocumentfragment.appendxml.php. Maybe you can use that? I'm not sure if it has a string representation of itself, but who knows.

EDIT:

Well, that doesn't work :)

What you could do, though, is use SimpleXML, either directly or by creating a DOMElement and then using simplexml_import_dom($domelement)->asXML(): http://php.net/manual/en/function.simplexml-import-dom.php. Good luck! :)
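
A likely reason `appendXML()` is no help here is that it expects well-formed XML, so it rejects the broken markup instead of repairing it. A minimal sketch of that behaviour, using a piece of the question's fragment (the variable names are only illustrative):

```php
<?php
$doc = new DOMDocument;

libxml_use_internal_errors(true);    // keep parser errors out of the output

// Mis-nested tags and an unquoted attribute: not well-formed XML.
$brokenFragment = $doc->createDocumentFragment();
$broken = @$brokenFragment->appendXML('<strong><em class=foo>luptate</strong></em>');

// The same content written as well-formed XML.
$validFragment = $doc->createDocumentFragment();
$valid = $validFragment->appendXML('<strong><em class="foo">luptate</em></strong>');

libxml_clear_errors();
libxml_use_internal_errors(false);

var_dump($broken);   // false: the XML parser refuses the malformed input
var_dump($valid);    // true: well-formed input is accepted
```

So the fragment API only works once the markup is already valid, which is the opposite of what the exercise needs.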

edited Dec 30 '10 at 09:40

answered Dec 30 '10 at 09:31

Yep, DocumentFragment looks promising but I couldn't make use of it either. And SimpleXML generates a full document as well as far as I could determine. – Álvaro González Dec 30 '10 at 10:09
