10

I'm manipulating a short HTML snippet with XPath; when I output the changed snippet back with $doc->saveHTML(), a DOCTYPE gets added, and HTML / BODY tags wrap the output. I want to remove those but keep all the children inside, using only the DOMDocument functions. For example:

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here
echo htmlentities( $doc->saveHTML() );

This produces:

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" ...>
<html><body>
<p><strong>Title...</strong></p>
<a href="http://www....."><img src="http://" alt=""></a>
<p>...to be one of those crowning achievements...</p>
</body></html>

I've attempted some of the simple tricks, such as:

# removes doctype
$doc->removeChild($doc->firstChild);

# <body> replaces <html>
$doc->replaceChild($doc->firstChild->firstChild, $doc->firstChild); 

So far that only removes the DOCTYPE and replaces HTML with BODY. What remains is a body element containing a variable number of child elements.

How do I remove the <body> tag but keep all of its children, given that their structure varies, in a neat, clean way using PHP's DOM manipulation?

pp19dd
  • This is extremely easy to do with XSLT. Are you interested in an XSLT solution? – Dimitre Novatchev May 22 '12 at 13:22
  • Possible duplicate of [How to saveHTML of DOMDocument without HTML wrapper?](https://stackoverflow.com/questions/4879946/how-to-savehtml-of-domdocument-without-html-wrapper) – miken32 May 16 '19 at 03:32

5 Answers

17

UPDATE

Here's a version that doesn't extend DOMDocument, though I think extending is the proper approach, since you're trying to achieve functionality that isn't built into the DOM API.

Note: I'm interpreting "clean" and "without workarounds" as keeping all manipulation to the DOM API. As soon as you hit string manipulation, that's workaround territory.

What I'm doing, just as in the original answer, is leveraging DOMDocumentFragment to manipulate multiple nodes all sitting at the root level. There is no string manipulation going on, which to me qualifies as not being a workaround.

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');

// Remove doctype node
$doc->doctype->parentNode->removeChild($doc->doctype);

// Remove html element, preserving child nodes
$html = $doc->getElementsByTagName("html")->item(0);
$fragment = $doc->createDocumentFragment();
while ($html->childNodes->length > 0) {
    $fragment->appendChild($html->childNodes->item(0));
}
$html->parentNode->replaceChild($fragment, $html);

// Remove body element, preserving child nodes
$body = $doc->getElementsByTagName("body")->item(0);
$fragment = $doc->createDocumentFragment();
while ($body->childNodes->length > 0) {
    $fragment->appendChild($body->childNodes->item(0));
}
$body->parentNode->replaceChild($fragment, $body);

// Output results
echo htmlentities($doc->saveHTML());

ORIGINAL ANSWER

This solution is rather lengthy, but it's because it goes about it by extending the DOM in order to keep your end code as short as possible.

sliceOutNode is where the magic happens. Let me know if you have any questions:

<?php

class DOMDocumentExtended extends DOMDocument
{
    public function __construct( $version = "1.0", $encoding = "UTF-8" )
    {
        parent::__construct( $version, $encoding );

        $this->registerNodeClass( "DOMElement", "DOMElementExtended" );
    }

    // This method will need to be removed once PHP supports LIBXML_NOXMLDECL
    public function saveXML( DOMNode $node = NULL, $options = 0 )
    {
        $xml = parent::saveXML( $node, $options );

        if( $options & LIBXML_NOXMLDECL )
        {
            $xml = $this->stripXMLDeclaration( $xml );
        }

        return $xml;
    }

    public function stripXMLDeclaration( $xml )
    {
        return preg_replace( "|<\?xml(.+?)\?>[\n\r]?|i", "", $xml );
    }
}

class DOMElementExtended extends DOMElement
{
    public function sliceOutNode()
    {
        $nodeList = new DOMNodeListExtended( $this->childNodes );
        $this->replaceNodeWithNode( $nodeList->toFragment( $this->ownerDocument ) );
    }

    public function replaceNodeWithNode( DOMNode $node )
    {
        return $this->parentNode->replaceChild( $node, $this );
    }
}

class DOMNodeListExtended extends ArrayObject
{
    public function __construct( $mixedNodeList )
    {
        parent::__construct( array() );

        $this->setNodeList( $mixedNodeList );
    }

    private function setNodeList( $mixedNodeList )
    {
        if( $mixedNodeList instanceof DOMNodeList )
        {
            $this->exchangeArray( array() );

            foreach( $mixedNodeList as $node )
            {
                $this->append( $node );
            }
        }
        elseif( is_array( $mixedNodeList ) )
        {
            $this->exchangeArray( $mixedNodeList );
        }
        else
        {
            throw new DOMException( "DOMNodeListExtended only supports a DOMNodeList or array as its constructor parameter." );
        }
    }

    public function toFragment( DOMDocument $contextDocument )
    {
        $fragment = $contextDocument->createDocumentFragment();

        foreach( $this as $node )
        {
            $fragment->appendChild( $contextDocument->importNode( $node, true ) );
        }

        return $fragment;
    }

    // Built-in methods of the original DOMNodeList

    public function item( $index )
    {
        return $this->offsetGet( $index );
    }

    public function __get( $name )
    {
        switch( $name )
        {
            case "length":
                return $this->count();
            break;
        }

        return false;
    }
}

// Load HTML/XML using our fancy DOMDocumentExtended class
$doc = new DOMDocumentExtended();
$doc->loadHTML('<p><strong>Title...</strong></p><a href="http://www....."><img src="http://" alt=""></a><p>...to be one of those crowning achievements...</p>');

// Remove doctype node
$doc->doctype->parentNode->removeChild( $doc->doctype );

// Slice out html node
$html = $doc->getElementsByTagName("html")->item(0);
$html->sliceOutNode();

// Slice out body node
$body = $doc->getElementsByTagName("body")->item(0);
$body->sliceOutNode();

// Pick your poison: XML or HTML output
echo htmlentities( $doc->saveXML( NULL, LIBXML_NOXMLDECL ) );
echo htmlentities( $doc->saveHTML() );
matb33
  • I was hoping there was a cleaner solution that is both reliable and simple, but, it seems that there really isn't one out of the box. In terms of complexity, preg_match is a one-liner that would replace both of your solutions. Which is, of course, a PHP fault. – pp19dd May 24 '12 at 17:41
  • It's actually not PHP's fault. There is simply no method in the DOM API spec to do what you want. But the API was written in such a way as to be extensible with OO. The solution I provided does exactly that, just as the spec authors intended. Extend DOMDocument, write your new method(s), and away you go. As soon as you hit string manipulation (regex) for XML, you're in bad practice/hack territory. It won't hold up as your code base increases in complexity. – matb33 May 24 '12 at 21:35
  • 1
    Moving the children of one node to another is tricky, because the childNodes->length of the first node shrinks as each child is moved. So for loops that depend on the nodeList length will silently fail. Your while loop while ($html->childNodes->length > 0) {.. } seems like a good way of handling this, thanks! Took me a while to find this. – And Finally Jan 13 '16 at 11:35
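The shrinking-list pitfall described in this comment can be sketched as follows. This is a minimal illustration, not code from the answer: the `src`/`dst` element names and the XML string are made up for the demonstration. It moves children with a counting for loop first, then finishes the job with a loop that checks firstChild each time.

```php
<?php
// Hypothetical document: four children under <src>, none under <dst>.
$doc = new DOMDocument();
$doc->loadXML('<root><src><a/><b/><c/><d/></src><dst/></root>');

$src = $doc->getElementsByTagName('src')->item(0);
$dst = $doc->getElementsByTagName('dst')->item(0);

// Buggy: childNodes is a live list, so each move shifts the remaining
// children down by one while $i keeps counting up.
for ($i = 0; $i < $src->childNodes->length; $i++) {
    $dst->appendChild($src->childNodes->item($i));
}
$leftBehind = $src->childNodes->length; // some children were skipped

// Safe: keep taking the first child until none remain.
while ($src->firstChild) {
    $dst->appendChild($src->firstChild);
}
```

The while loop in the answer (testing childNodes->length > 0 and always moving item(0)) works for the same reason: it never relies on an index that outlives a move.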
11

saveHTML can output a subset of the document, which means we can ask it to output each child node one by one by traversing body.

$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title...</strong></p>
<a href="http://google.com"><img src="http://google.com/img.jpeg" alt=""></a>
<p>...to be one of those crowning achievements...</p>');
// manipulation goes here

// Let's traverse the body and output every child node
$bodyNode = $doc->getElementsByTagName('body')->item(0);
foreach ($bodyNode->childNodes as $childNode) {
  echo $doc->saveHTML($childNode);
}

This might not be the most elegant solution, but it works. Alternatively, we can wrap all child nodes inside a container element (say, a div) and output only that container (though the container tag will be included in the output).
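The container alternative could look roughly like this. It is a sketch under assumptions: the div is an arbitrary choice of wrapper, the sample markup is shortened, and saveHTML() with a node argument needs PHP 5.3.6 or later.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title</strong></p><p>Second paragraph</p>');

$body = $doc->getElementsByTagName('body')->item(0);

// Hypothetical wrapper element; any container tag would do.
$container = $doc->createElement('div');

// Move every child of <body> into the container, first child first.
while ($body->firstChild) {
    $container->appendChild($body->firstChild);
}
$body->appendChild($container);

// Serialize only the container: no doctype, <html>, or <body> in the result.
$wrapped = $doc->saveHTML($container);
echo $wrapped;
```

The trade-off is the one stated above: the output is a single node, but it carries the extra div around the original content.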

galymzhan
  • 1
    Since there's been a PHP change over saveHTML parameters, this doesn't work on some versions. However, switching saveHTML into saveXML does the trick. – pp19dd May 24 '12 at 17:38
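A sketch of the saveXML variant mentioned in this comment, for PHP versions where saveHTML() does not take a node argument. The example.com URLs are placeholders, not from the thread. Note that saveXML() serializes as XML, so empty elements such as img come out self-closed.

```php
<?php
$doc = new DOMDocument();
$doc->loadHTML('<p><strong>Title</strong></p><a href="http://example.com/"><img src="http://example.com/img.jpeg" alt=""></a>');

$body = $doc->getElementsByTagName('body')->item(0);

// Concatenate the XML serialization of each child of <body>.
$out = '';
foreach ($body->childNodes as $child) {
    $out .= $doc->saveXML($child);
}
echo $out;
```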
2

Here's how I've done it:

-- Quick helper function that gives you the HTML contents of a specific DOM element

function nodeContent($n, $outer=false) {
   $d = new DOMDocument('1.0');
   $b = $d->importNode($n->cloneNode(true),true);
   $d->appendChild($b); $h = $d->saveHTML();
   // remove outer tags
   if (!$outer) $h = substr($h,strpos($h,'>')+1,-(strlen($n->nodeName)+4));
   return $h;
}

-- Find the body node in your doc and get its contents

$xpath = new DOMXPath($doc);
$query = $xpath->query("//body")->item(0);
if($query)
{
    echo nodeContent($query);
}

UPDATE 1:

Some extra info: since PHP 5.3.6, DOMDocument->saveHTML() accepts an optional DOMNode parameter, similar to DOMDocument->saveXML(). You can do:

$xpath = new DOMXPath($doc);
$query = $xpath->query("//body")->item(0);
echo $doc->saveHTML($query);

For older versions, the helper function above will do the job.

Alexey Gerasimov
  • Helper function is definitely useful to isolate fragments, but, the output is still a workaround - I mean, removing outer doctype/html/body tags can be done with preg_match( "/(.*)<\/body>/is", $doc->saveHTML(), $r ); ($r[1] contains the inner.) There may not be an answer for my question (saveHTML routine hardcoded?), but I was trying to avoid hacks and keep it simple and efficient. – pp19dd May 21 '12 at 19:58
  • I just updated the answer for you. If you're on 5.3.6 and up, you're good. For others, need to parse the data – Alexey Gerasimov May 21 '12 at 20:26
0

tl;dr

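requires: PHP 5.4.0 and Libxml 2.6.0 for the options parameter; the two flags used here need Libxml 2.7.7 (LIBXML_HTML_NOIMPLIED) and 2.7.8 (LIBXML_HTML_NODEFDTD)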

$doc->loadHTML("<p>test</p>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

explanation

http://php.net/manual/en/domdocument.loadhtml.php "Since PHP 5.4.0 and Libxml 2.6.0, you may also use the options parameter to specify additional Libxml parameters."

LIBXML_HTML_NOIMPLIED Sets HTML_PARSE_NOIMPLIED flag, which turns off the automatic adding of implied html/body... elements.

LIBXML_HTML_NODEFDTD Sets HTML_PARSE_NODEFDTD flag, which prevents a default doctype being added when one is not found.
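A slightly fuller sketch of these flags with more than one top-level element. The assumption here, not stated in the answer, is that with LIBXML_HTML_NOIMPLIED the parser may treat the first element as the document root, so the snippet is wrapped in a throwaway div and only the div's children are serialized. The `$snippet` value and the wrapper are illustrative.

```php
<?php
// Hypothetical input with two top-level elements.
$snippet = '<p>first</p><p>second</p>';

$doc = new DOMDocument();
// Wrap in a single root so the parser has exactly one top-level element.
$doc->loadHTML(
    '<div>' . $snippet . '</div>',
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);

// manipulation goes here

// Serialize the wrapper's children only, leaving the wrapper itself out.
$out = '';
foreach ($doc->documentElement->childNodes as $child) {
    $out .= $doc->saveHTML($child);
}
echo $out;
```

For a snippet with a single root element, the one-liner above followed by a plain $doc->saveHTML() should be enough.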

Tricky
-1

You have two ways to accomplish this:

$content = trim($dom->saveHTML()); // saveHTML() output ends with a newline, so trim it before counting from the end.
$content = substr($content, strpos($content, '<html><body>') + 12); // Remove everything before and including the opening HTML and BODY tags.
$content = substr($content, 0, -14); // Remove the closing BODY and HTML tags (14 characters).

Or even better is this way:

$dom->normalizeDocument();
$content = $dom->saveHTML();
Peyman Mohamadpour