loadHTML LIBXML_HTML_NOIMPLIED on an html fragment generates incorrect tags

Question

Using the LIBXML_HTML_NOIMPLIED flag with an html fragment generates incorrect tags:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';
$doc = new DOMDocument();
$doc->loadHTML($str, LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);
echo $doc->saveHTML();

Outputs:

<p>Lorem ipsum dolor sit amet.<p>Nunc vel vehicula ante.</p></p>

I have found hacks to work around this using regexes, but that defeats the purpose of using DOM. I have tested this with several versions of libxml and PHP, most recently libxml 2.9.2 with PHP 5.6.7 (Debian Jessie). Any suggestions appreciated.

Unfortunately, Libxml re-arranges the document. If you note, it also prints a warning: the issue is that there isn't a single root element. You can either wrap the content in a `div` as the answer suggests or remove the `LIBXML_HTML_NOIMPLIED` option and use any other solution from http://stackoverflow.com/questions/4879946/how-to-savehtml-of-domdocument-without-html-wrapper — Alessandro Vendruscolo, Sep 22 '15 at 07:31

score 27 · Answer 1 · edited May 23 '17 at 11:54

The re-arrangement is done by the LIBXML_HTML_NOIMPLIED option you're using. Looks like it's not stable enough for your case.

You might also want to avoid it for portability reasons. For example, I have a PHP 5.4.36 with Libxml 2.7.8 at hand that does not support LIBXML_HTML_NOIMPLIED (Libxml >= 2.7.7) but does support the later LIBXML_HTML_NODEFDTD option (Libxml >= 2.7.8).
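
If portability is the concern, one hedged way to guard against a missing flag is to check for the constant before using it. This is only a sketch: the `$options` variable name is illustrative and not part of the original answer, and `LIBXML_DOTTED_VERSION` is used merely to print which libxml version PHP reports.

```php
<?php
// Build the option bitmask only from flags this PHP/libxml build defines.
$options = 0;
foreach (array('LIBXML_HTML_NOIMPLIED', 'LIBXML_HTML_NODEFDTD') as $name) {
    if (defined($name)) {
        $options |= constant($name);
    } else {
        echo "$name is not available in this build\n";
    }
}

// Report the libxml version and the resulting bitmask for debugging.
printf("libxml %s, options bitmask: %d\n", LIBXML_DOTTED_VERSION, $options);
```

The resulting `$options` value can then be passed as the second argument to `loadHTML()`.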

One way I know of dealing with it: when you load the fragment, wrap it in a <div> element:

$doc->loadHTML("<div>$str</div>");

This helps to guide DOMDocument on the structure you want.

You can then extract this container from the document itself:

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);

And then remove all children from the document:

while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

Now the document is completely empty, so you can append children again. The <div> container element removed earlier still holds the fragment's nodes. Appending a node to the document moves it out of the container, so the loop ends once the container is empty:

while ($container->firstChild) {
    $doc->appendChild($container->firstChild);
}

The fragment then can be retrieved with the known saveHTML method:

echo $doc->saveHTML();

Which gives in your scenario:

<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>

This approach differs a little from the existing material on this site (see the references I give below), so here is the complete example in one piece:

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

$container = $doc->getElementsByTagName('div')->item(0);
$container = $container->parentNode->removeChild($container);
while ($doc->firstChild) {
    $doc->removeChild($doc->firstChild);
}

while ($container->firstChild) {
    $doc->appendChild($container->firstChild);
}

echo $doc->saveHTML();

I also really recommend the reference question How to saveHTML of DOMDocument without HTML wrapper? for further reading, as well as the one about inner-html.
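
For the inner-html route, a minimal sketch could look like the following. It keeps the <div> wrapper from above but, instead of emptying the document, serializes each child of the container with `saveHTML($node)`. The helper name `innerHtml` is hypothetical and not taken from the referenced questions. Note that when dumping single nodes the serializer may add line breaks between block elements, so the result can differ from the input in whitespace.

```php
<?php
// Hypothetical helper: concatenate the serialized children of a node.
function innerHtml(DOMNode $node)
{
    $html = '';
    foreach ($node->childNodes as $child) {
        // saveHTML() with a node argument serializes only that subtree.
        $html .= $node->ownerDocument->saveHTML($child);
    }
    return $html;
}

$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
$doc->loadHTML("<div>$str</div>");

// The wrapper is the first <div> in document order.
$container = $doc->getElementsByTagName('div')->item(0);
$inner = innerHtml($container);

echo $inner;
```

This leaves the document itself untouched, which may be preferable when you only need the fragment as a string.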

References

Thank you for your response, and a solution without regexes. I have rarely used DOMdocument in the past, but after a week of testing with large html fragments, I am finding the ad nauseum advice to use it over regexes dubious. — J.L. Hill, Apr 07 '15 at 20:41
Hmm, depends on what you want to achieve. There are times you need (or want have) a HTML parser. In PHP that means either **DOMDocument** or the **tidy** class (which also offers nodes but has no xpath). When you ask me personally: You can't beat xpath with regexes when it comes to HTML. I also try to give answers that offer a path for alternation between the two, for example here: http://stackoverflow.com/a/29481904/367456 — hakre, Apr 07 '15 at 20:48

score 11 · Answer 2 · Nicholas Shanks · edited Jan 12 '17 at 07:02

The LIBXML_HTML_NOIMPLIED option is not buggy, it's just badly documented. To fix the problem, wrap your input string with <html>…</html>, process your HTML, and then strip that off the output. LibXML requires a root node, and is treating the first element it finds as the root node, deleting the (incorrectly located) closing tag it finds half-way through, and then outputting the closing tag of the first element it found at the end of the document. It's logical when you see it from (Lib)XML's perspective.
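
A minimal sketch of that idea, assuming the wrapper is a bare `<html>` element and that plain string functions are acceptable for stripping it again (the `$open` and `$close` variable names are illustrative only):

```php
<?php
$str = '<p>Lorem ipsum dolor sit amet.</p><p>Nunc vel vehicula ante.</p>';

$doc = new DOMDocument();
// Wrap the fragment so libxml sees a single root element.
$doc->loadHTML("<html>$str</html>", LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD);

// ... process $doc here ...

// saveHTML() appends a trailing newline, so trim before stripping the wrapper.
$html = trim($doc->saveHTML());
$open = '<html>';
$close = '</html>';

// Only strip when the output really starts and ends with the wrapper tags.
if (substr($html, 0, strlen($open)) === $open
    && substr($html, -strlen($close)) === $close) {
    $html = substr($html, strlen($open), -strlen($close));
}

echo $html;
```

The prefix and suffix check avoids cutting into the content if the serializer ever emits something other than the expected wrapper.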

answered Apr 11 '16 at 11:23


LIBXML_HTML_NOIMPLIED also messes up the HTML code by removing the tabs, indents and the line breaks – Zoltán Süle Jan 10 '18 at 11:20

@ZoltánSüle Szia. Does that even apply within `<pre>` and `<plaintext>` elements? If so, that is a bug I'd say. My own HTML usage is minified so I would have missed this. – Nicholas Shanks Jan 10 '18 at 12:06
                
Szia. I just tried it out. If I wrap the HTML code with `<pre>` or `<plaintext>` then LIBXML_HTML_NOIMPLIED doesn't remove the tabs and the indents. – Zoltán Süle Jan 10 '18 at 12:32