Why is the Doctype being printed on my page?

Question

I've imported the content from a Blogger account into a WordPress blog.

I've had to apply some XPath and regex to remove some nasty formatting.

global $post;
$html = mb_convert_encoding($content, 'HTML-ENTITIES', "UTF-8");
$doc = new DOMDocument();
@$doc->loadHTML($html);
$xpath = new DOMXPath($doc);
foreach ($xpath->query('//br[not(preceding::text())]') as $node) {
    $node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//a[string-length(.) = 0]');
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}
$nodes = $xpath->query('//*[not(text() or node() or self::br)]');
foreach ($nodes as $node) {
    $node->parentNode->removeChild($node);
}
remove_filter('the_content', 'wpautop');
$content = $doc->saveHTML();
$content = ltrim($content, '<br>');
$content = strip_tags($content, '<br> <a> <iframe>');
$content = preg_replace(array('/(<br\s*\/?>\s*){1,}/'), array('<br/><br/>'), $content);
$content = str_replace('&nbsp;', ' ', $content);
$content = "<p>".implode("</p>\n\n<p>", preg_split('/\n(?:\s*\n)+/', $content))."</p>";
return $content;

For some reason, though, a stray DOCTYPE is being printed inside my page and I don't know why.

<p>!DOCTYPE html PUBLIC &#8220;-//W3C//DTD HTML 4.0 Transitional//EN&#8221; &#8220;http://www.w3.org/TR/REC-html40/loose.dtd&#8221;>
    <br/>
    <br/>When the battle is on between contestants in a talent show, it gets really competitive when down to the last four. X-FactorUSAcontestant Marcus Canty knows this all too well as this is the stage he was voted off of the show earlier this year.
    <br/>
    <br/>
</p>

Could someone point me in a direction as to why this is happening?

Casimir et Hippolyte · Accepted Answer · 2014-01-21T15:43:57.267

4

When you load a piece of HTML with DOMDocument, a DOCTYPE and the html, head and body tags are added automatically (if missing), and unclosed tags are closed, to make it a "valid" HTML document. So when you call saveHTML(), you save all of this. If I remember correctly, the user-contributed notes in the PHP manual describe several tricks to avoid this.
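
One such trick, sketched here under assumptions: instead of serializing the whole document, serialize only the children of the body element. The helper name inner_body_html and the sample markup are hypothetical and not from the original post; passing a node to saveHTML() requires PHP 5.3.6 or later, and the string return type requires PHP 7.

```php
<?php
// Hypothetical helper (not from the original post): returns only the markup
// inside <body>, leaving out the DOCTYPE, <html> and <body> wrappers that
// DOMDocument adds when it parses a fragment.
function inner_body_html(DOMDocument $doc): string
{
    $body = $doc->getElementsByTagName('body')->item(0);
    if ($body === null) {
        return '';
    }

    $html = '';
    foreach ($body->childNodes as $child) {
        // saveHTML() serializes just the given node (PHP 5.3.6+).
        $html .= $doc->saveHTML($child);
    }
    return $html;
}

// Sample fragment, made up for illustration.
$doc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them
$doc->loadHTML('<div>First line<br>Second line</div>');
libxml_clear_errors();

$fragment = inner_body_html($doc);
echo $fragment, "\n";
```

In the code from the question, this would stand in for the plain $doc->saveHTML() call, so the later string clean-up never sees a DOCTYPE.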

edited Jan 21 '14 at 15:43

answered Jan 21 '14 at 15:38


Ah I see, so I need to find a way to stop DOMDocument from applying its DOCTYPE, when using saveHTML? – UzumakiDev Jan 21 '14 at 15:43

@UzumakiDev: No, you can't. See the PHP manual (or Stack Overflow) for a trick to save only a fragment of the code. – Casimir et Hippolyte Jan 21 '14 at 15:46
@UzumakiDev: take a look here: http://stackoverflow.com/questions/6851620/how-to-prevent-the-doctype-from-being-added-to-the-html – Casimir et Hippolyte Jan 21 '14 at 16:29
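
As a rough sketch of preventing the DOCTYPE from being added at load time: on PHP 5.4 or later built against libxml 2.7.8 or later, loadHTML() accepts option flags that ask the parser not to add the implied wrappers or a default DOCTYPE. The sample markup and variable names below are made up for illustration, and LIBXML_HTML_NOIMPLIED is generally reported to behave best when the fragment has a single root element, so treat this as an assumption-laden example, not a drop-in fix.

```php
<?php
// Assumption: PHP 5.4+ with libxml 2.7.8+, where both constants are defined.
$fragmentDoc = new DOMDocument();
libxml_use_internal_errors(true); // collect parser warnings instead of printing them

$fragmentDoc->loadHTML(
    '<div>First line<br>Second line</div>', // sample fragment with one root element
    LIBXML_HTML_NOIMPLIED | LIBXML_HTML_NODEFDTD
);
libxml_clear_errors();

// saveHTML() appends a trailing newline, hence the trim().
$saved = trim($fragmentDoc->saveHTML());
echo $saved, "\n";
```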
