
My first guess was the PHP DOM classes (with the formatOutput property). However, I cannot get this block of HTML to be formatted and output correctly. As you can see, the indentation and alignment are not correct.

$html = '
<html>
<body>
<div>

<div>

        <div>

                <p>My Last paragraph</p>
            <div>
                            This is another text block and some other stuff.<br><br>
                Again we will start a new paragraph
                            and some other stuff
                            <br>
        </div>
</div>
        <div>
                        <div>
                            <h1>Another Title</h1>
                                                    </div>
                        <p>Some text again <b>for sure</b></p>
                </div>
</div>
<div>
    <pre><code>
    <span>&lt;html&gt;</span>
        <span>&lt;head&gt;</span>
            <span>&lt;title&gt;</span>
                Page Title
            <span>&lt;/title&gt;</span>
            <span>&lt;/head&gt;</span>
    <span>&lt;/html&gt;</span>
    </code></pre>
</div>
</div>
</body>
</html>';

header('Content-Type: text/plain');
libxml_use_internal_errors(TRUE);

$dom = new DOMDocument;
$dom->preserveWhiteSpace = false;
$dom->formatOutput = true;
$dom->loadHTML($html);
print $dom->saveHTML();

Update: I added a pre-formatted code block to the example.

Xeoncross
  • For major blocks of text like that, you should use a [HEREDOC](http://php.net/heredoc) instead of a multi-line string. – Marc B Nov 03 '11 at 16:02
  • https://bugs.php.net/bug.php?id=48509 – gen_Eric Nov 03 '11 at 16:12
  • @MarcB, if you look at the post revisions you will see that it was a HEREDOC at first. However, markdown can't format HEREDOC strings. So, for your benefit it is now a multi-line string. *As if anyone would have a long string of HTML in their PHP file anyway...* :P – Xeoncross Nov 03 '11 at 16:39
  • @Rocket, I'm using `PHP 5.3.6-13ubuntu3.2 with Suhosin-Patch` and that fix was committed to PHP 5.3.3 according to the change log. – Xeoncross Nov 03 '11 at 16:44
  • Remove all indentation and unnecessary close tags. Having pretty, indented HTML is absolutely useless to the end user (actually it's bad because you waste bandwidth). Just check the source from Google or Facebook. – NullUserException Nov 03 '11 at 23:30
  • I don't get why you want to post formatted HTML; it will take more time to download and the user won't notice any change. If you want to indent the code to insert it in a code tag, you should do it with JavaScript: there are a lot of libraries to choose from, and some of them also have built-in code highlighting. If you're doing it for debugging purposes, simply use a tool like Firebug or the dev tools of Chrome/Explorer; they will show you a formatted and collapsible HTML tree, a lot better than indented code. – Plokko Nov 04 '11 at 01:35
  • @Plokko I'm not cleaning up HTML before it's sent to the user. I'm parsing some HTML documents trying to clean them up for other purposes. I agree that all HTML sent to the user should have extra whitespace removed. – Xeoncross Nov 04 '11 at 14:57
  • When calling `loadHTML` you should also use the `LIBXML_NOERROR | LIBXML_NOWARNING` flags to avoid filling up the error stack and eating your RAM. Either that, or call `libxml_clear_errors()` after. – Alix Axel Jun 16 '13 at 18:26

2 Answers


Here are some improvements over @hijarian's answer:

LibXML Errors

If you don't call libxml_use_internal_errors(true), PHP will output all HTML errors it finds. If you do call it, the errors aren't discarded; they are queued in an internal buffer that you can inspect by calling libxml_get_errors(). The problem is that this buffer eats memory, and DOMDocument is known to be very picky. If you're processing lots of files in a batch, you will eventually run out of memory. There are two solutions for this:

if (libxml_use_internal_errors(true) === true)
{
    libxml_clear_errors();
}

Since libxml_use_internal_errors(true) returns the previous value of this setting (default false), this has the effect of only clearing errors if you run it more than once (as in batch processing).

The other option is to pass the LIBXML_NOERROR | LIBXML_NOWARNING flags to the loadHTML() method. Unfortunately, for reasons that are unknown to me, this still leaves a couple of errors behind.
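As a minimal sketch of that second option (the sample markup and variable names are made up for illustration), the flags go in the second argument of loadHTML(), and a libxml_clear_errors() call afterwards takes care of anything that still gets queued:

```php
<?php
// Hypothetical, deliberately sloppy input
$markup = '<div><p>Unclosed paragraph<custom-tag>text</div>';

// Queue diagnostics internally instead of printing them; remember the old setting
$previous = libxml_use_internal_errors(true);

$dom = new DOMDocument();
// Ask libxml not to report errors or warnings for this load
$loaded = $dom->loadHTML($markup, LIBXML_NOERROR | LIBXML_NOWARNING);

// Whatever slipped through is still in the buffer; count it, then free it
$leftover = count(libxml_get_errors());
libxml_clear_errors();

// Restore the previous error-handling mode
libxml_use_internal_errors($previous);
```

In a batch loop, the libxml_clear_errors() call is the part that keeps memory flat, whether or not the flags are used.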

Bear in mind that DOMDocument will always output an error (even when using internal libxml errors and setting the suppressing flags) if you pass an empty (or blankish) string to the load*() methods.

Regex

The regex />\s*</im doesn't make a whole lot of sense. It's better to use ~>[[:space:]]++<~m, which also catches \v (vertical tabs), only replaces when whitespace actually exists (+ instead of *), never gives back what it matched (the possessive ++, which is faster), and drops the case-insensitive overhead (whitespace has no case).
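A quick way to see what the tighter pattern does, as a sketch with made-up sample markup:

```php
<?php
// Hypothetical input: whitespace-only gaps between tags
$html = "<div>\n    <p>text</p>\n\t<br>\n</div>";

// Remove whitespace between a closing '>' and the next '<'
$collapsed = preg_replace('~>[[:space:]]++<~m', '><', $html);

echo $collapsed; // <div><p>text</p><br></div>
```

The pattern knows nothing about <pre> or <code>, so whitespace-only gaps between tags inside such blocks are collapsed too.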

You may also want to normalize newlines to \n and strip other control characters (especially if the origin of the HTML is unknown), since a \r will come back as &#13; after saveXML(), for instance.
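One possible way to do that normalization, as a sketch. The sample string is invented, and which control characters to strip is an assumption (here: every C0 control except tab and newline, plus DEL):

```php
<?php
// Hypothetical input with mixed line endings and a stray BEL character
$raw = "<p>one</p>\r\n<p>two\x07</p>\r<p>three</p>\n";

// \R matches any newline sequence (\r\n, \r, \n, ...); map them all to \n
$normalized = preg_replace('~\R~u', "\n", $raw);

// Drop the remaining control characters, keeping \t and \n
$clean = preg_replace('~[\x00-\x08\x0B\x0C\x0E-\x1F\x7F]~', '', $normalized);
```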

DOMDocument::$preserveWhiteSpace is useless and unnecessary after running the above regex.

Oh, and I don't see the need to protect blank pre-like tags here. Whitespace-only snippets are useless.

Additional Flags for loadHTML()

  • LIBXML_COMPACT - "this may speed up your application without needing to change the code"
  • LIBXML_NOBLANKS - need to run more tests on this one
  • LIBXML_NOCDATA - need to run more tests on this one
  • LIBXML_NOXMLDECL - documented, but not implemented =(

UPDATE: Setting any of these options will have the effect of not formatting the output.

On saveXML()

The DOMDocument::saveXML() method will output the XML declaration. We need to manually purge it (since the LIBXML_NOXMLDECL isn't implemented). To do that, we could use a combination of substr() + strpos() to look for the first line break or even use a regex to clean it up.
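Both clean-up approaches could look roughly like this. It is only a sketch: the tiny XML document and the variable names are placeholders.

```php
<?php
$dom = new DOMDocument();
$dom->loadXML('<root><child>text</child></root>');

// saveXML() without arguments puts the XML declaration on the first line
$xml = $dom->saveXML();

// Option 1: substr() + strpos(), cutting everything up to the first line break
$viaSubstr = (strpos($xml, '<?xml') === 0)
    ? substr($xml, strpos($xml, "\n") + 1)
    : $xml;

// Option 2: a regex anchored at the start of the string
$viaRegex = preg_replace('~^<\?xml[^>]*\?>\s*~', '', $xml);
```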

Another option, that seems to have an added benefit is simply doing:

$dom->saveXML($dom->documentElement);

Another thing: if you have inline tags that are empty, such as the b, i or li in:

<b class="carret"></b>
<i class="icon-dashboard"></i> Dashboard
<li class="divider"></li>

The saveXML() method will seriously mangle them (placing the following element inside the empty one), messing up your whole HTML. Tidy also has a similar problem, except that it just drops the node.

To fix that, you can use the LIBXML_NOEMPTYTAG flag along with saveXML():

$dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

This option writes every empty element as an explicit open/close pair, so empty inline tags survive intact, but empty (aka self-closing) tags are expanded the same way.

Fixing HTML[5]

With all the stuff we did so far, our HTML output has two major problems now:

  1. no DOCTYPE (it was stripped when we used $dom->documentElement)
  2. empty tags are now inline tags, meaning one <br /> turned into two (<br></br>) and so on

Fixing the first one is fairly easy, since HTML5 is pretty permissive:

"<!DOCTYPE html>\n" . $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG);

To get our empty tags back, which are the following:

  • area
  • base
  • basefont (deprecated in HTML5)
  • br
  • col
  • command
  • embed
  • frame (deprecated in HTML5)
  • hr
  • img
  • input
  • keygen
  • link
  • meta
  • param
  • source
  • track
  • wbr

We can either use str_[i]replace in a loop:

foreach (explode('|', 'area|base|basefont|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr') as $tag)
{
    // Match the expanded form, e.g. "></br>", and collapse it back to " />"
    $html = str_ireplace('></' . $tag . '>', ' />', $html);
}

Or a regular expression:

$html = preg_replace('~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~i', ' />', $html);

This is a costly operation. I haven't benchmarked the two approaches, so I can't tell you which one performs better, but I would guess preg_replace(). Additionally, I wasn't sure whether the case-insensitive version is needed, since I was under the impression that the tags are always lowercased. UPDATE: Tags are always lowercased.

On <script> and <style> Tags

These tags will always have their content (if existent) encapsulated into (uncommented) CDATA blocks, which will probably break their meaning. You'll have to replace those tokens with a regular expression.

Implementation

function DOM_Tidy($html)
{
    $dom = new \DOMDocument();

    if (libxml_use_internal_errors(true) === true)
    {
        libxml_clear_errors();
    }

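    // Encode non-ASCII characters as entities so loadHTML() doesn't misread UTF-8
    // (note: the 'HTML-ENTITIES' target is deprecated as of PHP 8.2)
    $html = mb_convert_encoding($html, 'HTML-ENTITIES', 'UTF-8');
    // Normalize newlines to \n and drop whitespace-only gaps between tags
    $html = preg_replace(array('~\R~u', '~>[[:space:]]++<~m'), array("\n", '><'), $html);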

    if ((empty($html) !== true) && ($dom->loadHTML($html) === true))
    {
        $dom->formatOutput = true;

        if (($html = $dom->saveXML($dom->documentElement, LIBXML_NOEMPTYTAG)) !== false)
        {
            $regex = array
            (
                '~' . preg_quote('<![CDATA[', '~') . '~' => '',
                '~' . preg_quote(']]>', '~') . '~' => '',
                '~></(?:area|base(?:font)?|br|col|command|embed|frame|hr|img|input|keygen|link|meta|param|source|track|wbr)>~' => ' />',
            );

            return '<!DOCTYPE html>' . "\n" . preg_replace(array_keys($regex), $regex, $html);
        }
    }

    return false;
}
Alix Axel
  • Just wondering if your answer deals with XML or HTML or both? You write about processing XML files and then converting them to HTML? Is that how it should be done? – eozzy Jun 23 '15 at 06:37

Here's a comment on php.net: http://ru2.php.net/manual/en/domdocument.save.php#88630

It looks like when you load HTML from a string (like you did), DOMDocument becomes lazy and does not format anything in it.

Here's a working solution to your problem:

// Clean your HTML by hand first
$html = preg_replace('/>\s*</im', '><', $html);
$dom = new DOMDocument;
// Property names are case-sensitive; set this before loading so the parser sees it
$dom->preserveWhiteSpace = false;
$dom->loadHTML($html);
$dom->formatOutput = true;
// Use saveXML(), not saveHTML()
print $dom->saveXML();

Basically, you throw out the spaces between tags and use saveXML() instead of saveHTML(). saveHTML() just does not work in this situation. However, you get an XML declaration on the first line of the output.

Cole Tobin
hijarian
  • That is a good start. Using the XML output does help. However, it destroys the indentation used in a pre-formatted code block. Nevertheless, perhaps I can just alter your regex to not include `<span>` tags, which are commonly placed around code in `<pre>` blocks. Still, I'd like to know why the PHP DOM lib is broken. **Edit:** Yep, `'/>\s*<(?!span)/i'` seems to work fine. – Xeoncross Nov 04 '11 at 15:05
  • @Xeoncross: David, the `>\s*<` regex should only trim if the pre-like code block consists entirely of spacing chars. If that's the case, I don't see why you would need to preserve redundant white space. What am I missing here? – Alix Axel Jun 16 '13 at 18:23
  • Why doesn't `saveHTML()` formatOutput anymore? Does it work on files? – Alix Axel Jun 16 '13 at 18:24