7

I have a string in the variable $html that contains minified HTML code, all on one line, like:

$html = '<body><div><p>hello</p><div></body>';

How do I beautify/pretty print the HTML so that my variable becomes like:

 $html = '<body>
             <div>
               <p>hello</p>
             <div>
          </body>';

I know the tidy extension is a possibility, but how can this be done without an extension?

EDIT: PLEASE read the question. I am not asking how to beautify HTML code via some external site. I am asking how to do it in PHP, specifically targeting the string variable.

Alex Andrei
Henrik Petterson

3 Answers

17

Using DOMDocument, we load the HTML with the LIBXML_HTML_NOIMPLIED flag,
which prevents the loadHTML method from adding the extra html wrapper.

We then save as XML to get the nice indentation, passing $dom->documentElement as the argument so that the XML header is omitted.

$html = '<body><div><p>hello</p><div></body>';

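$dom = new DOMDocument();

// Do not keep redundant whitespace, so the output can be re-indented
$dom->preserveWhiteSpace = false;
// LIBXML_HTML_NOIMPLIED stops the parser from adding the html wrapper
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
// Indent the serialised output
$dom->formatOutput = true;

// Passing the root element omits the XML header
print $dom->saveXML($dom->documentElement);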

This will output

<body>
  <div>
    <p>hello</p>
    <div/>
  </div>
</body>

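Notice that the parser repaired the HTML for you: the second `<div>` in the input, which I assume should have been a closing tag, was treated as a new, empty element, and both divs were closed automatically.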

If we pass well-formed HTML as the input string, the output is what you asked for:

$html = '<body><div><p>hello</p></div></body>';

<body>
  <div>
    <p>hello</p>
  </div>
</body>
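
Since the question asks for the result to end up back in the string variable rather than being printed, the same steps can be wrapped in a function. This is only a minimal sketch: the function name `prettyPrintHtml` is made up for illustration, and the `libxml_use_internal_errors()` calls are an addition (not part of the code above) that keeps parse warnings from malformed HTML out of the output. It assumes a non-empty input string.

```php
<?php
// Hypothetical helper name; wraps the DOMDocument steps shown above.
function prettyPrintHtml(string $html): string
{
    // Assumption: collect libxml parse warnings instead of emitting them.
    $previous = libxml_use_internal_errors(true);

    $dom = new DOMDocument();
    $dom->preserveWhiteSpace = false;
    $dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
    $dom->formatOutput = true;

    // Discard the collected warnings and restore the previous setting.
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Serialising only the root element omits the XML header.
    $out = $dom->saveXML($dom->documentElement);

    // Fall back to the untouched input if serialisation fails.
    return $out === false ? $html : $out;
}

$html = '<body><div><p>hello</p></div></body>';
$html = prettyPrintHtml($html); // $html now holds the indented markup
echo $html, PHP_EOL;
```

Because the result is serialised as XML, empty elements come out self-closed (as with `<div/>` above), which may or may not be acceptable for your use.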
Alex Andrei
  • Thank you very much. I ended up going with the following approach: http://pastebin.com/Q94UbgSW Which is very similar to yours. Does yours do anything different (or better for that matter)? – Henrik Petterson Jan 01 '16 at 14:27
  • I just checked the output of your function and I see it has the `DOCTYPE` declaration and the `html` tag wrapper. Also no indentation. – Alex Andrei Jan 01 '16 at 14:30
  • If your solution works for you, then use it :) It's all about what ultimately works for **you** – Alex Andrei Jan 01 '16 at 14:31
  • I know, but what I am asking is, does your approach work better generally? I am scraping various HTML from sites, storing it in the variable, and tidying it up with this function. Will yours work better in terms of compatibility with different types of HTML code? – Henrik Petterson Jan 01 '16 at 14:32
  • not really, the only difference between my approach and yours is what I pointed out above, the lack of indentation, extra `DOCTYPE` and `html`. If you pass incorrect `html` it will still get fixed for you. – Alex Andrei Jan 01 '16 at 14:34
  • Thank you very much for the explanation! – Henrik Petterson Jan 01 '16 at 14:35
  • This is genius, very nice solution! – Evochrome Jan 01 '16 at 15:28
  • If you have to deal with messy code that does not validate, then this approach is a limited option. While you get hints about faults in the code, the display might break, and you have to work from top to bottom to solve the issues before the whole page is shown. For validating it's good, but for partial work this is not an option. – David Aug 17 '17 at 09:09
4

I've used DOMDocument, but it seems to be very sensitive to broken HTML and HTML errors.

Anyhow, DOMDocument requires the dom extension, so I used the tidy PHP extension instead, as it works perfectly for me: it fixes HTML errors and prettifies the HTML as well.

Use the code from this example:

$config = array(
    'indent'       => true,
    'output-xhtml' => true,
    'wrap'         => 200,
);

// Tidy
$tidy = new \tidy;
$tidy->parseString($html, $config, 'utf8');
$tidy->cleanRepair();

// Output
echo $tidy;
Serhii Polishchuk
0

While I'm not aware of any pre-built things to do that (most parsers provide functionality to minify HTML, not prettify it), it shouldn't be too hard to do yourself:

  1. Parse the HTML. DOMDocument might be a good idea for that.
  2. Recurse through the document. For each element...
    1. Output the opening tag - you will need to loop through attributes too
    2. If it contains no child elements (but possibly child text), output its textContent
      Otherwise, recurse into the element, outputting each child node, be it text or element. Indent appropriately based on number of recursion levels.
    3. Output the closing tag.

And... that's it. Shouldn't be too difficult :)
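
The steps in the list above could be sketched roughly as follows. This is an illustration under stated assumptions, not a complete formatter: the function name `renderNode` is hypothetical, comments and other node types are skipped, text is trimmed (which changes whitespace-sensitive content such as `<pre>`), and void elements such as `<br>` would be given a closing tag.

```php
<?php
// Hypothetical helper: serialises one DOM node with two-space indentation.
function renderNode(DOMNode $node, int $depth = 0): string
{
    $pad = str_repeat('  ', $depth);

    // Text nodes: emit trimmed, escaped text; skip whitespace-only nodes.
    if ($node instanceof DOMText) {
        $text = trim($node->nodeValue ?? '');
        return $text === '' ? '' : $pad . htmlspecialchars($text) . "\n";
    }

    // This sketch ignores comments and every other non-element node.
    if (!($node instanceof DOMElement)) {
        return '';
    }

    // Step 1: build the opening tag, looping through the attributes.
    $open = '<' . $node->nodeName;
    foreach ($node->attributes as $attr) {
        $open .= ' ' . $attr->nodeName . '="' . htmlspecialchars($attr->nodeValue ?? '') . '"';
    }
    $open .= '>';
    $close = '</' . $node->nodeName . '>';

    // Step 2: check whether the element has any child elements.
    $hasChildElement = false;
    foreach ($node->childNodes as $child) {
        if ($child instanceof DOMElement) {
            $hasChildElement = true;
            break;
        }
    }

    // No child elements: keep the tag and its text on a single line.
    if (!$hasChildElement) {
        return $pad . $open . htmlspecialchars(trim($node->textContent)) . $close . "\n";
    }

    // Otherwise recurse into each child, one indentation level deeper.
    $out = $pad . $open . "\n";
    foreach ($node->childNodes as $child) {
        $out .= renderNode($child, $depth + 1);
    }

    // Step 3: closing tag at the element's own indentation level.
    return $out . $pad . $close . "\n";
}

// Usage: parse with DOMDocument as suggested in step 1 of the list.
$html = '<body><div class="a"><p>hello</p></div></body>';
libxml_use_internal_errors(true); // keep parse warnings out of the output
$dom = new DOMDocument();
$dom->loadHTML($html, LIBXML_HTML_NOIMPLIED);
$html = renderNode($dom->documentElement);
echo $html;
```

Writing the serialiser by hand gives control over the indentation width and over which elements stay on one line, at the cost of having to handle the edge cases listed above yourself.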

Niet the Dark Absol