3

I want to remove new lines from some html (with php) except in <pre> tags where whitespace is obviously important.

Jonny Barnes
    This is essentially html minification, which is the subject of another post: http://stackoverflow.com/questions/728260/html-minification. – David Andres Sep 13 '09 at 20:37

4 Answers

11

It may be 3 years later, but... The following code will remove all line breaks and whitespace as long as it is outside of pre tags. Cheers!

function sanitize_output($buffer)
{
    $search = array(
        '/\>[^\S ]+/s', //strip whitespaces after tags, except space
        '/[^\S ]+\</s', //strip whitespaces before tags, except space
        '/(\s)+/s'  // shorten multiple whitespace sequences
        );
    $replace = array(
        '>',
        '<',
        '\\1'
        );

    // -1 means "no limit"; passing null as the limit is deprecated since PHP 8.1
    $blocks = preg_split('/(<\/?pre[^>]*>)/', $buffer, -1, PREG_SPLIT_DELIM_CAPTURE);
    $buffer = '';
    foreach($blocks as $i => $block)
    {
      if($i % 4 == 2)
        $buffer .= $block; //content between <pre> and </pre>: keep untouched
      else 
        $buffer .= preg_replace($search, $replace, $block);
    }

    return $buffer;
}

ob_start("sanitize_output");
smdrager
1

If the HTML is well formed, you can rely on the fact that <pre> tags aren't allowed to be nested. Make two passes: first, split the input into pre blocks and everything else (a regular expression is enough for this). Then strip new lines from each non-pre block, and finally join all the blocks back together.

Note that most html isn't well formed, so this approach may have some limits to where you can use it.
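
The two passes described above could be sketched roughly as follows. This is only an illustration, assuming non-nested, properly closed <pre> tags; the function name strip_newlines_outside_pre() and the sample markup are made up for the example.

```php
<?php
// Hypothetical helper: removes \r and \n everywhere except inside <pre>...</pre>.
// Assumes <pre> blocks are not nested and every <pre> has a matching </pre>.
function strip_newlines_outside_pre(string $html): string
{
    // Capture each whole <pre>...</pre> block as a delimiter, so the result
    // alternates: even index = ordinary markup, odd index = a complete pre block.
    $parts = preg_split('#(<pre\b[^>]*>.*?</pre>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    if ($parts === false) {
        // Regex failure (e.g. backtrack limit hit): return the input unchanged.
        return $html;
    }

    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) {
            // Ordinary markup: drop line breaks.
            $parts[$i] = str_replace(["\r", "\n"], '', $part);
        }
        // Odd indexes are pre blocks and are left as they are.
    }

    return implode('', $parts);
}

// Made-up sample input.
$html = "<p>\nHello\n</p>\n<pre class=\"code\">line one\nline two</pre>\n<p>Bye</p>\n";
echo strip_newlines_outside_pre($html);
```

Capturing the whole block, opening and closing tag included, keeps any attributes on the <pre> tag and avoids having to track which delimiter was an opening tag and which a closing one.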

troelskn
1

Split the content up. This is easily done with...

$blocks = preg_split('/<(|\/)pre>/', $html);

Just be careful, because the $blocks elements won't contain the opening and closing pre tags. I feel that assuming the HTML is valid is acceptable, and therefore you can expect the pre blocks to be every other element in the array (1, 3, 5, ...). This is easily tested with $i % 2 == 1.
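
To make the shape of the split result concrete, here is a tiny sketch with a made-up input string. It uses '/<\/?pre>/', which should match the same tags as the pattern above.

```php
<?php
// Made-up input, used only to show the shape of the split result.
$html = "<p>a</p>\n<pre>x\ny</pre>\n<p>b</p>";

// The <pre> and </pre> tags act as delimiters and are dropped from the result.
$blocks = preg_split('/<\/?pre>/', $html);

// Indexes 0 and 2 hold ordinary markup; index 1 holds the text that was inside the pre block.
var_export($blocks);
```

Because the tags are dropped, they have to be added back when the blocks are joined again.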

Example "complete" script (modify as you need to)...

<?php
//our example HTML file - could just as easily be a read in file
$html = <<<EOF
<html>
  <head>
    <title>test</title>
  </head>
  <body>
    <h1>Title</h1>
    <p>
      This is an article about...
    </p>
    <pre>
      line one
      line two
      line three
    </pre>
    <div style="float: right;">
      random
    </div>
    </body>
</html>
EOF;

//break it all apart...
$blocks = preg_split('/<(|\/)pre>/', $html);

//and put it all back together again
$html = ""; //reuse as our buffer
foreach($blocks as $i => $block)
{
  if($i % 2 == 1)
    $html .= "\n<pre>$block</pre>\n"; //break out <pre>...</pre> with \n's
  else 
    $html .= str_replace(array("\n", "\r"), "", $block, $c);
}

echo $html;
?>
Sam Bisbee
0

The most upvoted answer depends on the HTML being "well formed".

It uses the % modulus operator to decide which keys to skip, which is fragile because most of the time the HTML isn't "well formed".

The basic idea here is the same, but instead of counting positions we set a variable when we find an opening <pre> tag.

The next iteration will be the content ($key + 1), and the one after that the closing tag ($key + 2).

Based on that logic, we can skip the content by comparing the current $key with the stored key + 1.

<?php

function sanitize_output() {

    ob_start( function ( $buffer ) {

        /**
         * preg_replace() UTF-8 troubleshooting.
         * 
         * Replacing empty space with preg_replace causes invalid characters with UTF-8.
         * As preg_replace() depends on the current defined locale, characters not supported will be returned as � invalid.
         * The /u flag is used to make regex unicode aware.
         * 
         * @see https://stackoverflow.com/a/74101068/3645650
         */
        $replace = array(
            '/\n/smu'       => '',      //Remove new lines.
            '/(\s)+/smu'    => '\\1',   //Replace multiple spaces with a single one.
        );
        
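        // -1 means "no limit"; passing null as the limit is deprecated since PHP 8.1.
        $buffer = preg_split( '/(<\/?pre[^>]*>)/', $buffer, -1, PREG_SPLIT_DELIM_CAPTURE );

        // Key of the last opening <pre> tag seen, or null when none is pending.
        // Initialised here so the comparison below never reads an undefined variable.
        $k = null;

        foreach ( $buffer as $key => $value ) {

            /**
             * If the $key is a <pre> opening tag.
             * $key + 1 is the pre tag's content.
             * $key + 2 is the pre closing tag.
             */
            if ( false !== stripos( $value, '<pre' ) ) {

                $k = $key;

            };

            if ( null !== $k && $k + 1 === $key ) {

                // This is the pre content: leave it untouched and reset the marker.
                $k = null;

                continue;

            };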

            $buffer[ $key ] = preg_replace( array_keys( $replace ), array_values( $replace ), $value );

        };

        return implode( '', $buffer );

    } );

};
amarinediary