16

Google suggests minifying HTML, that is, removing all unnecessary whitespace. CodeIgniter can gzip its output, or that can be done via .htaccess, but I would also like to remove unnecessary whitespace from the final HTML output.

I experimented with the following code, and it seems to work: the resulting HTML has no excess spaces, and tab indentation is removed as well.

class Welcome extends CI_Controller 
{
    // CodeIgniter passes the finalized output to _output() as a parameter.
    function _output($output)
    {
        echo preg_replace('!\s+!', ' ', $output);
    }

    function index(){
    ...
    }
}

The problem is that tags such as <pre> and <textarea> may contain significant whitespace, and this regular expression collapses it too. So, how do I remove excess whitespace from the final HTML with a regular expression, without affecting the spaces or formatting inside these particular tags?
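
To make the problem concrete, here is a minimal sketch of what the blanket replacement does to preformatted text. The sample markup and variable names are made up for illustration.

```php
<?php
// Hypothetical sample markup: the <pre> block relies on its newline and indentation.
$html = "<div>\n    <pre>line 1\n    line 2</pre>\n</div>";

// The blanket replacement used in the controller above.
$collapsed = preg_replace('!\s+!', ' ', $html);

echo $collapsed, "\n";
// The newline and indentation inside <pre> have become a single space,
// so the preformatted text would now render on one line.
```

The same thing happens to the contents of <textarea>, which is why those elements need to be excluded from the replacement.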

Thanks to @Alan Moore I got the answer. This worked for me:

echo preg_replace('#(?ix)(?>[^\S ]\s*|\s{2,})(?=(?:(?:[^<]++|<(?!/?(?:textarea|pre)\b))*+)(?:<(?>textarea|pre)\b|\z))#', ' ', $output);

ridgerunner did a very good job of analyzing this regular expression. I ended up using his solution. Cheers to ridgerunner.

Peter Mortensen
Aman
  • 12
    Do not do HTML using regular expressions. – SLaks Mar 15 '11 at 13:23
  • Infinite upvotes to you, SLaks. – Delan Azabani Mar 15 '11 at 13:24
  • ok, so what could be a good way of reformatting final html output then? – Aman Mar 15 '11 at 13:30
  • 2
    Like the two other comments above, I suggest you read this great answer: http://stackoverflow.com/questions/728260/html-minification/1102101#1102101 Don't do that. Do much more before.. – guillaumepotier Mar 15 '11 at 13:34
  • possible duplicate... http://stackoverflow.com/questions/3480343/possible-to-use-codeigniter-output-compression-with-pre-to-display-code-blocks – jondavidjohn Mar 15 '11 at 14:20
  • I second the "no regex on HTML" advice. HTML is just way too complex to do with regexes. Use a proper HTML parser to do the job, or write dense HTML in the first place. If you use XHTML, you can load the document in an XML parser and then spit it back out. Most XML libraries will allow you to remove unnecessary whitespace in the process. – tdammers Mar 15 '11 at 14:23
  • Why? Use output compression and waste CPU cycles on something else. – rik Mar 15 '11 at 14:37
  • **Related:** http://stackoverflow.com/a/33844247/1163000 – Taufik Nurrohman Nov 21 '15 at 14:06

3 Answers

51

For those curious about how Alan Moore's regex works (and yes, it does work), I've taken the liberty of commenting it so it can be read by mere mortals:

function process_data_alan($text)
{
    $re = '%# Collapse ws everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          (?:           # Begin (unnecessary) group.
            (?:         # Zero or more of...
              [^<]++    # Either one or more non-"<"
            | <         # or a < starting a non-blacklist tag.
              (?!/?(?:textarea|pre)\b)
            )*+         # (This could be "unroll-the-loop"ified.)
          )             # End (unnecessary) group.
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %ix';
    $text = preg_replace($re, " ", $text);
    return $text;
}

I'm new around here, but I can see right off that Alan is quite good at regex. I would only add the following suggestions.

  1. There is an unnecessary (non-capturing) group which can be removed.
  2. Although the OP did not say so, the <SCRIPT> element should be added to the <PRE> and <TEXTAREA> blacklist.
  3. Adding the 'S' PCRE "study" modifier speeds up this regex by about 20%.
  4. There is an alternation group in the lookahead which is ripe for applying Friedl's "unrolling-the-loop" efficiency construct.
  5. On a more serious note, this same alternation group, i.e. (?:[^<]++|<(?!/?(?:textarea|pre)\b))*+, is susceptible to excessive PCRE recursion on large target strings. This can result in a stack overflow that causes the Apache/PHP executable to seg-fault and crash silently, with no warning. (The Win32 build of Apache httpd.exe is particularly susceptible to this because it has only a 256KB stack, compared to the *nix executables, which are typically built with an 8MB stack or more.) Philip Hazel (the author of the PCRE regex engine used in PHP) discusses this issue in the documentation: PCRE DISCUSSION OF STACK USAGE. Although Alan has correctly applied the same fix that Philip shows in this document (applying a possessive plus to the first alternative), there will still be a lot of recursion if the HTML file is large and has a lot of non-blacklisted tags. For example, on my Win32 box (with an executable having a 256KB stack), the script blows up with a test file of only 60KB. Note also that PHP unfortunately does not follow the recommendations and sets the default recursion limit way too high, at 100000. (According to the PCRE docs, this should be set to a value equal to the stack size divided by 500.)
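
The stack-size arithmetic from the last point can be sketched as follows. The helper name recursionLimitForStack is made up for illustration, and the divisor of 500 is simply the rule of thumb quoted above, not a guarantee for every build.

```php
<?php
// Hypothetical helper: derive a pcre.recursion_limit value from the
// executable's stack size, using the "stack size / 500" rule of thumb.
function recursionLimitForStack(int $stackBytes): int
{
    return intdiv($stackBytes, 500);
}

echo recursionLimitForStack(256 * 1024), "\n";      // 256KB stack (Win32 Apache)
echo recursionLimitForStack(8 * 1024 * 1024), "\n"; // 8MB stack (typical *nix)
```

This prints 524 and 16777, and either value could then be passed (as a string) to ini_set("pcre.recursion_limit", ...).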

Here is an improved version which is faster than the original, handles larger input, and gracefully fails with a message if the input string is too large to handle:

// Set PCRE recursion limit to sane value = STACKSIZE / 500
// ini_set("pcre.recursion_limit", "524"); // 256KB stack. Win32 Apache
ini_set("pcre.recursion_limit", "16777");  // 8MB stack. *nix
function process_data_jmr1($text)
{
    $re = '%# Collapse whitespace everywhere but in blacklisted elements.
        (?>             # Match all whitespans other than single space.
          [^\S ]\s*     # Either one [\t\r\n\f\v] and zero or more ws,
        | \s{2,}        # or two or more consecutive-any-whitespace.
        ) # Note: The remaining regex consumes no text at all...
        (?=             # Ensure we are not in a blacklist tag.
          [^<]*+        # Either zero or more non-"<" {normal*}
          (?:           # Begin {(special normal*)*} construct
            <           # or a < starting a non-blacklist tag.
            (?!/?(?:textarea|pre|script)\b)
            [^<]*+      # more non-"<" {normal*}
          )*+           # Finish "unrolling-the-loop"
          (?:           # Begin alternation group.
            <           # Either a blacklist start tag.
            (?>textarea|pre|script)\b
          | \z          # or end of file.
          )             # End alternation group.
        )  # If we made it here, we are not in a blacklist tag.
        %Six';
    $text = preg_replace($re, " ", $text);
    if ($text === null) exit("PCRE Error! File too big.\n");
    return $text;
}
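
If exiting the script is too drastic for your application, one alternative is to fall back to the unminified HTML and log why PCRE gave up. This is only a sketch, not part of the tested solution above: the function name collapseOrFallback is hypothetical, and it takes the pattern as a parameter so that it can wrap either regex.

```php
<?php
// Hypothetical wrapper: run a whitespace-collapsing pattern and, if PCRE
// fails, return the original text unchanged instead of terminating.
function collapseOrFallback(string $pattern, string $text): string
{
    $result = preg_replace($pattern, ' ', $text);
    if ($result !== null) {
        return $result;
    }

    // preg_replace() returns null on error; preg_last_error() says why.
    switch (preg_last_error()) {
        case PREG_RECURSION_LIMIT_ERROR:
            $reason = 'recursion limit reached';
            break;
        case PREG_BACKTRACK_LIMIT_ERROR:
            $reason = 'backtrack limit reached';
            break;
        default:
            $reason = 'PCRE error code ' . preg_last_error();
    }
    error_log("HTML minification skipped: $reason");

    return $text;
}
```

Serving a slightly larger page is usually preferable to serving an error message, but whether that trade-off suits you depends on your setup.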

p.s. I am intimately familiar with this PHP/Apache seg-fault problem, as I was involved with helping the Drupal community while they were wrestling with this issue. See: Optimize CSS option causes php cgi to segfault in pcre function "match". We also experienced this with the BBCode parser on the FluxBB forum software project.

Hope this helps.

ridgerunner
  • Wow, that was quite an in-depth analysis; I didn't know all these details. Thanks a lot, I will try your regex. – Aman Mar 17 '11 at 04:41
  • could i have the test file that you were using ? – Aman Mar 17 '11 at 10:28
  • @Aman Yes, but it will be some time before I post it (the file is an article in progress (in HTML)...) – ridgerunner Mar 18 '11 at 05:02
  • am I the only one who gets an error 324 when render this regex via php? My error log says: child pid 4736 exit signal Segmentation fault (11) ?? :S – william Jul 26 '12 at 15:46
  • 2
    @william - "render"? error 324 from what - `httpd.exe`? `php.exe`? Will need more information to proceed. First try setting `pcre.recursion_limit` to 524 (the script currently sets it to 16777). Just comment out the one line and uncomment the other. – ridgerunner Jul 26 '12 at 17:36
  • Oh sorry - From apache server. At least I get the error info: "PCRE Error! File too big.". – william Jul 26 '12 at 17:48
  • @ridgerunner I'm using this code in a c++ project I'm working on. Some small changes and then it worked just fine. But I notice that we end up with "> <". Do you think it's wise to extend the already existing regex to prevent this in some way, or would you run a new one after the first regex has occurred? – superhero Apr 21 '13 at 16:02
  • @Erik Landvall - without seeing the changes you made, there is no way for me to answer your question. (Also, the solution above is PHP and you say you are using C++). Maybe you can post a new question specific to the issues you are having. Note that I don't have access to a C++ compiler that does regex, so I won't be able to help much. – ridgerunner Apr 21 '13 at 17:00
  • @ridgerunner I usually code in PHP but trying out something new :) If you would be so kind and have a look at: http://stackoverflow.com/q/16134469/570796 - The difference shouldn't be overwhelming. – superhero Apr 21 '13 at 18:10
  • @Erik Landvall - Ok, I'll take a look, but I'm busy today. However, after a quick glance just now, I'd have to agree that a proper HTML parser would be your best solution. – ridgerunner Apr 21 '13 at 19:35
  • @ridgerunner Yeah, I'm looking into a parser by Boost as we speak. The original issue about the > < combination has already been explained though! – superhero Apr 22 '13 at 06:21
  • Hi, very nice script. I've just implemented into my project. Would you say this is a good practice to minify HTML like this? Or would the server load be heavier, thus slowing down the entire process anyway? My pages are very very long. I don't get the memory warning limit though... What do you think? – chocolata Feb 05 '14 at 21:29
  • 1
    @maartenmachiels - Sorry but I can't offer you an opinion one way or the other. If you do use regex, be sure to read and take safeguards as recommended in [my answer to a similar question](http://stackoverflow.com/a/7627962/433790). Stack overflows and silent crashing of executables is not good! – ridgerunner Feb 06 '14 at 16:28
  • How to add code to remove js comments in this RegEx? – Umair Hamid Jan 28 '15 at 12:32
  • Be careful. This minification will break JS code that is wrapped inside a CDATA wrapper. The example should be extended to exclude not only pre and textarea but additionally the CDATA blocks. – Jürgen Hörmann Feb 25 '19 at 09:08
3

I implemented the answer from @ridgerunner in two projects, and ended up hitting some severe slowdowns (10-30 second request times) in staging for one of the projects. I found out that I had to set both pcre.recursion_limit and pcre.backtrack_limit quite low for it to even work, but even then it would give up after about 2 seconds of processing and return false.

Since then, I've replaced it with this solution (with an easier-to-grasp regex), which is inspired by the outputfilter.trimwhitespace function from Smarty 2. It does no backtracking or recursion, and works every time (instead of catastrophically failing once in a blue moon):

function filterHtml($input) {
    // Remove HTML comments, but not SSI
    $input = preg_replace('/<!--[^#](.*?)-->/s', '', $input);

    // The content inside these tags will be spared:
    $doNotCompressTags = ['script', 'pre', 'textarea'];
    $matches = [];

    foreach ($doNotCompressTags as $tag) {
        $regex = "!<{$tag}[^>]*?>.*?</{$tag}>!is";

        // It is assumed that this placeholder could not appear organically in your
        // output. If it can, you may have an XSS problem.
        $placeholder = "@@<'-placeholder-$tag'>@@";

        // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
        $input = preg_replace_callback(
            $regex,
            function ($match) use ($tag, &$matches, $placeholder) {
                $matches[$tag][] = $match[0];
                return $placeholder;
            },
            $input
        );
    }

    // Remove whitespace (spaces, newlines and tabs)
    $input = trim(preg_replace('/[ \n\t]+/m', ' ', $input));

    // Iterate the blocks we replaced with placeholders beforehand, and replace the placeholders
    // with the original content.
    foreach ($matches as $tag => $blocks) {
        $placeholder = "@@<'-placeholder-$tag'>@@";
        $placeholderLength = strlen($placeholder);
        $position = 0;

        foreach ($blocks as $block) {
            $position = strpos($input, $placeholder, $position);
            if ($position === false) {
                throw new \RuntimeException("Found too many placeholders of type $tag in input string");
            }
            $input = substr_replace($input, $block, $position, $placeholderLength);
        }
    }

    return $input;
}
3

Sorry for not commenting, reputation missing ;)

I want to urge everybody not to implement such a regex without checking for performance penalties. Shopware implemented the first regex (from Alan/ridgerunner) for their HTML minification and "blew up" every shop with bigger pages.
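
One rough way to check for such penalties before deploying is to time the minifier on page-sized input. The harness below is a sketch under stated assumptions: the name timeMinifier, the synthetic page, and the naive stand-in minifier are all made up for illustration, and real templates from your own site will be far more representative.

```php
<?php
// Hypothetical harness: time a minifier callable on a given HTML string and
// return the best wall-clock time (in seconds) over a few runs.
function timeMinifier(callable $minify, string $html, int $runs = 5): float
{
    $best = INF;
    for ($i = 0; $i < $runs; $i++) {
        $start = microtime(true);
        $minify($html);
        $best = min($best, microtime(true) - $start);
    }
    return $best;
}

// Synthetic large page, used only as a placeholder for real output.
$page = str_repeat("<div>\n    <p>Some   text</p>\n</div>\n", 20000);

// Stand-in minifier; swap in the function you actually intend to use.
$naive = function ($html) {
    return preg_replace('/\s+/', ' ', $html);
};

printf("naive collapse: %.4f s for %d bytes\n", timeMinifier($naive, $page), strlen($page));
```

Running it with several page sizes gives a first impression of how the cost grows. It does not replace testing under production-like load.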

For complex problems, a combined solution (regex plus some other logic) is, where possible, usually faster and more maintainable (unless you are Damian Conway).

I also want to mention that most minifiers can break your code (JavaScript and HTML) when a script block itself contains another script block, for example one written via document.write.

Attached is my solution (an optimized version of user2677898's snippet). I simplified the code and ran some tests. Under PHP 7.2 my version was ~30% faster for my particular test case. Under PHP 7.3 and 7.4 the old variant gained a lot of speed and is only ~10% slower. My version is also still more maintainable because the code is less complex.

function filterHtml($content)
{
    // List of untouchable HTML-tags.
    $unchanged = 'script|pre|textarea';

    // It is assumed that this placeholder could not appear organically in your
    // output. If it can, you may have an XSS problem.
    $placeholder = "@@<'-pLaChLdR-'>@@";

    // Some helper variables.
    $unchangedBlocks  = [];
    $unchangedRegex   = "!<($unchanged)[^>]*?>.*?</\\1>!is";
    $placeholderRegex = "!$placeholder!";

    // Replace all the tags (including their content) with a placeholder, and keep their contents for later.
    $content = preg_replace_callback(
        $unchangedRegex,
        function ($match) use (&$unchangedBlocks, $placeholder) {
            array_push($unchangedBlocks, $match[0]);
            return $placeholder;
        },
        $content
    );

    // Remove HTML comments, but not SSI
    $content = preg_replace('/<!--[^#](.*?)-->/s', '', $content);

    // Remove whitespace (spaces, newlines and tabs)
    $content = trim(preg_replace('/[ \n\t]{2,}|[\n\t]/m', ' ', $content));

    // Replace the placeholders with the original content.
    $content = preg_replace_callback(
        $placeholderRegex,
        function ($match) use (&$unchangedBlocks) {
            // Paranoia check: there should never be more placeholders than stored blocks.
            if (count($unchangedBlocks) == 0) {
                throw new \RuntimeException("Found too many placeholders in input string");
            }
            return array_shift($unchangedBlocks);
        },
        $content
    );

    return $content;
}
Sadrak
  • You are close. Keep this one open and change it to a comment when you can. I gave you an upvote so you only will need one more when you get another. Then change this to a comment and delete your answer that isn't an answer please. ;-) – Rodger Mar 12 '20 at 23:11