Minify HTML with Boost regex in C++

Question

How to minify HTML using C++?

Resources

An external library could be the answer, but I'm mainly looking for improvements to my current code, although I'm all ears for other possibilities.

Current code

This is my C++ interpretation of the following answer.

The only parts I had to change from the original post are the flags at the top, "(?ix)", and a few escape characters.

#include <string>
#include <boost/regex.hpp>

void minifyhtml(std::string* s) {
  boost::regex nowhitespace(
    "(?ix)"
    "(?>"           // Match all whitespace runs other than a single space.
    "[^\\S ]\\s*"   // Either one [\t\r\n\f\v] and zero or more ws,
    "| \\s{2,}"     // or two or more consecutive-any-whitespace.
    ")"             // Note: The remaining regex consumes no text at all...
    "(?="           // Ensure we are not in a blacklist tag.
    "[^<]*+"        // Either zero or more non-"<" {normal*}
    "(?:"           // Begin {(special normal*)*} construct
    "<"             // or a < starting a non-blacklist tag.
    "(?!/?(?:textarea|pre|script)\\b)"
    "[^<]*+"        // more non-"<" {normal*}
    ")*+"           // Finish "unrolling-the-loop"
    "(?:"           // Begin alternation group.
    "<"             // Either a blacklist start tag.
    "(?>textarea|pre|script)\\b"
    "| \\z"         // or end of file.
    ")"             // End alternation group.
    ")"             // If we made it here, we are not in a blacklist tag.
  );
  
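  // @todo Don't remove conditional html comments
  // Non-greedy, so each comment is matched on its own rather than everything
  // between the first "<!--" and the last "-->".
  boost::regex nocomments("<!--.*?-->");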
  
  *s = boost::regex_replace(*s, nowhitespace, " ");
  *s = boost::regex_replace(*s, nocomments, "");
}

Only the first regex is from the original post; the second is something I'm still working on and should be considered far from complete. It should hopefully give a good idea of what I'm trying to accomplish, though.
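The @todo about conditional comments could be handled without a regex. The following is only a minimal sketch using the standard library: `strip_comments` is a hypothetical helper (not part of the original post or of Boost), and it assumes that a conditional comment is any comment whose body starts with `[if` or `<![endif]`. It does not know about comment-like text inside `<script>`, `<pre>` or `<textarea>`.

```cpp
#include <string>

// Hypothetical helper for illustration: removes <!-- ... --> comments but
// keeps conditional comments such as <!--[if IE]> ... <![endif]-->.
std::string strip_comments(const std::string& html) {
    std::string out;
    out.reserve(html.size());
    std::size_t pos = 0;
    while (pos < html.size()) {
        const std::size_t open = html.find("<!--", pos);
        if (open == std::string::npos) {
            out.append(html, pos, std::string::npos);  // no more comments
            break;
        }
        out.append(html, pos, open - pos);  // text before the comment
        const std::size_t close = html.find("-->", open + 4);
        if (close == std::string::npos) {
            // Unterminated comment: leave the rest of the input untouched.
            out.append(html, open, std::string::npos);
            break;
        }
        const std::size_t end = close + 3;
        // Assumption: these two prefixes identify conditional comments.
        const bool conditional =
            html.compare(open + 4, 3, "[if") == 0 ||
            html.compare(open + 4, 9, "<![endif]") == 0;
        if (conditional) {
            out.append(html, open, end - open);  // keep it verbatim
        }
        pos = end;
    }
    return out;
}
```

It could replace the `nocomments` step, for example `*s = strip_comments(*s);`.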

There is no such thing as minifying HTML. Every single whitespace character is potentially meaningful, such as within a `<textarea>` or `<pre>` or if the container has `white-space:pre-wrap`. Add in the fact that JavaScript can change this on the fly, and you have absolutely no way of knowing what should be kept and what can be safely removed. At least, not automatically. Manually, sure, you can minify your HTML. — Niet the Dark Absol, Apr 21 '13 at 18:07
@Kolink I knew someone would tell me this :D I'm writing the code though, so I have full awareness of the restrictions it applies. — superhero, Apr 21 '13 at 18:17
Removing the space in “`> <`” isn’t only an error in textarea etc., it also affects the layout in other code (essentially whenever inline tags are involved). If you *really* want to minify HTML, use a proper HTML parser, parse the input properly and write it back out. — Konrad Rudolph, Apr 21 '13 at 18:41
@KonradRudolph Good point on the inline elements, will remove that part then :) — superhero, Apr 21 '13 at 18:44
*Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems* ― attributed to jwz — n. m. could be an AI, Jun 12 '13 at 05:34
Does anyone happen to have a non-PCRE version of this regexp that works in JavaScript? — Mark Knol, Sep 13 '18 at 11:53

score 1 · Accepted Answer · answered Jun 12 '13 at 05:53

Regexps are a powerful tool, but I think that using them in this case is a bad idea. For example, the regexp you provided is a maintenance nightmare. By looking at it, you can't quickly understand what the heck it is supposed to match.

You need an HTML parser that tokenizes the input file and lets you access the tokens either as a stream or as an object tree. Basically, read the tokens, discard the tokens and attributes you don't need, then write what remains to the output. Using something like this would let you develop a solution faster than if you tried to tackle it using regexps.
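The read, discard, write loop described above can be sketched with a hand-rolled tokenizer. This is only an illustration under stated assumptions, not a real HTML parser: `minify_tokens`, `tag_name` and `append_collapsed` are hypothetical names, a tag is assumed to end at the first `>`, and comments, CDATA and `>` inside attribute values are not handled specially. Whitespace runs in text are collapsed to one space rather than removed, and the contents of `<pre>`, `<textarea>` and `<script>` are copied verbatim.

```cpp
#include <cctype>
#include <string>

// Reads the tag name that starts at lower[pos + 1], skipping a leading '/'.
// Expects an already lower-cased string.
static std::string tag_name(const std::string& lower, std::size_t pos) {
    std::size_t i = pos + 1;
    if (i < lower.size() && lower[i] == '/') ++i;
    std::string name;
    while (i < lower.size() &&
           std::isalnum(static_cast<unsigned char>(lower[i]))) {
        name += lower[i++];
    }
    return name;
}

// Appends s[begin, end) to out, collapsing each whitespace run to one space.
static void append_collapsed(std::string& out, const std::string& s,
                             std::size_t begin, std::size_t end) {
    bool in_ws = false;
    for (std::size_t i = begin; i < end; ++i) {
        const unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isspace(c)) {
            if (!in_ws) out += ' ';
            in_ws = true;
        } else {
            out += s[i];
            in_ws = false;
        }
    }
}

std::string minify_tokens(const std::string& html) {
    // Lower-cased copy used only for case-insensitive lookups; ASCII
    // lower-casing keeps byte offsets identical to html.
    std::string lower = html;
    for (char& c : lower) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }

    std::string out;
    std::size_t pos = 0;
    while (pos < html.size()) {
        if (html[pos] != '<') {  // text token: runs up to the next '<'
            std::size_t next = html.find('<', pos);
            if (next == std::string::npos) next = html.size();
            append_collapsed(out, html, pos, next);
            pos = next;
            continue;
        }
        const std::size_t close = html.find('>', pos);  // tag token
        if (close == std::string::npos) {
            out.append(html, pos, std::string::npos);  // malformed tail
            break;
        }
        out.append(html, pos, close - pos + 1);
        const std::string name = tag_name(lower, pos);
        const bool is_start = lower[pos + 1] != '/';
        pos = close + 1;
        if (is_start &&
            (name == "pre" || name == "textarea" || name == "script")) {
            // Copy everything up to the matching end tag unchanged.
            std::size_t end = lower.find("</" + name, pos);
            if (end == std::string::npos) end = html.size();
            out.append(html, pos, end - pos);
            pos = end;
        }
    }
    return out;
}
```

A real parser library would replace the tokenizing part; the point of the sketch is that the whitespace rules become ordinary, readable code instead of one large pattern.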

I think you might be able to use an XML parser, or you could search for an XML parser with HTML support.

In C++, options include libxml (which might have an HTML support module), Qt 4 and tinyxml; libstrophe also uses some kind of XML parser that could work.

Please note that C++ (especially C++03) might not be the best language for this kind of program. Although I strongly dislike Python, it has the "Beautiful Soup" module, which would work very well for this kind of problem.

Qt 4 might work because it provides a decent Unicode string type (and you'll need one if you're going to parse HTML).
