I have a Regex-based whitespace filter in an ASP.NET MVC application, and it works perfectly, too perfectly. One of the things it filters out is the \r\n line breaks. This effectively puts the entire page source on one line, which I love because I don't have to deal with quirky CSS caused by whitespace, but in certain cases I need to keep them. One example is when I want to literally display text with line breaks in it, such as a note.

To do so, I would obviously wrap it in <pre></pre> tags, but because of the filter, the line breaks in the text between the tags also get scrubbed, which makes a note, for example, rather difficult to read.

Can anyone with Regex knowledge (mine is very poor...) help me modify the current Regex so that it ignores text between the <pre> tags?

Here's the current code:

using System.IO;
using System.Text;
using System.Text.RegularExpressions;
using System.Web;

public class WhitespaceFilter : MemoryStream {
    private string Source = string.Empty;
    private Stream Filter = null;

    public WhitespaceFilter(HttpResponseBase HttpResponseBase) {
        Filter = HttpResponseBase.Filter;
    }

    public override void Write(byte[] buffer, int offset, int count) {
        // Decode only the bytes that belong to this write, not the whole buffer.
        Source = UTF8Encoding.UTF8.GetString(buffer, offset, count);

        // Strip tabs and line breaks. This is the step that also flattens <pre> content.
        Source = new Regex("\\t", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, string.Empty);
        Source = new Regex(">\\r\\n<", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, "><");
        Source = new Regex("\\r\\n", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, string.Empty);

        // Remove runs of double spaces until none are left.
        while (new Regex("  ", RegexOptions.Compiled | RegexOptions.Multiline).IsMatch(Source)) {
            Source = new Regex("  ", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, string.Empty);
        }

        // Drop whitespace between adjacent tags, then strip HTML comments.
        Source = new Regex(">\\s<", RegexOptions.Compiled | RegexOptions.Multiline).Replace(Source, "><");
        Source = new Regex("<!--.*?-->", RegexOptions.Compiled | RegexOptions.Singleline).Replace(Source, string.Empty);

        // The filtered bytes live in a new array, so write it from index 0.
        byte[] output = UTF8Encoding.UTF8.GetBytes(Source);
        Filter.Write(output, 0, output.Length);
    }
}

Thanks in advance!

Gup3rSuR4c
  • Why not simply send the response back compressed? All modern browsers support that anyhow, and HTML compresses really well. This looks to me like premature optimisation, unless there is another rationale behind this. – exhuma Oct 16 '09 at 22:45
  • Yes, true, but there's a flaw: IIS 6 doesn't play nice with MVC, so dynamic pages (pretty much all of them) don't get compressed. Second, and this is a big one for me, with the whitespace stripped I don't have to deal with the "bugs" CSS has in handling whitespace. For example, any run of whitespace acts as a single space, so paddings, margins and whatnot are affected... – Gup3rSuR4c Oct 16 '09 at 23:03

1 Answer

There are tools like htmlcompressor already out there to strip whitespace. And like exhuma said, if this is for web optimization then gzip compression would help more than anything if you configured it on the web server.

As for your original question, there are a lot of different ways to do this. You could also attack the problem with something like XPath (if the HTML is valid XHTML) and then combine that with regex. But I figured I'd try my hand at writing a single regex to do it:

(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|[\n\r]

It seems to work for me. Fortunately .NET has an extremely powerful regex engine including a very cool balanced matching feature. I can't explain it any better than Ryan Byington can. But the idea is to match the beginning and ending pre tags first and make sure everything inside is untouched. Then everything around those pre tags gets the rest of the regex applied, "[\n\r]".
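For readers new to balancing groups, here is a minimal sketch of just that part of the pattern, with the <pre> tags removed and anchors added so it tests a whole string. The class name `BalancedDemo` and the method `IsBalanced` are made up for this illustration. `(?<Open><)` pushes a capture for each `<`, `(?<Close-Open>>)` pops one for each `>`, and `(?(Open)(?!))` forces a failure if anything is left on the stack.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BalancedDemo
{
    // The balancing core of the answer's pattern, anchored to the whole input.
    private static readonly Regex Balanced = new Regex(
        @"^[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))$");

    // Hypothetical helper: true when every '<' has a matching '>'.
    public static bool IsBalanced(string text)
    {
        return Balanced.IsMatch(text);
    }

    public static void Main()
    {
        Console.WriteLine(IsBalanced("<a<b>>")); // True
        Console.WriteLine(IsBalanced("<a<b>"));  // False: one '<' is never closed
    }
}
```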

To make this work you'd simply do this:

Source = new Regex("(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|[\n\r]", 
  RegexOptions.Compiled | RegexOptions.Singleline).Replace(Source, "$1");

Note the $1 at the end. This is the part that grabs the results from inside the pre tags and returns them untouched.
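To try this on a small input, a console sketch along these lines should do. The class name `PreSafeDemo`, the method `StripLineBreaks` and the sample HTML are invented for the example; only the pattern and the `$1` replacement come from the answer. Note that the pattern only recognises a bare `<pre>` tag without attributes.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PreSafeDemo
{
    // Group 1 captures a whole <pre>...</pre> block; the second
    // alternative matches a single line-break character.
    private static readonly Regex LineBreaks = new Regex(
        "(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|[\n\r]",
        RegexOptions.Compiled);

    // Hypothetical helper, not part of the original filter class.
    public static string StripLineBreaks(string html)
    {
        // "$1" re-emits a matched <pre> block unchanged. For a bare line
        // break, group 1 did not participate, so "$1" expands to nothing.
        return LineBreaks.Replace(html, "$1");
    }

    public static void Main()
    {
        string html = "<div>\r\n<p>note</p>\r\n<pre>line one\r\nline two</pre>\r\n</div>";
        // Line breaks outside <pre> disappear; the one inside survives.
        Console.WriteLine(StripLineBreaks(html));
    }
}
```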

Then after that write another line to replace \s\s+ with a single space. I think that should work pretty well.
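One caveat with that last step: a plain \s\s+ replace over the whole string would also hit the \r\n pairs and indentation that the first regex just preserved inside <pre>. A possible way around it, sketched here with the same alternation trick, is to let group 1 pass <pre> blocks through and collapse whitespace runs only elsewhere. The names `WhitespaceCollapser` and `Collapse` are hypothetical, and this is an untuned sketch, not a drop-in part of the filter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WhitespaceCollapser
{
    // First alternative: a whole <pre>...</pre> block (group 1).
    // Second alternative: any run of two or more whitespace characters.
    private static readonly Regex Runs = new Regex(
        @"(<pre>[^<>]*(((?<Open><)[^<>]*)+((?<Close-Open>>)[^<>]*)+)*(?(Open)(?!))</pre>)|\s\s+",
        RegexOptions.Compiled);

    public static string Collapse(string html)
    {
        // Keep <pre> blocks verbatim; replace every other match with one space.
        return Runs.Replace(html, m => m.Groups[1].Success ? m.Value : " ");
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("<p>a   b</p>  <pre>x\r\n  y</pre>"));
    }
}
```

Because a <pre> block is consumed as a single match, the whitespace inside it is never offered to the \s\s+ alternative.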

Sedat Kapanoglu
Steve Wortham
  • Holy awesomeness, you are awesome! I really have no idea what that says except bits and pieces here and there, but you essentially just took 5 of my Regexes and put them in one and did the `<pre>` fix. You're awesome! Thanks for the assist! – Gup3rSuR4c Oct 16 '09 at 23:17
  • You're welcome. ;) Most people will tell you that regular expressions suck for parsing HTML because of its nested nature, and they can be right. But I just learned about the balanced matching feature in the .NET regex engine recently. Powerful stuff, that. – Steve Wortham Oct 16 '09 at 23:21