Remove whitespace from the entire HTML except inside pre elements, using regular expressions

Question

On ASP.NET MVC 3, I created an action filter that removes whitespace from the entire HTML output. It works as I expected most of the time, but now I need to change the regex so that it does not touch the contents of pre elements.

I got the regex logic from Mads Kristensen's blog, and I am not sure how to modify it for this purpose.

Here is the logic:

public override void Write(byte[] buffer, int offset, int count) {

    // Decode the bytes being written into a string.
    string HTML = Encoding.UTF8.GetString(buffer, offset, count);

    // Remove everything the regex matches.
    Regex reg = new Regex(@"(?<=[^])\t{2,}|(?<=[>])\s{2,}(?=[<])|(?<=[>])\s{2,11}(?=[<])|(?=[\n])\s{2,}");
    HTML = reg.Replace(HTML, string.Empty);

    // Re-encode the result and write it out.
    buffer = System.Text.Encoding.UTF8.GetBytes(HTML);
    this.Base.Write(buffer, 0, buffer.Length);
}

Whole code of the filter:

https://github.com/tugberkugurlu/MvcBloggy/blob/master/src/MvcBloggy.Web/Application/ActionFilters/RemoveWhitespacesAttribute.cs

Any idea?

EDIT:

BIG NOTE:

My intention is not to speed up the response time. In fact, this may slow things down. I already GZip the pages, and this minification saves only approx. 4-5 KB per page, which is nothing.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — asawyer, Jan 06 '12 at 19:07
This regex is awful, whoever wrote it didn't really know what they were doing. — Qtax, Jan 06 '12 at 20:33
@tugberk, it's redundant and just wrong in places. Remove the first alternation, remove the 3rd alternation, remove all `[` and `]` characters. You will end up with an equivalent expression, but shorter, faster and cleaner. — Qtax, Jan 06 '12 at 21:09
just out of interest why are you doing this on every request? why not do it when the content is published to the site? — Peter, Jan 06 '12 at 21:37

score 5 · Accepted Answer · answered Jan 06 '12 at 21:01

Parsing HTML with regex is very complicated, and any simple solution could break easily. (Use the right tool for the job.) That being said, I'll show a simple solution.

First I simplified the regex you had to:

(?<=\s)\s+

Replace those matches with an empty string to collapse every run of whitespace down to its first character.
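As a rough illustration of what this simplified expression does, here is a minimal C# sketch. The class and method names (CollapseWhitespaceDemo, Collapse) are made up for this example and are not part of the original filter.

```csharp
using System;
using System.Text.RegularExpressions;

static class CollapseWhitespaceDemo
{
    // Matches whitespace that directly follows another whitespace character,
    // i.e. everything in a run of whitespace except its first character.
    static readonly Regex ExtraWhitespace = new Regex(@"(?<=\s)\s+");

    // Hypothetical helper: removes every match from the given HTML string.
    public static string Collapse(string html)
    {
        return ExtraWhitespace.Replace(html, string.Empty);
    }

    static void Main()
    {
        string input = "<div>\n    <p>a  b</p>\n</div>";
        Console.WriteLine(Collapse(input));
    }
}
```

In this input the indentation after the first line break and the second space in "a  b" are removed, while the line breaks themselves stay, because each one is the first character of its run.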

Assuming there are no < or > characters inside the pre element, you can add (?![^<>]*</pre>) at the end of the expression to make it fail inside pre elements. This negative lookahead makes sure that </pre> does not follow the current match without any tags in between.

Resulting in:

(?<=\s)\s+(?![^<>]*</pre>)
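A small C# sketch of how this expression might be tried out, under the same assumption that the pre content contains no < or > characters. The names PreAwareWhitespaceDemo and Collapse are invented for the example.

```csharp
using System;
using System.Text.RegularExpressions;

static class PreAwareWhitespaceDemo
{
    // Same as (?<=\s)\s+, but the negative lookahead rejects a match when
    // only non-tag characters separate it from a closing </pre>.
    static readonly Regex ExtraWhitespaceOutsidePre =
        new Regex(@"(?<=\s)\s+(?![^<>]*</pre>)");

    // Hypothetical helper: strips extra whitespace outside pre elements.
    public static string Collapse(string html)
    {
        return ExtraWhitespaceOutsidePre.Replace(html, string.Empty);
    }

    static void Main()
    {
        string input = "<p>a  b</p>\n\n<pre>x    y\n\n  z</pre>";
        Console.WriteLine(Collapse(input));
    }
}
```

Here the double space in the paragraph and the blank line before the pre element are collapsed, while the spacing between the pre tags is left alone. If the pre content held a literal < or >, the lookahead would no longer reach </pre> and the whitespace before that character would be stripped as well.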

Qtax


This worked as I expected, thanks! I also understand that it is not the recommended way of doing this. – tugberk Jan 07 '12 at 10:58
Unfortunately this fails if there are < or > inside the pre, which could be common if the pre's are being used to display code. – RobW Apr 23 '12 at 04:15
@RobW, there shouldn't be any, you should encode those as `&lt;` and `&gt;`. – Qtax Apr 26 '12 at 08:03

score 0 · Answer 2 · edited May 23 '17 at 12:09

Please see the epic "RegEx match open tags except XHTML self-contained tags" for all the reasons why regular expressions and HTML don't get along.

If you're using the approach above to make the page size smaller, you should definitely look into IIS compression: most browsers can take advantage of it, and it would be easier than what you're doing now. Here's how to do it in IIS 6 and IIS 7:

http://www.microsoft.com/technet/prodtechnol/WindowsServer2003/Library/IIS/502ef631-3695-4616-b268-cbe7cf1351ce.mspx?mfr=true

http://technet.microsoft.com/en-us/library/cc771003(WS.10).aspx

answered Jan 06 '12 at 19:11

Milimetric


please read the updated question. I asked about one thing, you answered about another. – tugberk Jan 06 '12 at 19:15
+1 @tugberk You said "I am not sure how to modify it (the regex)", and the answer at the given link is "Don't do that." Use the right tool for the right job, and regex is *not* the tool to parse HTML with. – asawyer Jan 06 '12 at 19:39
Didn't mean to start a holy war, sorry you feel I've wasted your time. I'll see if I can answer your updated question in a separate post. – Milimetric Jan 06 '12 at 19:44

score 0 · Answer 3 · answered Jan 06 '12 at 19:48

Maybe break it up into four steps:

1. Extract any matching PRE elements using regex, something simple like "start with <pre>(anything not </pre>)* end with </pre>".
2. Replace each of those matches with a separate GUID and save a dictionary of GUID -> pre element HTML.
3. Take out the whitespace (this won't affect the GUIDs or their placement).
4. Iterate through the dictionary you saved in step 2 and put the pre elements back in the correct spots.
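The four steps above could be sketched in C# roughly as follows. This is only an illustration: it assumes the whole HTML document is available as a single string (not split across several Write calls), the names PrePlaceholderDemo and Collapse are hypothetical, and the whitespace regex is the simplified (?<=\s)\s+ from the accepted answer rather than anything prescribed here.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class PrePlaceholderDemo
{
    // Step 1: a simple, non-greedy match of each pre element.
    static readonly Regex PreElement = new Regex(
        @"<pre\b[^>]*>.*?</pre>",
        RegexOptions.Singleline | RegexOptions.IgnoreCase);

    // Whitespace that directly follows another whitespace character.
    static readonly Regex ExtraWhitespace = new Regex(@"(?<=\s)\s+");

    public static string Collapse(string html)
    {
        var saved = new Dictionary<string, string>();

        // Step 2: swap each pre element for a GUID and remember the original.
        string withPlaceholders = PreElement.Replace(html, m =>
        {
            string key = Guid.NewGuid().ToString("N"); // 32 hex digits, no whitespace
            saved[key] = m.Value;
            return key;
        });

        // Step 3: strip whitespace; the GUID placeholders are not affected.
        string collapsed = ExtraWhitespace.Replace(withPlaceholders, string.Empty);

        // Step 4: put the original pre elements back in place of the GUIDs.
        foreach (KeyValuePair<string, string> pair in saved)
        {
            collapsed = collapsed.Replace(pair.Key, pair.Value);
        }
        return collapsed;
    }

    static void Main()
    {
        string input = "<p>a  b</p>\n\n<pre>if (x <  y)\n\n  z</pre>";
        Console.WriteLine(Collapse(input));
    }
}
```

Because the pre elements are taken out before the whitespace pass, this sketch does not depend on the pre content being free of < or > characters.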
