Regex replace whitespace in HTML document

Question

I have seen many similar questions but still haven't found an answer.
What should a regex look like that removes all whitespace (including newlines) in HTML but ignores the tags?

Currently I use Regex.Replace(content, @"\s+", ""); but it also removes spaces in the JavaScript on the page, and then the page stops working.

Thank you.

EDIT: After some questions in the responses, here are a few more details. I'm writing an HTTP module that "minifies" the HTML output of our site. The site has very dynamic content that comes from many different sources. The final goal is to reduce page size and network traffic. It's a highly loaded web site, so this is important to us.

We already use the MbCompression library for JS and CSS minification, but it does not support minifying HTML output (at least I haven't found a way).

Have a look [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), a famous SO question — Jashwant, Oct 15 '12 at 13:45
Why not [GZIP](http://stackoverflow.com/questions/552317/how-to-implement-gzip-compression-in-asp-net) instead? — jrummell, Oct 15 '12 at 14:03
@jrummell We are using it, but we remove the whitespace before compression, and in addition compression is not always supported. — Alex Dn, Oct 15 '12 at 14:12
Removing redundant whitespace before compression saves very little. It would be better to not produce it at all, but removing it after the fact when you then go ahead and gzip anyway will not save you any measurable amount. — perh, Oct 15 '12 at 15:12
@perh I agree that it saves very little, but it is the requirement that I got from my boss. — Alex Dn, Oct 15 '12 at 15:32

score 2 · Answer 1 · edited Oct 15 '12 at 13:43

There is really no way to write a single (reasonable) regexp to do this, especially not if you want to support JavaScript and CSS. You need a real parser.

answered Oct 15 '12 at 13:38 by perh · edited Oct 15 '12 at 13:43 by Michal Klouda

Can you advise any parser that can do it? – Alex Dn Oct 15 '12 at 13:58
http://htmlagilitypack.codeplex.com/ perhaps? Parse the HTML into a DOM tree, and then do the whitespace trimming on textnodes. – perh Oct 15 '12 at 15:44
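
The parse-then-trim idea from the comment above could look roughly like the sketch below. It assumes the Html Agility Pack NuGet package is referenced. The class name `HtmlWhitespace`, the method name `Collapse`, and the list of preserved elements (including `pre`, which the thread does not mention) are illustrative assumptions, not part of the library. It collapses each whitespace run to one space rather than deleting it, so adjacent words stay separated.

```csharp
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // third-party package, assumed to be installed

static class HtmlWhitespace
{
    // Assumption: elements whose text must be left exactly as written.
    static readonly string[] Preserve = { "script", "style", "textarea", "pre" };

    public static string Collapse(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Materialize the text nodes up front, then rewrite them one by one.
        var textNodes = doc.DocumentNode.Descendants()
            .OfType<HtmlTextNode>()
            .ToList();

        foreach (var node in textNodes)
        {
            // Skip text that lives inside a preserved element.
            if (node.Ancestors().Any(a => Preserve.Contains(a.Name)))
                continue;

            // Collapse every run of whitespace (newlines included) to one space.
            node.Text = Regex.Replace(node.Text, @"\s+", " ");
        }

        return doc.DocumentNode.OuterHtml;
    }
}
```

Calling `HtmlWhitespace.Collapse(content)` from the HTTP module would then replace the plain `Regex.Replace` call. Whether parsing every response is affordable on a highly loaded site is something to measure.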

score 1 · Answer 2 · answered Oct 15 '12 at 13:47

What's your goal? Browsers ignore a lot of whitespace when rendering pages so I'm guessing you want to clean up your source code. If so, check if the program you use offers some solution to this. For example Dreamweaver has a tool to reformat source code.

Tidy could be one option but it looks like it's a bit more than a simple code formatting tool.

score 1 · Accepted Answer · answered Oct 15 '12 at 13:48

If you can find a decent HTML parser, I would do it via DOM manipulation. If you can't, then something like

Regex.Replace(content, "(?i)(<script(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</script\\s*>|<style(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</style\\s*>|<textarea(?:[^>\"']|\"[^\"]*\"|'[^']*')*>[\\s\\S]*?</textarea\\s*>|</?[a-z](?:[^>\"']|\"[^\"]*\"|'[^']*')*>|[^\\s<]+)|\\s+", "$1");

should do it. It will not remove spaces inside tags, inside embedded JS or CSS, or inside textareas, but it will remove whitespace, including newlines, in text nodes.
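
The same tokenizing idea can also be written with a match evaluator, which may be easier to read and maintain. This is only a sketch using `System.Text.RegularExpressions`. The names `RegexMinifier`, `Attrs`, `Token`, and `Minify` are made up for illustration, and HTML comments and CDATA sections are not handled.

```csharp
using System.Text.RegularExpressions;

static class RegexMinifier
{
    // Attribute section of a tag: plain characters or complete quoted values.
    const string Attrs = @"(?:[^>""']|""[^""]*""|'[^']*')*";

    // "keep" captures tokens that must be copied verbatim: raw-text elements
    // together with their bodies, and any other tag. The last alternative
    // matches the whitespace runs that get dropped.
    static readonly Regex Token = new Regex(
        @"(?<keep><(?<tag>script|style|textarea)\b" + Attrs + @">[\s\S]*?</\k<tag>\s*>" +
        @"|</?[a-z]" + Attrs + @">)|\s+",
        RegexOptions.IgnoreCase);

    public static string Minify(string html)
    {
        // Keep protected tokens as they are; replace whitespace with nothing.
        return Token.Replace(html, m => m.Groups["keep"].Success ? m.Value : "");
    }
}
```

For example, `RegexMinifier.Minify("<p> a  b </p>\n<script>var x = 1;  var y = 2;</script>")` returns `<p>ab</p><script>var x = 1;  var y = 2;</script>`: the script body is untouched, while the words in the paragraph run together because whitespace is removed rather than collapsed.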

answered Oct 15 '12 at 13:48 by Mike Samuel

As I'm thinking now, we also use HtmlDocument from AgilityPack. Do you know if it supports such option? – Alex Dn Oct 15 '12 at 14:00
@AlexDn, http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack suggests that `htmlDoc.DocumentNode.SelectSingleNode("//body")` will get you the body, and then you can traverse that to find all text nodes not inside ` – Mike Samuel Oct 15 '12 at 19:14
Ok, thanks, looks like I will use the solution with HtmlDocument traverse. – Alex Dn Oct 16 '12 at 07:31

score 0 · Answer 4 · edited Oct 15 '12 at 13:45

Regex.Replace(document.body.innerHTML, @"\s+", "");

using document.body.innerHTML instead may work. I am not sure.

answered Oct 15 '12 at 13:40 by mmuratusta · edited Oct 15 '12 at 13:45 by Jashwant

I need it in C# (server side) – Alex Dn Oct 15 '12 at 13:59

score 0 · Answer 5 · answered Oct 15 '12 at 13:41

Surely you should be replacing it with a space at least, not just removing whitespace entirely. For HTML that should be fine, but if you have strings in JavaScript with multiple spaces that must not be collapsed, then you need another method, since a regular expression can't easily work out whether it is inside a script, inside a string, etc.
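
To make the difference concrete, here is a minimal sketch contrasting the two replacements. The class and method names are only illustrative.

```csharp
using System.Text.RegularExpressions;

static class WhitespaceDemo
{
    // Deletes every whitespace run, so neighbouring words are glued together.
    public static string Remove(string s)
    {
        return Regex.Replace(s, @"\s+", "");
    }

    // Replaces every whitespace run with one space, so word boundaries survive.
    public static string Collapse(string s)
    {
        return Regex.Replace(s, @"\s+", " ");
    }
}
```

With the input `"Hello \n  world"`, `Remove` gives `"Helloworld"` while `Collapse` gives `"Hello world"`, which is how a browser would render the original text anyway.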

That having been said, I'm not sure there is a good reason to do this. If you are worried about the size of the file, then just tell your server to use compression, which I suspect every browser supports well enough by now; the pages will basically be zipped by the server and unzipped on the client. It's a bit more work for the server, so it depends on whether you care more about bandwidth or CPU.
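
If you want to check how much whitespace stripping still saves once compression is on, a small helper like the following could be used to compare sizes. It is a sketch built on `System.IO.Compression.GZipStream`; the names `GzipDemo` and `Compress` are hypothetical, and it measures a string in memory rather than hooking into the response pipeline.

```csharp
using System.IO;
using System.IO.Compression;
using System.Text;

static class GzipDemo
{
    // Returns the gzip-compressed bytes of the given text (UTF-8 encoded).
    public static byte[] Compress(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var output = new MemoryStream())
        {
            // Dispose the GZipStream before reading the buffer so that
            // the final block and the gzip trailer are written out.
            using (var gzip = new GZipStream(output, CompressionMode.Compress))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return output.ToArray();
        }
    }
}
```

Comparing `GzipDemo.Compress(page).Length` with the compressed length of the whitespace-stripped page, on real pages from the site, shows whether the extra minification step is worth its CPU cost.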

score 0 · Answer 6 · answered Oct 15 '12 at 16:40

Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

There are risks related to tags, unclosed tags etc. I hope you have some control over the 'dynamic content that comes from different sources' as you've put it. I also hope that you've tried everything else and this comes as a last resort.
