I have seen many similar questions but still have not found an answer.
What should a regex look like that removes all whitespace (including newlines) from HTML but leaves the tags themselves untouched?

Currently I use Regex.Replace(content, @"\s+", ""); but it also removes the spaces in the JavaScript on the page, and then the page stops working.

Thank you.

EDIT: After some questions in the responses, here are a few more details: I am writing an HTTP module that "minifies" the HTML output of our site. We have a web site with very dynamic content that comes from many different sources. The final goal is to reduce page size and network traffic. It's a heavily loaded web site, so this is important to us.

We already use the MbCompression library for JS and CSS minification, but it does not support minifying HTML output (at least I didn't find such an option).

Alex Dn
  • Are you asking about JavaScript, or C#? – Mike Samuel Oct 15 '12 at 13:42
  • Have a look [here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags), a famous SO question – Jashwant Oct 15 '12 at 13:45
  • Why not [GZIP](http://stackoverflow.com/questions/552317/how-to-implement-gzip-compression-in-asp-net) instead? – jrummell Oct 15 '12 at 14:03
  • @jrummell We do use it, but we remove the whitespace before compression, and in addition compression is not always supported. – Alex Dn Oct 15 '12 at 14:12
  • Removing redundant whitespace before compression saves very little. It would be better to not produce it at all, but removing it after the fact when you then go ahead and gzip anyway will not save you any measurable amount. – perh Oct 15 '12 at 15:12
  • @perh I agree that it saves very little, but it is the requirement I got from my boss. – Alex Dn Oct 15 '12 at 15:32

6 Answers

2

There is really no way to write a single (reasonable) regexp to do this, especially not if you want to support JavaScript and CSS. You need a real parser.

Michal Klouda
perh
1

What's your goal? Browsers ignore a lot of whitespace when rendering pages so I'm guessing you want to clean up your source code. If so, check if the program you use offers some solution to this. For example Dreamweaver has a tool to reformat source code.

Tidy could be one option but it looks like it's a bit more than a simple code formatting tool.

ZZ-bb
1

If you can find a decent HTML parser, I would do it via DOM manipulation. If you can't, then something like

Regex.Replace(content, @"(?i)(<script(?:[^>""']|""[^""]*""|'[^']*')*>[\s\S]*?</script\s*>|<style(?:[^>""']|""[^""]*""|'[^']*')*>[\s\S]*?</style\s*>|<textarea(?:[^>""']|""[^""]*""|'[^']*')*>[\s\S]*?</textarea\s*>|</?[a-z](?:[^>""']|""[^""]*""|'[^']*')*>|[^\s<]+)|\s+", "$1");

should do it. It will not remove spaces inside tags or inside embedded JS, CSS, or inside textareas but will remove newlines in text nodes.
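The same tokenising idea can be written out a little more readably. This is only a sketch under stated assumptions, not a replacement for a parser: the class name `HtmlWhitespaceSketch`, the method `Minify`, and the decision to also protect `pre` are illustrative choices, and HTML comments or malformed markup are not handled specially.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class HtmlWhitespaceSketch
{
    // Everything up to the closing '>' of a tag, skipping over quoted attribute values.
    private const string Attrs = @"(?:[^>""']|""[^""]*""|'[^']*')*";

    // "keep" captures either a whole raw-text element (script, style, textarea, pre)
    // or any single tag. Whitespace runs outside those are matched by the last alternative.
    private static readonly Regex Token = new Regex(
        @"(?<keep><(?<name>script|style|textarea|pre)\b" + Attrs + @">[\s\S]*?</\k<name>\s*>"
        + @"|</?[a-z]" + Attrs + @">)"
        + @"|\s+",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    // Replaces whitespace runs in text with 'replacement' (empty by default),
    // leaving tags and protected elements exactly as they were.
    public static string Minify(string html, string replacement = "")
    {
        return Token.Replace(html, m => m.Groups["keep"].Success ? m.Value : replacement);
    }
}
```

Calling `HtmlWhitespaceSketch.Minify(html, " ")` collapses each run to a single space instead of deleting it, which is usually the safer choice for visible text.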

Mike Samuel
  • Now that I think about it, we also use HtmlDocument from AgilityPack. Do you know if it supports such an option? – Alex Dn Oct 15 '12 at 14:00
  • @AlexDn, http://stackoverflow.com/questions/846994/how-to-use-html-agility-pack suggests that `htmlDoc.DocumentNode.SelectSingleNode("//body")` will get you the body, and then you can traverse that to find all text nodes not inside ` – Mike Samuel Oct 15 '12 at 19:14
  • OK, thanks, it looks like I will go with traversing the HtmlDocument. – Alex Dn Oct 16 '12 at 07:31
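
The traversal discussed in these comments might look roughly like the sketch below. It assumes the HtmlAgilityPack NuGet package is referenced. The class name `AgilityPackSketch`, the method `CollapseTextNodes`, and the list of preserved elements are illustrative and do not come from the thread. It collapses whitespace runs to a single space rather than deleting them.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text.RegularExpressions;
using HtmlAgilityPack; // NuGet package, not part of the .NET base library

// Hypothetical helper, named here for illustration only.
public static class AgilityPackSketch
{
    // Elements whose text content must be left exactly as written (an assumption of this sketch).
    private static readonly HashSet<string> Preserve =
        new HashSet<string>(StringComparer.OrdinalIgnoreCase) { "script", "style", "textarea", "pre" };

    public static string CollapseTextNodes(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Take a snapshot of the text nodes first so that editing does not disturb enumeration.
        List<HtmlTextNode> textNodes =
            doc.DocumentNode.Descendants().OfType<HtmlTextNode>().ToList();

        foreach (HtmlTextNode node in textNodes)
        {
            // Skip text that lives inside a preserved element.
            if (node.Ancestors().Any(a => Preserve.Contains(a.Name)))
                continue;

            node.Text = Regex.Replace(node.Text, @"\s+", " ");
        }

        // Serialise the modified tree back to a string.
        using (var writer = new StringWriter())
        {
            doc.Save(writer);
            return writer.ToString();
        }
    }
}
```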
0
Regex.Replace(document.body.innerHTML, @"\s+", "");

Using document.body.innerHTML instead may work. I am not sure.

Jashwant
mmuratusta
0

Surely you should replace it with at least a single space, not remove whitespace entirely. For HTML that should be fine, but if your JavaScript contains strings with multiple spaces that must not be collapsed, you need another method, since a regular expression cannot easily work out whether it is inside a script, inside a string, and so on.

That having been said, I'm not sure there is a good reason to do this. If you are worried about the size of the file, just tell your server to use compression, which I suspect every browser supports well enough by now; the pages are basically zipped by the server and unzipped on the client. It's a bit more work for the server, so it depends on whether you care more about bandwidth or CPU.
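
To check how much whitespace stripping actually saves once compression is on, the gzip-compressed sizes can be measured directly. This is a rough sketch: the class name `GzipSizeSketch`, the method names, and the simple `>\s+<` stripping rule are assumptions made for illustration, and the numbers depend entirely on the real pages.

```csharp
using System;
using System.IO;
using System.IO.Compression;
using System.Text;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class GzipSizeSketch
{
    // Returns the number of bytes the text occupies after UTF-8 encoding and gzip compression.
    public static int GzipSize(string text)
    {
        byte[] raw = Encoding.UTF8.GetBytes(text);
        using (var buffer = new MemoryStream())
        {
            // leaveOpen keeps the MemoryStream usable after the GZipStream is flushed and closed.
            using (var gzip = new GZipStream(buffer, CompressionMode.Compress, true))
            {
                gzip.Write(raw, 0, raw.Length);
            }
            return (int)buffer.Length;
        }
    }

    // Compares raw and gzipped sizes before and after removing whitespace between tags.
    public static string Compare(string html)
    {
        string stripped = Regex.Replace(html, @">\s+<", "><");
        return string.Format(
            "raw: {0} -> {1} bytes, gzipped: {2} -> {3} bytes",
            Encoding.UTF8.GetByteCount(html),
            Encoding.UTF8.GetByteCount(stripped),
            GzipSize(html),
            GzipSize(stripped));
    }
}
```

Running `GzipSizeSketch.Compare` on a few representative pages shows whether the extra step is worth the CPU time.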

Chris
0
Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);

There are risks related to tags, unclosed tags etc. I hope you have some control over the 'dynamic content that comes from different sources' as you've put it. I also hope that you've tried everything else and this comes as a last resort.
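
One of those risks is easy to reproduce: trimming whitespace on both sides of every tag also joins words that are separated only by an inline tag. The sketch below wraps the pattern in a hypothetical helper (`TagTrimSketch.TrimAroundTags`, a name chosen here for illustration) so the effect can be tried on sample input.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper, named here for illustration only.
public static class TagTrimSketch
{
    // Removes whitespace immediately before and after anything that looks like a tag.
    // Singleline only changes how '.' matches, so it has no effect on this pattern.
    public static string TrimAroundTags(string html)
    {
        return Regex.Replace(html, @"\s*(<[^>]+>)\s*", "$1", RegexOptions.Singleline);
    }
}
```

For example, `TagTrimSketch.TrimAroundTags("Hello <b>world</b> again")` returns `Hello<b>world</b>again`, so the rendered text loses its word spacing.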

sainiuc