I have an HTML file that contains a lot of whitespace. Is it worth removing this whitespace to reduce the file size before I send it? If so, what would be the quickest way to remove it?

Currently this is all in C#.

Because the HTML in my comment below did not display properly, here it is:

<html>
   <head>
       <title>test title</title>
   </head>
</html>

It is the spacing before the opening tags that I want to remove, if it's worth it.

Colin Pickard
Neil Knight

4 Answers

If there really is a lot of whitespace, removing it is worthwhile: you end up transmitting less over the wire.

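Assuming this is mostly spaces, tabs and carriage returns, I would use a regular expression to replace each run of whitespace with a single space:

using System.Text.RegularExpressions;
// Collapse every run of whitespace (spaces, tabs, line breaks) into one space.
Regex reg = new Regex(@"\s+");
string result = reg.Replace(myHTML, " ");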

This also assumes you are in control of the input HTML, as you shouldn't use regular expressions for parsing HTML.
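If only the indentation before the tags should go, as in the question's example, a narrower pattern may be enough. The following is a minimal sketch, not a definitive implementation: the class name `LeadingWhitespace` and the method `Strip` are hypothetical, and it assumes the indentation consists only of spaces and tabs.

```csharp
using System.Text.RegularExpressions;

public static class LeadingWhitespace
{
    // Matches spaces and tabs at the start of every line.
    // RegexOptions.Multiline makes ^ match after each newline, not only at the start of the string.
    private static readonly Regex Indentation =
        new Regex(@"^[ \t]+", RegexOptions.Multiline);

    // Hypothetical helper: returns the HTML with per-line indentation removed.
    // Line breaks are left untouched.
    public static string Strip(string html)
    {
        return Indentation.Replace(html, string.Empty);
    }
}
```

It could be called as `string result = LeadingWhitespace.Strip(myHTML);`. Like any regex-based approach, it treats the document as plain text, so it would also strip indentation inside elements where whitespace is significant.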

Oded
  • Why should I not use regular expressions on HTML? – Neil Knight Feb 19 '10 at 16:33
  • I didn't say you shouldn't use them, I said you shouldn't _parse_ html with them. See this classic SO answer for details: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 – Oded Feb 19 '10 at 16:39

Do you mean &nbsp;?
If so, use the string.Replace function.

Adir
  • I just meant whitespace in general. It would look something like: etiojhtat but I'm wondering if removing the leading spaces is worth it? – Neil Knight Feb 19 '10 at 16:29

I guess you mean removing the tabs and spaces at the beginning of each line. You can use regular expressions for this. See http://www.regular-expressions.info/examples.html for an example (under 'Trimming Whitespace').

Before you do this, I would check whether it really makes a big difference in file size.
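One way to check this from C# is to compare byte counts before and after stripping. This is only a sketch under stated assumptions: the class `WhitespaceSavings` and the method `BytesSaved` are hypothetical names, the content is assumed to be sent as UTF-8, and only leading spaces and tabs are removed.

```csharp
using System.Text;
using System.Text.RegularExpressions;

public static class WhitespaceSavings
{
    // Hypothetical helper: returns how many UTF-8 bytes are saved by
    // removing leading spaces and tabs from every line of the input.
    public static int BytesSaved(string html)
    {
        string stripped = Regex.Replace(html, @"^[ \t]+", string.Empty, RegexOptions.Multiline);
        return Encoding.UTF8.GetByteCount(html) - Encoding.UTF8.GetByteCount(stripped);
    }
}
```

Dividing the result by `Encoding.UTF8.GetByteCount(html)` gives the fraction saved, which indicates whether the extra step is worth it for a given file.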

Pbirkoff
  • Unfortunately, I won't know unless I do it. The initial file comes from an HTML editor, so it's formatted so that web developers can read it clearly. – Neil Knight Feb 19 '10 at 16:34
  • The example you linked to is trimming whitespace in a single line. – Oded Feb 19 '10 at 16:35
  • Can you copy the HTML into a text editor? That way you could save it as an HTML file, create a copy, and use the editor's replace function to delete the whitespace. Then compare the file sizes. – Pbirkoff Feb 19 '10 at 17:07
  • I'll give that a go to see if the exercise is worth it. Thanks for that suggestion. – Neil Knight Feb 20 '10 at 07:17

It's not worth the trouble. You are basically ruining any formatting that the file may have. That formatting may be desired.

The first time you have to debug the file and someone has to sit down and reformat it just to read it, you'll have wasted any time you saved.

You will have wasted the money it costs for someone to spend 30 minutes formatting the thing to read.

You will also be wasting your time creating a potentially buggy step that may accidentally remove valid spacing, because using regex on HTML is not reliable.

What will you gain? A few spaces and newlines removed?
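The risk of removing valid spacing can be seen with a `<pre>` element, where leading whitespace is part of the content. The sketch below is illustrative only: `PreformattedPitfall` and its methods are hypothetical names, and the `<pre>` extraction is itself a crude regex that assumes a single, attribute-free `<pre>` element.

```csharp
using System.Text.RegularExpressions;

public static class PreformattedPitfall
{
    // Naive approach: drop spaces and tabs at the start of every line.
    public static string StripIndentation(string html)
    {
        return Regex.Replace(html, @"^[ \t]+", string.Empty, RegexOptions.Multiline);
    }

    // Returns true when stripping changed the text inside the first <pre> element.
    public static bool ChangesPreContent(string html)
    {
        // Singleline lets . match newlines so the whole <pre> body is captured.
        Regex pre = new Regex(@"<pre>(.*?)</pre>", RegexOptions.Singleline);
        string before = pre.Match(html).Groups[1].Value;
        string after = pre.Match(StripIndentation(html)).Groups[1].Value;
        return before != after;
    }
}
```

For input such as `"<pre>\n    indented code\n</pre>"`, the indentation inside the element is lost, so the rendered output would differ from the original.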

Brian Leahy
  • We are only stripping out the whitespace so we can reduce email size. We aren't going to be saving the document back to disk. – Neil Knight Feb 20 '10 at 07:17