Strip unnecessary whitespace - "unnecessary" being key

Question

In an effort to reduce bandwidth, I am trying to strip out unnecessary whitespace. By "unnecessary", I am referring to any vertical whitespace, and horizontal whitespace at the start or end of lines, but not if it is in a <textarea> tag.

While I am no stranger to The Pony He Comes, I'm fairly sure a full HTML parser would be overkill for this task. By my understanding, a regex could work.

The regex I have right now is:

$out = preg_replace("/[ \t]*\r?\n[ \t]*/","",$in);

This seems to strip out the whitespace I specify above, except for the <textarea> rule. My question boils down to: How can I make sure that replacements do not happen within specified boundaries? It can be safely assumed that all HTML entities are properly escaped inside <textarea>s.

@minitech I appreciate that there are many edge cases, hence The Pony He Comes. However, since I am in full control of my HTML, I can easily ensure that any such `white-space` elements have a class that could be picked up. – Niet the Dark Absol Aug 28 '12 at 03:31

Ariel · Accepted Answer · 2012-08-23T00:49:06.587

2

If you have the html:

<P>a
b</P>

And you strip the vertical whitespace, you will end up with `ab` instead of `a b`. So you would need to convert the line break to a space instead, which is pointless because it saves nothing.
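To make this concrete, here is a minimal sketch that runs the pattern from the question against the two-line paragraph above. The variable names are only illustrative.

```php
<?php
// The two-line paragraph from the example above.
$in = "<P>a\nb</P>";

// Pattern from the question: removes each line break together with
// any spaces or tabs on either side of it.
$stripped = preg_replace("/[ \t]*\r?\n[ \t]*/", "", $in);

// Replacing with a single space keeps the words apart, but the
// result is exactly as long as the input, so nothing is saved.
$spaced = preg_replace("/[ \t]*\r?\n[ \t]*/", " ", $in);

echo $stripped, "\n"; // <P>ab</P>
echo $spaced, "\n";   // <P>a b</P>
```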

Only stripping near a tag would not help either since you could have (for example) two SPAN tags near each other.

Whitespace at the start or end of a line you could strip, but only because the line break next to it already acts as whitespace.

So if you really wanted to do this you could collapse multiple occurrences of whitespace to a single space.
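A rough sketch of that collapsing approach follows. `collapse_whitespace` is a hypothetical helper name, and the sketch assumes the markup contains nothing where exact spacing matters.

```php
<?php
// Collapse every run of whitespace (spaces, tabs, line breaks) into
// one space. Hypothetical helper: only safe on markup without <pre>,
// <textarea>, <script> or attribute values where spacing matters.
function collapse_whitespace(string $html): string
{
    return preg_replace('/\s+/', ' ', $html);
}

$in = "<P>a\n    b</P>\n\n<P>c  d</P>";
echo collapse_whitespace($in), "\n"; // <P>a b</P> <P>c d</P>
```

Unlike the pattern in the question, this never joins two words together, at the cost of leaving one space where a line break used to be.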

If you avoided JavaScript, input fields, `<pre>` blocks, and textareas you should be OK. But without a full parser it's impossible to actually avoid those! For example, someone could put a <TEXTAREA> inside a comment, and without a parser you would keep looking for the end of the textarea and never find it.
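As an illustration only, here is one way a regex-based "skip the textarea" attempt might look, and how a comment trips it up. `strip_outside_textarea` is a hypothetical helper; it splits the input on anything that looks like a textarea block and applies the pattern from the question to the remaining pieces.

```php
<?php
// Hypothetical helper: strip line-break whitespace everywhere except
// inside <textarea>...</textarea>, located with a regex, not a parser.
function strip_outside_textarea(string $html): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured textarea blocks
    // land at the odd indexes of the result.
    $parts = preg_split('~(<textarea\b.*?</textarea>)~is', $html, -1,
                        PREG_SPLIT_DELIM_CAPTURE);
    foreach ($parts as $i => $part) {
        if ($i % 2 === 0) { // even index: text outside any matched block
            $parts[$i] = preg_replace("/[ \t]*\r?\n[ \t]*/", "", $part);
        }
    }
    return implode('', $parts);
}

// Simple input works: only the line break inside the textarea survives.
$simple = strip_outside_textarea("<div>\n  <textarea>x\n y</textarea>\n</div>");
echo $simple, "\n";

// A comment that mentions the tag breaks it: the match now runs from
// the comment to the real closing tag, so the line breaks around
// <p>a</p> are wrongly treated as textarea content and kept.
$tricky = strip_outside_textarea(
    "<!-- <textarea> -->\n<p>a</p>\n<textarea>t</textarea>\n<p>b</p>"
);
echo $tricky, "\n";
```

The second call leaves everything between the commented-out `<textarea>` and the real `</textarea>` untouched, which is exactly the kind of mis-pairing a parser would not make.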

But worse is the value attribute of input. You don't want to mess with that - but it's completely impossible to even find it without a parser:

<INPUT name="value='hello'" value='name="hi"'>

Syntax highlighting makes it clear what the attributes are (`name` holds `value='hello'` and `value` holds `name="hi"`), but try finding them without a parser.
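A small sketch of the difference, assuming PHP's bundled DOM extension is available for the parser half (the code skips that half if it is not). The regex is just one plausible naive attempt, not a pattern from the question.

```php
<?php
$tag = '<INPUT name="value=\'hello\'" value=\'name="hi"\'>';

// Naive attempt: take the first thing that looks like value="..."
// or value='...'. It lands inside the name attribute.
preg_match('/value=(["\'])(.*?)\1/', $tag, $m);
echo $m[2], "\n"; // hello

// A real HTML parser reads the two attributes correctly.
$nameAttr = null;
$valueAttr = null;
if (class_exists('DOMDocument')) {
    libxml_use_internal_errors(true); // keep parser warnings quiet
    $doc = new DOMDocument();
    $doc->loadHTML($tag);
    $input = $doc->getElementsByTagName('input')->item(0);
    $nameAttr = $input->getAttribute('name');   // value='hello'
    $valueAttr = $input->getAttribute('value'); // name="hi"
    echo $nameAttr, "\n", $valueAttr, "\n";
}
```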

Avoiding the inside of tags doesn't help either since you can legally put > inside a comment.

edited Aug 23 '12 at 00:49

answered Aug 23 '12 at 00:43


I specifically only want to keep whitespace that is deliberately placed there and not for organisation purposes. If I want `a b` then I will type `a b` and not `ab`. I don't intend this to be a catch-all solution, just one that will work as long as I am aware of the rules. – Niet the Dark Absol Aug 23 '12 at 00:47
What if you typed: `<input>` newline `<input>`? The newline (which is certainly common between tags, so you can't say you will avoid it) is also acting as a space between them. Actually, how would you avoid the space being stripped? You would have to put them both on the same line; there is no other way to avoid it. So you'll end up with poorly formatted code just to handle this. – Ariel Aug 23 '12 at 00:51
There are a few options. Putting them on the same line is one of them, but having two inputs side-by-side is usually bad design anyway. – Niet the Dark Absol Aug 23 '12 at 00:53
So for every problem with your regex you will answer "I won't do that"? What if you have two tags near each other? Do I have to keep giving examples of things you might do, or can you admit that this won't work except in very limited cases? If you must do this, then at least have it running during development too, not just production. That way you will quickly find out when you write something that messes it up (and you inevitably will). – Ariel Aug 23 '12 at 01:02
I know what my HTML is/will be like, and the rules I have defined above are properly thought-out so as to not cause any conflict. Do you think you could just answer the question instead of treating me like a moron? – Niet the Dark Absol Aug 23 '12 at 01:05
I did answer your question. Your question was how to avoid the inside of `<textarea>` and the answer is you can't. Regex is terrible at trying to match "pairs". And regex is even worse at "not", i.e. match xyz but not abc. – Ariel Aug 23 '12 at 01:13
