Regex to match whitespace but skip sections

Question

I understand that, since regex is essentially stateless, it's rather difficult to achieve complicated matches without supplementary application logic. However, I'm curious to know whether the following is possible.

Match all whitespace, easy enough: \s+

But skip whitespace between certain delimiters, in my case ~~<pre> and </pre>~~ the word nostrip.

Are there any tricks to achieve this? I was thinking along the lines of two separate matches, one for all whitespace, and one for ~~<pre> blocks~~ nostrip sections, and somehow negating the latter from the former.

"This is some text NOSTRIP this is more text NOSTRIP some more text."
// becomes
"ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext."

The nesting of given ~~tags~~ nostrip sections is irrelevant, and I'm not trying to parse ~~the tree~~ HTML or anything, just tidying a text file, but saving the whitespace in ~~<pre> blocks~~ nostrip sections for obvious reasons.

(better?)

This is ultimately what I went with. I'm sure it can be optimized in a few places, but it works nicely for now.

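public function stripWhitespace($html, array $skipTags = array('pre')){
    // Turn each tag name into a lazy pattern matching the whole element, e.g. "<pre.*?/pre>"
    foreach($skipTags as &$tag){
        $tag = "<{$tag}.*?/{$tag}>";
    }
    unset($tag); // break the reference left over from the foreach
    $skipped = array();
    // Swap every skipped element for an indexed placeholder delimited by \x1D
    $buffer = preg_replace_callback('#(?<tag>' . implode('|', $skipTags) . ')#si',
        function($match) use(&$skipped){
            $skipped[] = $match['tag'];
            return "\x1D" . (count($skipped) - 1) . "\x1D";
        }, $html
    );
    // Collapse whitespace runs, then drop whitespace next to tag boundaries
    $buffer = preg_replace('#\s+#si', ' ', $buffer);
    $buffer = preg_replace('#(?:(?<=>)\s|\s(?=<))#si', '', $buffer);
    // Restore the skipped elements in place of their placeholders
    for($i = count($skipped) - 1; $i >= 0; $i--){
        $buffer = str_replace("\x1D{$i}\x1D", $skipped[$i], $buffer);
    }
    return $buffer;
}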

What you'd need is in fact even more complicated: the regex would also need to ensure that there's no `</pre>` between the `<pre>` and the whitespace, and vice versa. — abesto, May 12 '11 at 20:54
http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — mellamokb, May 12 '11 at 21:02
*sigh*; Been there, seen the answer, et cetera. Regex for structured document parsing, **OH NO!** Well, I made the choice of regex to strip whitespace as a responsible developer. I could have just as easily said I want to strip all the whitespace from a text file except whitespace between the words `foo` and `bar`. In fact... — Dan Lugg, May 12 '11 at 21:22
This one might be more related: [Why minify assets and not the markup?](http://stackoverflow.com/questions/1306792/why-minify-assets-and-not-the-markup) — Kobi, May 12 '11 at 22:08

score 2 · Accepted Answer · answered May 12 '11 at 21:41

If you are using a scripting language, I would use a multi-step approach.

1. Pull out the NOSTRIP sections, save them to an array, and replace them with markers (### or something).
2. Replace all the spaces.
3. Re-inject all your saved NOSTRIP snippets.
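
The three steps above could be sketched in PHP roughly as follows. This is an illustrative sketch, not the answerer's code: the function name `stripOutsideNostrip` and the `\x1D` placeholder delimiter are assumptions, and it assumes the input never contains the `\x1D` control character.

```php
<?php
// Hypothetical helper: strip whitespace everywhere except inside
// MARKER ... MARKER sections (the marker defaults to NOSTRIP).
function stripOutsideNostrip(string $text, string $marker = 'NOSTRIP'): string
{
    $saved = [];
    $quoted = preg_quote($marker, '/');

    // Step 1: pull out each protected section and leave an indexed placeholder.
    $buffer = preg_replace_callback(
        '/' . $quoted . '.*?' . $quoted . '/s',
        function (array $m) use (&$saved): string {
            $saved[] = $m[0];
            return "\x1D" . (count($saved) - 1) . "\x1D";
        },
        $text
    );

    // Step 2: remove all remaining whitespace (\x1D is not matched by \s).
    $buffer = preg_replace('/\s+/', '', $buffer);

    // Step 3: re-inject the saved sections in a single left-to-right pass.
    return preg_replace_callback(
        '/\x1D(\d+)\x1D/',
        function (array $m) use ($saved): string {
            return $saved[(int) $m[1]];
        },
        $buffer
    );
}

echo stripOutsideNostrip('This is some text NOSTRIP this is more text NOSTRIP some more text.');
// prints: ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext.
```

Restoring with one regex pass rather than repeated string replacement means a digit that happens to sit between two placeholders is not mistaken for a placeholder of its own.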

Matt

Thanks **Matt**; That's the direction I was sort of heading, I was just curious about ways this could be achieved without multiple steps. Also, yes **PHP**. I was hoping for something along the lines of a way to "turn off" the regex parsing when it hits a `nostrip` tag, and then turn it back on when it hits another. – Dan Lugg May 12 '11 at 21:46
Also, what would be a safe character/characters to use as temporary delimiters? (*read; what do you/others you know/standard conventions use?*) I was thinking perhaps an obscure control character, like `BEL` – Dan Lugg May 12 '11 at 21:54
I always find myself using regex in one-off situations, so it's easier to figure out a unique string for the file. Something like "~~~" usually works. But as you suggest there isn't a foolproof string. You can only mitigate risk with more complicated strings. Try: ##~!!~!##((__# – Matt May 12 '11 at 22:52
I went with your answer, as it was easiest for the specifics. Implementation is in my edit above. – Dan Lugg May 13 '11 at 07:39
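
On the idea raised in these comments of "turning off" the regex inside a nostrip section: one way to get close in a single pass, without placeholder characters, is to let one pattern match either a whole protected section or a run of whitespace, and decide in a callback what to emit. The sketch below is only an illustration under assumptions: the function name `stripWhitespaceSinglePass` is hypothetical, and sections are assumed to be delimited by the same marker on both sides, as in the question's example.

```php
<?php
// Hypothetical helper: one pass over the text, no temporary delimiters.
function stripWhitespaceSinglePass(string $text, string $marker = 'NOSTRIP'): string
{
    $quoted = preg_quote($marker, '/');

    // Alternation order matters: at each position a protected section is
    // tried first, so whitespace inside it is never matched on its own.
    $pattern = '/(' . $quoted . '.*?' . $quoted . ')|\s+/s';

    return preg_replace_callback(
        $pattern,
        function (array $m): string {
            // Group 1 is present only when the protected alternative matched;
            // keep it verbatim, otherwise drop the matched whitespace.
            return $m[1] ?? '';
        },
        $text
    );
}

echo stripWhitespaceSinglePass('This is some text NOSTRIP this is more text NOSTRIP some more text.');
// prints: ThisissometextNOSTRIP this is more text NOSTRIPsomemoretext.
```

Because nothing is swapped out and back in, the question of a "safe" delimiter does not arise here. An opening marker with no closing partner is treated as ordinary text, so whitespace after it is still stripped.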

score 1 · Answer 2 · answered May 12 '11 at 21:46

I once created a set of functions to reduce white space in html outputs:

function minify($html) {
        if(empty($html)) {
                return $html;
        }
        // $1 = text before the first <pre> block (cleaned by parse()),
        // $3 = the <pre> block itself (kept as-is), $4 = the rest (minified recursively).
        // Note: the /e modifier used here was removed in PHP 7.
        $html = preg_replace('/^(.*)((<pre.*<\/pre>)(.*?))?$/Ues', "parse('$1').'$3'.minify('$4')", $html);
        return $html;
}

function parse($html) {
        // Replace multiple spaces with a single space
        $html = preg_replace('/(\s+)/m', ' ', $html);
        // Remove spaces that are followed by either > or <
        $html = preg_replace('/ ([<>])/', '$1', $html);
        // Remove spaces that follow >
        $html = str_replace('> ', '>', $html);
        return $html;
}

$html = minify($html);

You'll probably have to modify this slightly to fit your needs.
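
Since the `/e` modifier is no longer available in PHP 7 and later, the same idea can be sketched without it by splitting the input on `<pre>` blocks and cleaning only the pieces in between. This is a rough adaptation rather than the answerer's code: the function name `minifyOutsidePre` is an assumption, and the three cleanup steps are copied from `parse()` above.

```php
<?php
// Hypothetical adaptation: reduce whitespace outside <pre>...</pre> blocks.
function minifyOutsidePre(string $html): string
{
    // With PREG_SPLIT_DELIM_CAPTURE the captured <pre> blocks stay in the
    // result: even indexes hold ordinary markup, odd indexes hold <pre> blocks.
    $parts = preg_split('#(<pre\b.*?</pre>)#is', $html, -1, PREG_SPLIT_DELIM_CAPTURE);

    foreach ($parts as $i => $part) {
        if ($i % 2 === 1) {
            continue; // a <pre> block: leave it untouched
        }
        // Collapse whitespace runs into a single space
        $part = preg_replace('/\s+/', ' ', $part);
        // Remove a space that is followed by < or >
        $part = preg_replace('/ ([<>])/', '$1', $part);
        // Remove a space that follows >
        $parts[$i] = str_replace('> ', '>', $part);
    }

    return implode('', $parts);
}

echo minifyOutsidePre("<p> a  b </p>\n<pre> x  y </pre> <p>c</p>");
// prints: <p>a b</p><pre> x  y </pre><p>c</p>
```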

Thanks **Arjan**; I'll give it a shot shortly, trying out a few things. — Dan Lugg, May 12 '11 at 21:52
