Condition inside regex pattern

Question

I'm parsing a docblock and would like to remove any extra whitespace from it. The problem is that I do not want to remove whitespace inside a <code>code goes here</code> block.

For example, I use this to remove extra whitespace:

$string = preg_replace('/[ ]{2,}/', '', $string);

But I would like to keep the whitespace within <code></code>.

This code/string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

Should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?

You might want to consider writing a simple parser for this. At the least, you need to distinguish lines outside the code block from those inside it, and you can't do that with a single regexp. — poke, Mar 12 '11 at 15:11
You don’t need to use a character class for a single character; just write `/ {2,}/`. — Gumbo, Mar 12 '11 at 15:26
Regular expressions allow for conditions `(?(x)y|z)`, but I have no idea how to apply that to match either line-wise or in blocks. You are better off iterating line-wise over the source text, setting and resetting a state flag on occurrences of `</?code>`, and applying the regex `/^\s{2,}/` to each line only while the flag is unset. — mario, Mar 12 '11 at 16:25
@mario I was going to write that as an answer... please do that so I can upvote it :) — alex, Mar 12 '11 at 16:30

score 4 · Accepted Answer · answered Mar 13 '11 at 07:55

You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy groups and replacing each group with itself. In your case you only need one:

$str = preg_replace('~(<code>.*?</code>)|^ +| +$|( ) +~smi', '$1$2', $str);

How does it work?

(<code>.*?</code>) - match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but the pattern can be made more elaborate if needed.
^ + - match and remove spaces at the beginnings of lines.
 +$ - match and remove spaces at the ends of lines.
( ) + - match two or more spaces in the middle of a line, and capture the first one into the second group, $2.

The replacement string, $1$2, keeps <code> blocks and the first space if one was captured, and removes anything else that was matched.

Things to remember:

If $1 or $2 didn't capture, it will be replaced with an empty string.
Alternations (a|b|c) work from left to right - when it makes a match it is satisfied, and doesn't try matching again. That is why ^ +| +$ must be before ( ) +.

Working example: http://ideone.com/HxbaV
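
To see the pattern at work on the question's sample, here is a minimal sketch. The wrapper name collapse_spaces_outside_code() is made up for illustration; the pattern and replacement string are the ones from this answer, and the same assumptions apply (no nested <code> blocks, spaces only, no tabs).

```php
<?php
// Hypothetical wrapper around the pattern from this answer.
function collapse_spaces_outside_code(string $str): string
{
    // $1 keeps a whole <code> block; $2 keeps the first space of a run.
    // Leading and trailing spaces match without a capture and are dropped.
    return preg_replace('~(<code>.*?</code>)|^ +| +$|( ) +~smi', '$1$2', $str);
}

// Sample text modelled on the question.
$input = "This  is some  text\n"
       . "  This is also   some text\n"
       . "\n"
       . "<code>\n"
       . "User::setup(array(\n"
       . "    'key1' => 'value1'\n"
       . "));\n"
       . "</code>";

echo collapse_spaces_outside_code($input), "\n";
```

The two text lines should come out with single spaces and no indentation, while everything between <code> and </code> keeps its original spacing.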

Yes, for non-nested CODE blocks, this is a very good solution indeed (and it is quite fast, too). +1 — ridgerunner, Mar 13 '11 at 20:46

ridgerunner · Answer 2 · 2011-03-13T06:56:47.623

When parsing markup with PHP and regex, the preg_replace_callback() function, combined with the recursive expressions (?R), (?1), (?2)..., makes for a very powerful tool. The following script handles your test data quite nicely:

<?php // test.php 20110312_2200

function clean_non_code(&$text) {
    $re = '%
    # Match and capture either CODE into $1 or non-CODE into $2.
      (                      # $1: CODE section (never empty).
        <code[^>]*>          # CODE opening tag
        (?R)+                # CODE contents w/nested CODE tags.
        </code\s*>           # CODE closing tag
      )                      # End $1: CODE section.
    |                        # Or...
      (                      # $2: Non-CODE section (may be empty).
        [^<]*+               # Zero or more non-< {normal*}
        (?:                  # Begin {(special normal*)*}
          (?!</?code\b)      # If not a code open or close tag,
          <                  # match non-code < {special}
          [^<]*+             # More {normal*}
        )*+                  # End {(special normal*)*}
      )                      # End $2: Non-CODE section
    %ix';

    $text = preg_replace_callback($re, '_my_callback', $text);
    // Double quotes are needed here so that \n is an actual newline.
    if ($text === null) exit("PREG Error!\nTarget string too big.");
    return $text;
}

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1]) {
        return $matches[1];
    }
    // Collapse multiple tabs and spaces into a single space.
    $matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
    // Trim each line
    $matches[2] = preg_replace('/^ /m', '', $matches[2]);
    $matches[2] = preg_replace('/ $/m', '', $matches[2]);
    return $matches[2];
}

// Create some test data.
$data = "This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1'      => 'value1',
    'key2'      => 'value1',
    'key42'     => '<code>
        Pay no attention to this. It has been proven over and
        over again that it is <code>   unpossible   </code>
        to parse nested stuff with regex!           </code>'
));
</code>";

// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");

// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);

// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;

// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
    $size, $time, ($size / $time)/1024.);
?>

Here are the script benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)

'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'

Note that this solution does not handle CODE tags that have <> angle brackets in their attributes (probably a very rare edge case), but the regex could easily be modified to handle these as well. Note also that the maximum string length depends on the composition of the string content (i.e. big CODE blocks reduce the maximum input string length).
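
When the input does get too big, preg_replace_callback() returns null, and preg_last_error() can say why. The following is only a sketch of how the null check in the script could report the reason; describe_preg_error() and replace_or_fail() are hypothetical helper names, and the type hints assume PHP 7 or later rather than the 5.2.14 used for the benchmark.

```php
<?php
// Hypothetical helper: turn a preg_last_error() code into readable text.
function describe_preg_error(int $code): string
{
    $names = [
        PREG_NO_ERROR              => 'no error',
        PREG_INTERNAL_ERROR        => 'internal PCRE error',
        PREG_BACKTRACK_LIMIT_ERROR => 'backtrack limit exhausted',
        PREG_RECURSION_LIMIT_ERROR => 'recursion limit exhausted',
        PREG_BAD_UTF8_ERROR        => 'malformed UTF-8 data',
    ];
    return $names[$code] ?? "unknown error code $code";
}

// Hypothetical wrapper: same call as in the script, but a failure
// raises an exception that names the cause instead of exiting.
function replace_or_fail(string $pattern, callable $callback, string $subject): string
{
    $result = preg_replace_callback($pattern, $callback, $subject);
    if ($result === null) {
        throw new RuntimeException(
            'PCRE failed: ' . describe_preg_error(preg_last_error())
        );
    }
    return $result;
}
```

If the reported cause is one of the limits, the pcre.backtrack_limit and pcre.recursion_limit ini settings are the usual knobs to look at.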

Very interesting. However, the OP didn't mention tabs or nested tags, which I suspect made the answer much more complicated than needed. — Kobi, Mar 13 '11 at 08:08

score 1 · Answer 3 · edited May 23 '17 at 12:11

What you will want is to parse it using some form of HTML parser.

For example, with DOMDocument you could iterate over all elements except code elements and strip whitespace from their text nodes.

Alternatively, read the file with file() so you have an array of lines, and step through each line, stripping whitespace if it is outside of a code element.

To determine whether you are in a code element, look for the starting tag <code> and set a flag indicating that you are inside a code element; you can then skip these lines. Reset the flag when you encounter </code>. You could account for nesting by storing the state as an integer depth instead of a flag, even though nested code elements are not the smartest idea (why would you nest them?).
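
A rough sketch of the line-by-line idea, with the state kept as an integer depth. The function name is hypothetical, and the sketch assumes "\n" line endings, spaces rather than tabs, and <code> tags that sit on their own lines as in the question; any line containing a tag is left untouched.

```php
<?php
// Hypothetical helper: collapse extra spaces on every line that is
// outside a <code> element, tracking nesting with a depth counter.
function strip_spaces_outside_code(string $text): string
{
    $depth = 0;   // number of <code> elements currently open
    $out   = [];

    foreach (explode("\n", $text) as $line) {
        $opens  = preg_match_all('~<code\b[^>]*>~i', $line);
        $closes = preg_match_all('~</code\s*>~i', $line);

        // The line counts as code if a block is already open
        // or the line itself contains an opening or closing tag.
        $inCode = $depth > 0 || $opens > 0 || $closes > 0;
        $depth  = max(0, $depth + $opens - $closes);

        if (!$inCode) {
            // Trim spaces at both ends, then collapse inner runs.
            $line = preg_replace('/ {2,}/', ' ', trim($line, ' '));
        }
        $out[] = $line;
    }

    return implode("\n", $out);
}
```

To process a file, the lines returned by file() could be fed through the same loop instead of splitting a string with explode().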

Mario came up with this before me.

score 0 · Answer 4 · edited May 23 '17 at 12:19

Parsing HTML with regexes is a bad idea.

RegEx match open tags except XHTML self-contained tags

Use something like Zend_DOM to parse the HTML and extract the parts in which you need to replace spaces.

answered Mar 12 '11 at 15:18 · Vladislav Rastrusny

Note that not everything in which some kind of tag appears (here: `<code>`) is HTML. This is obviously some kind of custom formatting markup (like Markdown or something), so a full HTML/DOM parser will not work. Edit: in fact, it is a code documentation block. – poke Mar 12 '11 at 15:26
