
I'm parsing a docblock and would like to remove any extra whitespace from it. The problem is that I do not want to remove whitespace within a <code>code goes here</code> block.

Example, I use this to remove extra whitespace:

$string = preg_replace('/[ ]{2,}/', '', $string);

But I would like to keep whitespace within <code></code>

This code/string:

This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

Should be transformed into:

This is some text
This is also some text

<code>
User::setup(array(
    'key1' => 'value1',
    'key2' => 'value1'
));
</code>

How can I do this?

Alan Moore
sandelius
  • 1
    You might want to consider writing a simple parser for this. At the least, you need to distinguish lines outside the code block from those inside it, and you can't do that with a single regexp. – poke Mar 12 '11 at 15:11
  • 1
    You don’t need to use a character class for a single character; just write `/ {2,}/`. – Gumbo Mar 12 '11 at 15:26
  • 2
    Regular expressions allow for conditions `(?(x)y|z)`, but I have no idea how to apply that to match either line-wise or in blocks. You are better off iterating line-wise over the source text, setting and reversing a state flag on occurrences of `</?code>`, and applying the regex `/^\s{2,}/` to each line only while outside a code block. – mario Mar 12 '11 at 16:25
  • @mario I was going to write that as an answer... please do that so I can upvote it :) – alex Mar 12 '11 at 16:30
  • @alex: Too lazy. You go and write it and I vote it up! :P – mario Mar 12 '11 at 16:31

4 Answers


You aren't really looking for a condition - you need a way to skip parts of the string so they are not replaced. This can be done rather easily using preg_replace, by inserting dummy groups and replacing each group with itself. In your case you only need one:

$str = preg_replace("~(<code>.*?</code>)|^ +| +$|( ) +~smi" , "$1$2", $str);

How does it work?

  • (<code>.*?</code>) - match a <code> block into the first group, $1. This assumes simple formatting and no nesting, but can be made more elaborate if needed.
  • ^ + - match and remove spaces at the beginnings of lines.
  • +$ (preceded by a space) - match and remove spaces at the ends of lines.
  • ( ) + - match two or more spaces in the middle of lines, and capture the first one into the second group, $2.

The replacement string, $1$2, keeps <code> blocks and the first space if one was captured, and removes anything else the pattern matches.

Things to remember:

  • If $1 or $2 didn't capture, it will be replaced with an empty string.
  • Alternations (a|b|c) work from left to right - when it makes a match it is satisfied, and doesn't try matching again. That is why ^ +| +$ must be before ( ) +.

Working example: http://ideone.com/HxbaV
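
For anyone who wants to try it locally, here is a minimal sketch that runs the same pattern against the sample input from the question. The variable names ($pattern, $clean) are only for illustration, and it assumes <code> blocks are not nested.

```php
<?php
// Sample input from the question: extra spaces outside <code>, indentation inside.
$str = "This  is some  text\n"
     . "  This is also   some text\n"
     . "\n"
     . "<code>\n"
     . "User::setup(array(\n"
     . "    'key1' => 'value1',\n"
     . "    'key2' => 'value1'\n"
     . "));\n"
     . "</code>";

// Group 1 keeps <code> blocks verbatim; group 2 keeps one space from each run.
// Leading and trailing spaces match without a group, so they are dropped.
$pattern = '~(<code>.*?</code>)|^ +| +$|( ) +~smi';
$clean = preg_replace($pattern, '$1$2', $str);

echo $clean, "\n";
```

The text lines come out with single spaces and no leading indentation, while the four-space indentation inside the <code> block is left as it was.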

Kobi

When parsing markup with PHP and regex, the preg_replace_callback() function combined with the (?R), (?1), (?2)... recursive expressions makes for a very powerful tool. The following script handles your test data quite nicely:

<?php // test.php 20110312_2200

function clean_non_code(&$text) {
    $re = '%
    # Match and capture either CODE into $1 or non-CODE into $2.
      (                      # $1: CODE section (never empty).
        <code[^>]*>          # CODE opening tag
        (?R)+                # CODE contents w/nested CODE tags.
        </code\s*>           # CODE closing tag
      )                      # End $1: CODE section.
    |                        # Or...
      (                      # $2: Non-CODE section (may be empty).
        [^<]*+               # Zero or more non-< {normal*}
        (?:                  # Begin {(special normal*)*}
          (?!</?code\b)      # If not a code open or close tag,
          <                  # match non-code < {special}
          [^<]*+             # More {normal*}
        )*+                  # End {(special normal*)*}
      )                      # End $2: Non-CODE section
    %ix';

    $text = preg_replace_callback($re, '_my_callback', $text);
    // Double quotes are needed here so that \n is output as a newline.
    if ($text === null) exit("PREG Error!\nTarget string too big.");
    return $text;
}

// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
    if ($matches[1]) {
        return $matches[1];
    }
    // Collapse multiple tabs and spaces into a single space.
    $matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
    // Trim each line
    $matches[2] = preg_replace('/^ /m', '', $matches[2]);
    $matches[2] = preg_replace('/ $/m', '', $matches[2]);
    return $matches[2];
}

// Create some test data.
$data = "This  is some  text
  This is also   some text

<code>
User::setup(array(
    'key1'      => 'value1',
    'key2'      => 'value1',
    'key42'     => '<code>
        Pay no attention to this. It has been proven over and
        over again that it is <code>   unpossible   </code>
        to parse nested stuff with regex!           </code>'
));
</code>";

// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");

// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);

// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;

// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
    $size, $time, ($size / $time)/1024.);
?>

Here are the script benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)

'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'

Note that this solution does not handle CODE tags that have <> angle brackets in their attributes (probably a very rare edge case), but the regex could easily be modified to handle these as well. Note also that the maximum string length depends on the composition of the string content (i.e. big CODE blocks reduce the maximum input string length).


ridgerunner
  • Very interesting. However, the OP didn't mention tabs or nested tags, which I suspect made the answer much more complicated than needed. – Kobi Mar 13 '11 at 08:08

What you will want is to parse it using some form of HTML parser.

For example, you could iterate through all elements ignoring code elements with DOMDocument and strip whitespace from their text nodes.

Alternatively, read the file with file() so you have an array of lines, and step through each line, stripping whitespace if it is outside of a code element.

To determine whether you are in a code element, look for the starting tag <code> and set a flag that says you are in code element mode. You can then skip these lines. Reset the flag when you encounter </code>. You could take nesting into account by storing the state as an integer, even though nested code elements are not the smartest idea (why would you nest them?).
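
A rough sketch of that line-by-line idea, using an integer depth so nesting is tolerated. The function name strip_outside_code is made up for illustration, the type declarations assume PHP 7 or later, and the sketch assumes each <code> or </code> tag sits on a line with no ordinary text that needs cleaning.

```php
<?php
// Hypothetical helper: collapse whitespace on every line outside <code> blocks.
function strip_outside_code(array $lines): array
{
    $depth = 0;      // nesting depth of <code> elements; 0 means "outside"
    $out = array();
    foreach ($lines as $line) {
        // Count opening and closing tags on this line.
        $opens  = preg_match_all('~<code\b[^>]*>~i', $line);
        $closes = preg_match_all('~</code\s*>~i', $line);
        if ($depth > 0 || $opens > 0 || $closes > 0) {
            // Inside a block, or a line carrying a tag: keep untouched.
            $out[] = $line;
        } else {
            // Outside: collapse runs of spaces/tabs, then trim both ends.
            $out[] = trim(preg_replace('/[ \t]{2,}/', ' ', $line));
        }
        $depth = max(0, $depth + $opens - $closes);
    }
    return $out;
}

$text = "This  is some  text\n  This is also   some text\n\n<code>\n    'key1' => 'value1',\n</code>";
$result = implode("\n", strip_outside_code(explode("\n", $text)));
echo $result, "\n";
```

The example splits with explode() rather than file() so it runs without an input file. With file(), passing FILE_IGNORE_NEW_LINES would give the same kind of array.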

Mario came up with this before me.

alex

Parsing HTML with regexes is a bad idea.

RegEx match open tags except XHTML self-contained tags

Use something like Zend_Dom to parse the HTML and extract the parts of it in which you need to replace spaces.

Vladislav Rastrusny
  • 3
    Note that not everything in which some kind of tag appears (here: `<code>`) is HTML. This is obviously some kind of custom formatting markup (like Markdown or something), so a full HTML/DOM parser will not work. – edit: in fact it is a code documentation block. – poke Mar 12 '11 at 15:26