When parsing markup with PHP and regex, the preg_replace_callback()
function combined with the (?R), (?1), (?2)...
recursive expressions, make for a very powerful tool indeed. The following script handles your test data quite nicely:
<?php // test.php 20110312_2200
function clean_non_code(&$text) {
$re = '%
# Match and capture either CODE into $1 or non-CODE into $2.
( # $1: CODE section (never empty).
<code[^>]*> # CODE opening tag
(?R)+ # CODE contents w/nested CODE tags.
</code\s*> # CODE closing tag
) # End $1: CODE section.
| # Or...
( # $2: Non-CODE section (may be empty).
[^<]*+ # Zero or more non-< {normal*}
(?: # Begin {(special normal*)*}
(?!</?code\b) # If not a code open or close tag,
< # match non-code < {special}
[^<]*+ # More {normal*}
)*+ # End {(special normal*)*}
) # End $2: Non-CODE section
%ix';
$text = preg_replace_callback($re, '_my_callback', $text);
if ($text === null) exit('PREG Error!\nTarget string too big.');
return $text;
}
// The callback function is called once for each
// match found and is passed one parameter: $matches.
function _my_callback($matches)
{ // Either $1 or $2 matched, but never both.
if ($matches[1]) {
return $matches[1];
}
// Collapse multiple tabs and spaces into a single space.
$matches[2] = preg_replace('/[ \t][ \t]++/S', ' ', $matches[2]);
// Trim each line
$matches[2] = preg_replace('/^ /m', '', $matches[2]);
$matches[2] = preg_replace('/ $/m', '', $matches[2]);
return $matches[2];
}
// Create some test data.
$data = "This is some text
This is also some text
<code>
User::setup(array(
'key1' => 'value1',
'key2' => 'value1',
'key42' => '<code>
Pay no attention to this. It has been proven over and
over again that it is <code> unpossible </code>
to parse nested stuff with regex! </code>'
));
</code>";
// Demonstrate that it works on one small test string.
echo("BEFORE:\n". $data ."\n\n");
echo("AFTER:\n". clean_non_code($data) ."\n\nTesting...");
// Build a large test string.
$bigdata = '';
for ($i = 0; $i < 30000; ++$i) $bigdata .= $data;
$size = strlen($bigdata);
// Measure how long it takes to process it.
$time = microtime(true);
$bigdata = clean_non_code($bigdata);
$time = microtime(true) - $time;
// Print benchmark results
printf("Done.\nTest string size: %d bytes. Time: %.3f sec. Speed: %.0f KB/s.\n",
$size, $time, ($size / $time)/1024.);
?>
Here are the script benchmark results when run on my test box: WinXP32 PHP 5.2.14 (cli)
'Test string size: 10410000 bytes. Time: 1.219 sec. Speed: 8337 KB/s.'
Note that this solution does not handle CODE tags having <>
angle brackets in their attributes (probably a very rare edge case), but the regex could be easily modified to handle these as well. Note also that the maximum string length will depend upon the composition of the string content (i.e. Big CODE blocks reduce the maximum input string length.)
p.s. Note to SO staff. The <!-- language: lang-none -->
doesn't work.