
How are you? I'll get straight to the point.

I'm using a recursive regular expression that basically removes individual or nested <blockquote> tags. I only need to remove plain <blockquote> ... </blockquote> text, nested or not, and leave whatever is outside of these.

This regex does the job EXACTLY as I want (note the use of lookahead and recursion)

$comment=preg_replace('#<blockquote>((?!(</?blockquote>)).|(?R))*</blockquote>#s',"",$comment);

but it has a big problem: when $comment is large (more than about 3500 characters), Apache crashes (I assume with a segmentation fault).

I need a solution to the problem: either fixing the crash, using a better regexp, or a custom function that does the same job.

If you simply have ideas on how to remove specific nested tags, they are very welcome.

Thank you in advance

Dandy
    Please refrain from parsing HTML with RegEx as it will [drive you į̷̷͚̤̤̖̱̦͍͗̒̈̅̄̎n̨͖͓̹͍͎͔͈̝̲͐ͪ͛̃̄͛ṣ̷̵̞̦ͤ̅̉̋ͪ͑͛ͥ͜a̷̘͖̮͔͎͛̇̏̒͆̆͘n͇͔̤̼͙̩͖̭ͤ͋̉͌͟eͥ͒͆ͧͨ̽͞҉̹͍̳̻͢](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454). Use an [HTML parser](http://stackoverflow.com/questions/292926/robust-mature-html-parser-for-php) instead. – Madara's Ghost Aug 10 '12 at 23:29
  • Segfault = [stack overflow](http://en.wikipedia.org/wiki/Stack_overflow) = probably infinite recursion. Are you sure it is simply the size of the string, or is it the content of it that is causing the problem? Although @Truth speaks the *truth* (ha!) - an HTML parser is a much better tool for this job. If you must persist with regex and you are certain the expression suits your needs, try throwing an `S` (study) flag on it, I have seen it fix a multitude of sins. – DaveRandom Aug 11 '12 at 00:00
  • @Truth Congrats, you seem to have found a way to break to SO CSS with that comment. Skillz... – DaveRandom Aug 11 '12 at 00:01
  • @Dandy [Here](http://codepad.viper-7.com/2lFcsr) is an example of using an HTML parser to strip blockquotes from your string. – DaveRandom Aug 11 '12 at 00:19
  • Thank you all. You were all helpful, and I went for the @cleong solution. I'm really grateful. – Dandy Aug 11 '12 at 02:11

1 Answer


Man, your pattern segfaults like crazy! Even a comment of several hundred bytes ends in a crash.

It's a lot simpler to use preg_split() to split up the string, then use a counter to keep track of how deep you are. Whenever the depth is greater than zero, you throw away the text. Here's the implementation:

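// Split on opening/closing blockquote tags and keep the tags as tokens:
// even indexes hold text, odd indexes hold the tag that follows it.
$tokens = preg_split('#(</?blockquote.*?>)#s', $comment, -1, PREG_SPLIT_DELIM_CAPTURE);
$outsideTokens = array();
$depth = 0;
for ($i = 0, $n = count($tokens); $i < $n; $i += 2) {
    if ($depth == 0) {
        $outsideTokens[] = $tokens[$i]; // text outside every blockquote
    }
    // The last text token has no tag after it, so check before reading one.
    if ($i + 1 < $n) {
        if ($tokens[$i + 1][1] == '/') {
            $depth--; // closing tag: leave one level
        } else {
            $depth++; // opening tag: enter one level
        }
    }
}
$comment = implode($outsideTokens);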

The code should work even when the start tag contains attributes.
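If you want to reuse this, here is a minimal sketch of the same split-and-count idea wrapped in a function. The name strip_blockquotes() and the sample string are made up for illustration, and clamping the depth at zero for a stray closing tag is my own assumption, not something the snippet above does.

```php
<?php
// Hypothetical helper: same split-and-count approach, packaged for reuse.
function strip_blockquotes($comment)
{
    // Even indexes are text, odd indexes are the captured blockquote tags.
    $tokens = preg_split('#(</?blockquote.*?>)#s', $comment, -1, PREG_SPLIT_DELIM_CAPTURE);
    $kept = array();
    $depth = 0;
    foreach ($tokens as $i => $token) {
        if ($i % 2 == 0) {
            // Text token: keep it only when we are outside every blockquote.
            if ($depth == 0) {
                $kept[] = $token;
            }
        } elseif ($token[1] == '/') {
            // Assumption: a stray closing tag must not push the depth below zero.
            $depth = max(0, $depth - 1);
        } else {
            $depth++;
        }
    }
    return implode('', $kept);
}

// Made-up sample with a nested quote and an attribute on the outer tag.
$sample = 'Hi <blockquote class="q">a <blockquote>b</blockquote> c</blockquote>there';
echo strip_blockquotes($sample), "\n";
```

Because the captured tags always land on odd indexes (empty text tokens are preserved), the index parity is enough to tell text from tags, and no recursion is involved, so the length of the comment should not matter.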

cleong
  • Wow! You just did it. I could not figure it out and you just did it! Congratulations and many many thanks! – Dandy Aug 11 '12 at 02:00