I have multiply nested quotes in an HTML document that look like this:

<div class="quote-container">
   <div class="quote-block">
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
      <div class="quote-container">
         <div class="quote-block">
         </div>
      </div>
   </div>
</div>

I need to find and remove the quotes. I use this expression:

<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>

This works for single quotes. However, it fails on multiply nested quotes (as in the example above).

My task is to search for:

<div class="quote-container">.*<div class="quote-block">

plus any string NOT containing

<div

and ending with

.*</div>.*</div>

I tried lookbehind and lookahead assertions like this:

<div class="quote-container">.*<div class="quote-block">.*(?!<div).*</div>.*</div>

but they don't work.

Is there a way to do this? I need a Perl-style expression that I can use in TextPipe (I use it for forum parsing, and later I do text-to-speech conversion).

Thanks in advance.

Jay Sullivan

3 Answers

I think your problem is that you are using the greedy expression .* throughout.

Try replacing every .* with the non-greedy .*?
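As a rough illustration of the greedy versus non-greedy difference, here is a small Python sketch (Python's `re` module accepts the same `.*` and `.*?` syntax as Perl). The sample HTML, with two separate quotes and a reply between them, is made up for the demonstration, and `re.DOTALL` stands in for Perl's /s modifier.

```python
import re

# Hypothetical input: two separate (non-nested) quotes with a reply in between.
html = (
    '<div class="quote-container"><div class="quote-block">first quote</div></div>'
    'reply text'
    '<div class="quote-container"><div class="quote-block">second quote</div></div>'
)

# The expression from the question (greedy) and the suggested non-greedy variant.
greedy = r'<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>'
lazy = r'<div class="quote-container">.*?<div class="quote-block">.*?</div>.*?</div>'

# Greedy: one match runs from the first container to the very last </div>,
# so the reply between the quotes is removed as well.
greedy_result = re.sub(greedy, '', html, flags=re.DOTALL)

# Non-greedy: each quote is matched on its own and the reply survives.
lazy_result = re.sub(lazy, '', html, flags=re.DOTALL)

print(repr(greedy_result))
print(repr(lazy_result))
```

Note that this only covers quotes that sit side by side. With nested quotes, the non-greedy pattern still stops at the first two closing tags it finds, so it can cut a nested structure in the middle.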

Bohemian

I would personally solve this problem by repeatedly replacing the quotes until there are no quotes left to replace. There's really no way to handle this in a single regex replace; what you'll need to do is something like:

pseudo-code:

html="... from your post ...";
do{
 oldhtml=html          // remember the text before this pass
 html=replace(
        '/<div class="quote-container">.*<div class="quote-block">.*</div>.*</div>/s',
        '',
        html
    )
} while(html!=oldhtml) // stop once a pass changes nothing

This will handle all manner of nested quotes.
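One possible way to make the repeated-replace idea concrete is sketched below in Python. It is not the answer's exact expression: the pattern is swapped for one that matches only an innermost quote, using `(?:(?!<div).)*?` to express "any text that does not contain <div", so the loop peels quotes from the inside out. The function name `strip_quotes` and the sample input are assumptions made for illustration, and the sketch assumes a quote block contains no other `<div` elements besides nested quotes.

```python
import re

# Matches only an innermost quote: between the quote-block opening tag and the
# first closing </div>, no further "<div" may appear.
INNERMOST_QUOTE = re.compile(
    r'<div class="quote-container">\s*'
    r'<div class="quote-block">(?:(?!<div).)*?</div>\s*'
    r'</div>',
    re.DOTALL,  # let "." cross line breaks, like Perl's /s
)


def strip_quotes(html: str) -> str:
    """Remove quote containers from the inside out until none are left."""
    while True:
        html, count = INNERMOST_QUOTE.subn('', html)
        if count == 0:  # a pass that changes nothing means we are done
            return html


# Hypothetical input: one outer quote holding two nested quotes.
sample = (
    'before'
    '<div class="quote-container">\n'
    ' <div class="quote-block">outer\n'
    '  <div class="quote-container">\n'
    '   <div class="quote-block">inner A</div>\n'
    '  </div>\n'
    '  <div class="quote-container"><div class="quote-block">inner B</div></div>\n'
    ' </div>\n'
    '</div>'
    'after'
)
print(repr(strip_quotes(sample)))
```

Because each pass removes only complete innermost quotes, text outside the quote containers is never part of a match.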

Sean Johnson
  • Replacing the quotes until none are left eats the post messages as well, because they sit between the quotes. Besides, I need a regex, not code like that. Thanks anyway. – user1483658 Jun 26 '12 at 19:01

Regexes are a poor choice for manipulating nested structures. I would write a specific parser for this problem (a simple stack-based parser should suffice).
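A stack-based parser along these lines could look roughly like the following Python sketch, built on the standard library's `html.parser.HTMLParser`. The class name `QuoteStripper`, the function `strip_quotes`, and the sample input are hypothetical. The sketch assumes reasonably well-formed markup and simply drops comments and doctype declarations.

```python
from html.parser import HTMLParser


class QuoteStripper(HTMLParser):
    """Copies the input to self.out, skipping every quote-container subtree."""

    def __init__(self):
        super().__init__(convert_charrefs=False)
        self.out = []
        self.stack = []  # one entry per open <div>: True if it opens a quote

    def _in_quote(self):
        return any(self.stack)

    def handle_starttag(self, tag, attrs):
        if tag == 'div':
            classes = (dict(attrs).get('class') or '').split()
            self.stack.append('quote-container' in classes)
        if not self._in_quote():
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        was_in_quote = self._in_quote()  # check before popping the closing div
        if tag == 'div' and self.stack:
            self.stack.pop()
        if not was_in_quote:
            self.out.append(f'</{tag}>')

    def handle_startendtag(self, tag, attrs):
        if not self._in_quote():  # self-closing tags such as <br/>
            self.out.append(self.get_starttag_text())

    def handle_data(self, data):
        if not self._in_quote():
            self.out.append(data)

    def handle_entityref(self, name):
        if not self._in_quote():
            self.out.append(f'&{name};')

    def handle_charref(self, name):
        if not self._in_quote():
            self.out.append(f'&#{name};')


def strip_quotes(html: str) -> str:
    parser = QuoteStripper()
    parser.feed(html)
    parser.close()
    return ''.join(parser.out)


# Hypothetical input: a nested quote followed by the actual post.
sample = (
    '<p>hello</p>'
    '<div class="quote-container"><div class="quote-block">q1'
    '<div class="quote-container"><div class="quote-block">q2</div></div>'
    '</div></div>'
    '<div class="post">reply &amp; more</div>'
)
print(strip_quotes(sample))
```

The stack records, for every open div, whether it started a quote, so the parser knows it is inside a quote at any nesting depth and does not have to guess which closing tag belongs to which opening tag.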

Soronthar