I have something like this: ...

<div class="viewport viewport_h" style = "overflow: hidden;" >
    <div id="THIS" class="overview overview_h">
        <ul>
                 <li>some txt to be captured</li>
                 <li>some txt to be captured</li>
                 <li>some txt to be captured</li>
        </ul>
        <div>
            " some text to be captured"
        </div>
    </div>
</div>
"some text not to be captured"
</div>
<div class="scrollbar_h">
<div class="track_h"></div>

...

I want to capture everything inside the div with id=THIS. I'm using something like:

@<div class="viewport viewport_h" style = "overflow: hidden;" >\s*<div class="overview overview_h">\s*(?:<ul>)?([\s\d\w<>\/()="-:;‘’!,:]+)(?:</div>)+?@

The final (?:</div>)+? is meant to be non-greedy so that it stops before any further "</div>", but that doesn't work: the match captures all the following </div> tags as well. :(

UtkarshPramodGupta

1 Answer

As said in the comments, regex is not a proper tool for parsing (?:X|H)TML documents.

Taking your example, one straightforward attempt is the following regex (with the s flag, so that . also matches newlines):

<div[^>]*id="THIS"[^>]*>(.*?)</div>

After the opening tag, the match covers the following text:

    <ul>
             <li>some txt to be captured</li>
             <li>some txt to be captured</li>
             <li>some txt to be captured</li>
    </ul>
    <div>
        " some text to be captured"
    </div>

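As you can see, this is not the right result: the lazy .*? stops at the first </div>, which closes the nested div, so the </div> that actually closes the div with id="THIS" is left out. To find the real closing tag you have to count the opening divs and pair each one with a closing div, and how you do that depends on the language you are using.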
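To illustrate the counting idea, here is a minimal Python sketch. Python is only an assumption, since the question does not name a language, and the helper name inner_html_of_div and the SAMPLE string are made up for this example. It tracks lowercase <div> tags only and ignores comments, scripts and attribute values containing ">", so treat it as an illustration of depth counting rather than a robust solution.

```python
import re

# Any opening or closing <div> tag; group 1 is "/" for a closing tag.
# Assumption: tag names are lowercase, as in the question's markup.
DIV_TAG = re.compile(r"<(/?)div\b[^>]*>")


def inner_html_of_div(html, div_id):
    """Return the markup inside <div id="div_id" ...>, or None if not found."""
    opening = re.search(r'<div\b[^>]*\bid="' + re.escape(div_id) + r'"[^>]*>', html)
    if opening is None:
        return None
    depth = 1  # one open div so far: the target itself
    for tag in DIV_TAG.finditer(html, opening.end()):
        depth += -1 if tag.group(1) else 1
        if depth == 0:  # this </div> closes the target div
            return html[opening.end():tag.start()]
    return None  # ran out of input: unbalanced markup


# Hypothetical sample, trimmed from the markup in the question.
SAMPLE = """
<div class="viewport viewport_h" style="overflow: hidden;">
    <div id="THIS" class="overview overview_h">
        <ul>
            <li>some txt to be captured</li>
            <li>some txt to be captured</li>
        </ul>
        <div>
            " some text to be captured"
        </div>
    </div>
</div>
"some text not to be captured"
"""

result = inner_html_of_div(SAMPLE, "THIS")
print(result)
```

Here the nested <div> raises the depth to 2, its </div> brings it back to 1, and only the next </div> brings it to 0, so the returned markup includes the whole nested div.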

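If you want the match to run on to one more closing tag in this particular case, a lazy quantifier on the </div> group will not do it, because a lazy quantifier at the end of a pattern matches as little as possible. Instead, require the second </div> explicitly after the first one:

<div[^>]*id="THIS"[^>]*>(.*?</div>\s*)</div>

Now the group also captures the inner </div>, but only because this example has exactly one nested div and it is the last child. It is still hard for a regex to detect the true result in general (deeper or differently ordered nesting breaks it), and that is the reason the proper way to parse (?:X|H)TML is to use an (?:X|H)TML parser.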
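As a rough sketch of the parser route, the following uses Python's standard-library html.parser module. Python is an assumption here (the question does not name a language), and the class name DivTextCollector and the SAMPLE string are invented for illustration. It collects only the text nodes inside the target div, which may or may not be the exact output you need.

```python
from html.parser import HTMLParser


class DivTextCollector(HTMLParser):
    """Collect the text found inside the <div> that has a given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # 0 means we are outside the target div
        self.chunks = []  # stripped text nodes found inside the target

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == "div":
                self.depth += 1  # nested div inside the target
        elif tag == "div" and dict(attrs).get("id") == self.target_id:
            self.depth = 1       # entering the target div

    def handle_endtag(self, tag):
        if self.depth and tag == "div":
            self.depth -= 1      # reaches 0 when the target div closes

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append(text)


# Hypothetical sample, trimmed from the markup in the question.
SAMPLE = """
<div class="viewport viewport_h" style="overflow: hidden;">
    <div id="THIS" class="overview overview_h">
        <ul>
            <li>some txt to be captured</li>
            <li>some txt to be captured</li>
        </ul>
        <div>
            " some text to be captured"
        </div>
    </div>
</div>
"some text not to be captured"
"""

collector = DivTextCollector("THIS")
collector.feed(SAMPLE)
collector.close()
print(collector.chunks)
```

The parser does the tag bookkeeping, so the nesting depth of the div no longer matters to the caller.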

Mazdak