How to make a non-greedy regex for following?

Question

I have something like this: ...

<div class="viewport viewport_h" style = "overflow: hidden;" >
    <div id="THIS" class="overview overview_h">
        <ul>
                 <li>some txt to be captured</li>
                 <li>some txt to be captured</li>
                 <li>some txt to be captured</li>
        </ul>
        <div>
            " some text to be captured"
        </div>
    </div>
</div>
"some text not to be captured"
</div>
<div class="scrollbar_h">
<div class="track_h"></div>

...

I want to capture everything inside the div with id=THIS. I'm using something like:

@<div class="viewport viewport_h" style = "overflow: hidden;" >\s*<div class="overview overview_h">\s*(?:<ul>)?([\s\d\w<>\/()="-:;‘’!,:]+)(?:</div>)+?@

The last (?:</div>)+? is meant to make it non-greedy for further "</div>", but that doesn't work and it captures all the following </div>. :(

I don't think a regex is the right answer to it. You should try to look at some XML parser since XML is a context free language (http://en.wikipedia.org/wiki/Context-free_grammar). Maybe this post can help you : http://stackoverflow.com/questions/590747/using-regular-expressions-to-parse-html-why-not — Michel Antoine, Jun 03 '15 at 11:34
Is it PHP or JavaScript? Use DOM/XPATH to get some specific tags, that will be most precise, readable and maintainable. — Wiktor Stribiżew, Jun 03 '15 at 11:36

Mazdak · Accepted Answer · 2015-06-03T11:50:13.237

As said in the comments, regex is not a proper tool for parsing (?:X|H)TML documents.

Considering your example, one straightforward way is the following regex (with the "dot matches newline" flag enabled, since the content spans several lines):

<div[^>]*id="THIS"[^>]*>(.*?)</div>

That will match the following text:

    <ul>
             <li>some txt to be captured</li>
             <li>some txt to be captured</li>
             <li>some txt to be captured</li>
    </ul>
    <div>
        " some text to be captured"
    </div>

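As you can see, this is not the proper result, because you need one more </div>. The lazy (.*?) stops at the first </div> it finds, which closes the inner div, not the one with id="THIS". To get it right you would have to count the open divs in order to detect the matching closing div, and how you do that depends on the language you are using.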
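
To illustrate the counting idea, here is a minimal sketch in Python. The question does not say which language is in use, so Python is an assumption, and the names HTML, DIV_TAG and inner_html_of_div are made up for this example. It first shows that the lazy regex stops at the first </div>, then finds the opening tag and walks the following <div ...> and </div> tags while tracking the nesting depth. It ignores comments, scripts and attribute values that contain ">", so treat it as a demonstration rather than a robust solution.

```python
import re

# Trimmed version of the markup from the question.
HTML = '''<div class="viewport viewport_h" style = "overflow: hidden;" >
    <div id="THIS" class="overview overview_h">
        <ul>
            <li>some txt to be captured</li>
        </ul>
        <div>
            " some text to be captured"
        </div>
    </div>
</div>
"some text not to be captured"
'''

# The lazy regex from the answer: stops at the FIRST </div>, i.e. the inner one.
lazy = re.search(r'<div[^>]*id="THIS"[^>]*>(.*?)</div>', HTML, re.DOTALL).group(1)

# Matches either an opening <div ...> tag or a closing </div> tag.
DIV_TAG = re.compile(r'<div\b[^>]*>|</div\s*>')

def inner_html_of_div(html, div_id):
    """Return the text between <div id=div_id ...> and its matching </div>, or None."""
    opening = re.search(r'<div\b[^>]*\bid="' + re.escape(div_id) + r'"[^>]*>', html)
    if opening is None:
        return None
    depth = 1  # we are inside the target div
    for tag in DIV_TAG.finditer(html, opening.end()):
        if tag.group().startswith('</'):
            depth -= 1  # a div was closed
        else:
            depth += 1  # a nested div was opened
        if depth == 0:  # this closing tag belongs to the target div
            return html[opening.end():tag.start()]
    return None  # unbalanced markup: no matching </div> found

inner = inner_html_of_div(HTML, 'THIS')
```

Here lazy ends before the inner </div>, while inner runs up to the </div> that actually closes the div with id="THIS".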

Now, in this case, if you want a non-greedy match after the closing div, you need to put a dot before the +, like the following:

<div[^>]*id="THIS"[^>]*>(.*?)(</div>).+?

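Now the match extends past the first </div>, but because the trailing .+? is lazy it consumes as little as it can (a single character), so the regex still cannot reliably find the </div> that really closes the target div, and it gets more complicated in other situations. That is the reason the proper way to parse (?:X|H)TML is to use an (?:X|H)TML parser.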
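
As a rough illustration of the parser approach, here is a sketch using html.parser from Python's standard library. Python is an assumption (the question does not name a language), and DivTextCollector, HTML and collector are hypothetical names for this example. The parser reports start tags, end tags and text as events, so the code only has to track how deep it is inside the target div.

```python
from html.parser import HTMLParser

# Trimmed version of the markup from the question.
HTML = '''<div class="viewport viewport_h" style = "overflow: hidden;" >
    <div id="THIS" class="overview overview_h">
        <ul>
            <li>some txt to be captured</li>
        </ul>
        <div>
            " some text to be captured"
        </div>
    </div>
</div>
"some text not to be captured"
'''

class DivTextCollector(HTMLParser):
    """Collects the text found inside the div with the given id."""

    def __init__(self, target_id):
        super().__init__()
        self.target_id = target_id
        self.depth = 0    # 0 means we are outside the target div
        self.chunks = []  # stripped, non-empty text pieces found inside it

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == 'div':
                self.depth += 1  # nested div inside the target
        elif tag == 'div' and dict(attrs).get('id') == self.target_id:
            self.depth = 1       # entered the target div

    def handle_endtag(self, tag):
        if self.depth and tag == 'div':
            self.depth -= 1      # reaches 0 when the target div closes

    def handle_data(self, data):
        text = data.strip()
        if self.depth and text:
            self.chunks.append(text)

collector = DivTextCollector('THIS')
collector.feed(HTML)
collector.close()
```

After this runs, collector.chunks holds the text pieces from inside the div with id="THIS" and nothing from outside it. In PHP or JavaScript the same idea is usually expressed through the DOM or XPath, as suggested in the comments.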
