I have a nested ul/li list in my HTML. How can I write a regex that matches from each top-level <ul> to its closing </ul>, including any nested lists? In the example below I need to get 2 matches.

First one should be

<ul>
    <li>This is First List</li>
    <li>This is Second List</li>
    <ul>
        <li>This is Second UL First List </li>
        <li>This is Second UL Second List </li>
    </ul>
    <li>This is Third List</li> 
</ul>

and the second one should be

<ul>
        <li>This is Next List</li>
        <ul>
            <li>This is Test </li>
        </ul>
        <li>This is Third List</li> 
        <ul>
            <li>This is Test </li>
        </ul>
 </ul>

My HTML code:

<html>
<p> This is First Paragraph </p>
<ul>
    <li>This is First List</li>
    <li>This is Second List</li>
    <ul>
        <li>This is Second UL First List </li>
        <li>This is Second UL Second List </li>
    </ul>
    <li>This is Third List</li> 
</ul>
<p> This is Second Paragraph </p>   

<ul>
    <li>This is Next List</li>
    <ul>
        <li>This is Test </li>
    </ul>
    <li>This is Third List</li> 
    <ul>
        <li>This is Test </li>
    </ul>
</ul>
</html>
vamsivanka

1 Answer

You can match nested constructs with .NET balancing groups. This feature adds a stack to the regex engine: (?<NestedUL>...) pushes a capture onto the stack and (?<-NestedUL>...) pops one off. The conditional at the end of the pattern, (?(NestedUL)(?!)), then tests the stack: if a capture is still on it, the conditional evaluates the empty negative lookahead (?!), which always fails, so an unbalanced match is rejected:

using System.Text.RegularExpressions;

var input =
    @"<html>
    <p> This is First Paragraph </p>
    <ul>
        <li>This is First List</li>
        <li>This is Second List</li>
        <ul>
            <li>nested list #1 inside first parent UL</li>
            <li>This is Second UL Second List </li>
        </ul>
        <li>This is Third List</li> 
    </ul>
    <p> This is Second Paragraph </p>   

    <ul>
        <li>This is Next List</li>
        <ul>
            <li>nested list #1 inside second parent UL</li>
        </ul>
        <li>This is Third List</li> 
        <ul>
            <li>nested list #2 inside second parent UL</li>
        </ul>
    </ul>
    </html>";
// Singleline lets . match newlines, so one match can span several lines.
var pattern = "<ul>(?:(?<NestedUL><ul>)|(?<-NestedUL></ul>)|.)+?(?(NestedUL)(?!))</ul>";
var matches = Regex.Matches(input, pattern, RegexOptions.Singleline);
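
For readability, the same pattern can be spread over several lines with RegexOptions.IgnorePatternWhitespace, which ignores whitespace in the pattern and treats # as the start of a comment. This is only a sketch: the shortened html string is a hypothetical stand-in for the HTML in the question, and the loop just prints where each match starts.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical, shortened stand-in for the HTML in the question.
var html = "<p>A</p><ul><li>1</li><ul><li>1.1</li></ul><li>2</li></ul>"
         + "<p>B</p><ul><li>3</li><ul><li>3.1</li></ul><ul><li>3.2</li></ul></ul>";

// Same pattern as in the answer, with one comment per part.
var pattern = @"
    <ul>                        # opening tag of a top-level list
    (?:
        (?<NestedUL> <ul> )     # nested opening tag: push onto the NestedUL stack
      | (?<-NestedUL> </ul> )   # nested closing tag: pop from the stack
      | .                       # any other character
    )+?                         # lazy: stop at the first balanced closing tag
    (?(NestedUL)(?!))           # fail if a nested list is still open
    </ul>                       # closing tag of the top-level list
";

var options = RegexOptions.Singleline | RegexOptions.IgnorePatternWhitespace;
var matches = Regex.Matches(html, pattern, options);

Console.WriteLine($"matches: {matches.Count}");
foreach (Match m in matches)
{
    // Index is the offset of the top-level <ul>; Value is the whole list.
    Console.WriteLine($"{m.Index}: {m.Value}");
}
```

With this sample the expectation is two matches, one per top-level list, each including its nested lists.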

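Note the lazy (non-greedy) quantifier +? on the repeated alternation. If it were greedy, the . alternative could swallow the closing tag of the first list and the opening tag of the second, and the pattern would consume both top-level ul elements in a single match.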
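
The difference between the two quantifiers can be checked with a small comparison. This is a sketch under assumptions: the html string is a made-up sample with two top-level lists, and the names body and tail are just local helpers for assembling the two variants of the pattern.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical sample: two top-level lists, the first with a nested list.
var html = "<ul><li>a</li><ul><li>b</li></ul></ul><p>x</p><ul><li>c</li></ul>";

// The repeated alternation and the closing check, shared by both variants.
var body = "(?:(?<NestedUL><ul>)|(?<-NestedUL></ul>)|.)";
var tail = "(?(NestedUL)(?!))</ul>";

// Lazy variant (+?), as used in the answer.
var lazy = Regex.Matches(html, "<ul>" + body + "+?" + tail, RegexOptions.Singleline);

// Greedy variant (+), for comparison only.
var greedy = Regex.Matches(html, "<ul>" + body + "+" + tail, RegexOptions.Singleline);

Console.WriteLine($"lazy:   {lazy.Count} match(es)");
Console.WriteLine($"greedy: {greedy.Count} match(es)");
Console.WriteLine($"greedy spans whole input: {greedy[0].Value == html}");
```

On this sample the lazy variant should find each top-level list separately, while the greedy variant should return one match that runs from the first <ul> to the last </ul>.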

Scott Weaver